
Tree Tensor Networks, Associated Singular Values and

High-Dimensional Approximation

Von der Fakultät für Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften

genehmigte Dissertation

vorgelegt von

Sebastian Krämer, M.Sc.

aus

Freyung

Berichter: Univ.-Prof. Dr. rer. nat. Lars Grasedyck
Univ.-Prof. Dr. rer. nat. Reinhold Schneider
Univ.-Prof. Dr. rer. nat. Markus Bachmayr

Tag der mündlichen Prüfung: 27.04.2020

Diese Dissertation ist auf den Internetseiten der Universitätsbibliothek verfügbar.


Abstract

In this thesis, we develop an algebraic and graph theoretical reinterpretation of tensor networks and formats. We investigate properties of the associated singular values and demonstrate their importance for high-dimensional approximation, in particular for model complexity adaption. This leads us to a concept of stability for iterative optimization methods, which we discuss at length for discrete matrix and tensor completion. We further generalize these ideas to the approximate interpolation of scattered data, and demonstrate the potential of the introduced algorithms on a data set that describes rolling press experiments. These largely algorithmic considerations are supplemented and supported by a theoretical examination of the interrelation between tensor singular values and its relation to the quantum marginal problem.

Tensor networks are essentially multilinear maps which reflect the connections between collections of tensors. In the first part, we discuss how two familiar concepts in mathematics yield an arithmetic that naturally describes such networks, and which formalizes the underlying, simple graph structures through universal assertions. The practicability of this calculus is reflected in the straightforward implementations, which we also provide for well known algorithms. A central theorem of this thesis is the generalizing tree singular value decomposition, which, while not novel in its basic idea, incorporates various gauge conditions that stem from different, corresponding tensor formats.

In the second part, we discuss details of high-dimensional, alternating least squares optimization in tree tensor networks, which are those families of tensors that form tree graphs. Due to the special properties of this class of formats, even high-dimensional problems can be handled effectively, in particular when the occurring linear subproblems are solved via a conjugate gradient method. Subsequent to this introductory segment, we investigate the meaning of singular values in this context. As the model complexity is determined by the tensor ranks of the iterate, their proper calibration becomes essential in order to obtain reasonable solutions to recovery problems. Based on a specific definition of stability, we introduce and discuss modifications to standard alternating least squares as well as the relation to reweighted ℓ1-minimization. We demonstrate in particular the use of these concepts for rank-adaptive algorithms. These are further generalized from the discrete to the continuous setting, which we apply to the approximate interpolation of rolling press simulations.

As the singular values associated to tensor networks stem from different matricizations of the same tensor, the question of the interrelation between them arises. In the third part, we first show that the tensor feasibility problem is equivalent to a version of the quantum marginal problem. While the latter has been well known in physics for multiple decades, the tensor version originating from mathematics has only recently been considered. We transfer several results into our setting and subsequently utilize the tree singular value decomposition in order to decouple high-dimensional feasibility problems into much simpler, smaller ones. Last but not least, we specifically consider this situation for the tensor train format, which leads us to cone theory, so-called honeycombs and the application of linear programming algorithms.


Zusammenfassung

In dieser Arbeit entwickeln wir eine algebraische und graphentheoretische Neuinterpretation von Tensor-Netzwerken und Formaten. Wir untersuchen die Eigenschaften der zugehörigen Singulärwerte und deren Bedeutung für hochdimensionale Approximation, insbesondere hinsichtlich der Anpassung der Modellkomplexität. Dies führt uns zu einem Konzept der Stabilität für iterative Optimierungsverfahren, welches wir ausführlich anhand von diskreter Matrix- und Tensorvervollständigung diskutieren. Ferner verallgemeinern wir diese Ideen bis hin zur approximativen Lösung von nicht gleichmäßig strukturierten Datensätzen und demonstrieren das Potential der vorgestellten Algorithmen anhand eines Datensatzes, der Walzvorgänge beschreibt. Diese weitgehend algorithmischen Überlegungen werden ergänzt und unterstützt durch die theoretische Untersuchung des Zusammenhangs zwischen Tensor-Singulärwerten und dem so genannten quantum marginal problem.

Tensor-Netzwerke sind im Wesentlichen multilineare Abbildungen, die die Zusammenhänge innerhalb von Mengen von Tensoren darstellen. Im ersten Teil diskutieren wir, wie zwei grundlegende und vertraute mathematische Konzepte eine Arithmetik ergeben, die solche Netzwerke auf natürliche Art beschreibt, und welche die zugrunde liegenden, einfachen graphentheoretischen Strukturen durch universelle Aussagen formalisiert. Die Praktikabilität dieser Konzepte spiegelt sich in den einfachen Implementierungen wider, die auch für bekannte Algorithmen behandelt werden. Als zentrales Theorem dieser Arbeit dient die verallgemeinernde Baum-Singulärwertzerlegung, die, obwohl nicht neu in ihrer Grundidee, verschiedene Normalisierungsbedingungen von bestimmten Tensor-Formaten vereint.

Im zweiten Teil werden Details der hochdimensionalen, alternierenden Optimierung der kleinsten Quadrate in Baum-Tensor-Netzwerken diskutiert, die gerade solche Familien von Tensoren sind, welche kreisfreie Graphen bilden. Aufgrund der besonderen Eigenschaften dieser Klasse von Formaten können auch hochdimensionale Probleme effektiv gelöst werden, insbesondere wenn die auftretenden, linearen Teilprobleme mit einem Verfahren der konjugierten Gradienten gelöst werden. Im Anschluss an diesen einführenden Abschnitt untersuchen wir die Bedeutung der Singulärwerte in diesem Kontext. Da die Modellkomplexität durch die Tensorränge bestimmt wird, wird deren richtige Kalibrierung unerlässlich, um korrekte Lösungen für Rekonstruktionsprobleme zu erhalten. Basierend auf einer bestimmten Definition der Stabilität führen wir Modifikationen für die gewöhnliche, alternierende Optimierung der kleinsten Quadrate ein und diskutieren diese sowie deren Beziehung zur ℓ1-Minimierung. Wir demonstrieren insbesondere den Nutzen dieser Konzepte für rangadaptive Algorithmen. Jene werden weiter vom diskreten zum kontinuierlichen Fall verallgemeinert, welchen wir auf die approximative Interpolation von simulierten Walzvorgängen anwenden.

Da die Singulärwerte, die aus den Tensor-Netzwerken hervorgehen, von unterschiedlichen Matrifizierungen desselben Tensors stammen, stellt sich die Frage nach dem Zusammenhang zwischen diesen. Im dritten Teil zeigen wir zunächst, dass das Problem der Realisierbarkeit von Singulärwerten einer Version des quantum marginal problem entspricht. Während letzteres in der Physik seit mehreren Jahrzehnten bekannt ist, wird die Variante, die aus der Mathematik hervorgeht, erst seit recht kurzer Zeit untersucht. Wir übertragen verschiedene Ergebnisse in unseren Sachverhalt und nutzen die Baum-Singulärwertzerlegung, um dazugehörige, hochdimensionale Probleme in deutlich einfachere und kleinere zu entkoppeln. Schlussendlich betrachten wir insbesondere die Situation für das tensor train Format, was uns zur Theorie über Kegel, sogenannten honeycombs sowie der Anwendung linearer Programmierung führt.


Acknowledgements

First of all, I would like to thank my advisor Lars Grasedyck for the honest, friendly and encouraging relationship between us, for the freedom he gave me and for the advice for which he was always available. I have deeply enjoyed the joint events.

I have always felt invited during conferences and workshops. Thank you Markus Bachmayr, Sergey Dolgov, Mike Espig, Jochen Garcke, Wolfgang Hackbusch, Daniel Kressner, Lieven De Lathauwer, Anthony Nouy, Ivan Oseledets, Reinhold Schneider, André Uschmajew, Bart Vandereycken and Nick Vannieuwenhoven for the interesting conversations, the invitations and encouragement to present my research, and the joyful dialogs and dinners.

I thank my countless colleagues and friends whom I found through research, for the exchange of knowledge, the joking and the beers.

I would like to thank in particular everyone in my institute for the daily routines, the constant helpfulness and the social events. Thank you Dima Moser and Tim Werthmann for the feedback on my thesis, which greatly improved its readability. Thank you Julia Schmitt-Holtermann for, in addition to the organization and management, the joint production of an estimated amount of over 2000 cups of cappuccino.

Thank you deeply to all my friends for the trust as well as the advice, the weekends, especially the Sundays, and all the other time we spent together beyond my doctoral studies.

Thank you to my parents and my brother for their unconditional support.

And, maybe most of all, thank you to my late grandparents, who taught me at a very young age that mathematical intuition helps you to win board games. Thank you for the pancakes, pork knuckles and Baumkuchen.


Contents

Abbreviations and Primary Use of Symbols

1 Introduction
   1.1 A Less Formal Preface
   1.2 Tensors as Multivariate Functions
   1.3 Overview and Statements of Contribution

I Tensor Networks and the Tree SVD

2 Calculus of Tensor Nodes and Networks
   2.1 Introduction
      2.1.1 Motivation of the Tensor Node Arithmetic
      2.1.2 A Brief Impression
      2.1.3 Implementation of the Tensor Node Arithmetic
   2.2 Elementary Tensor Calculus on Real Hilbert Spaces
      2.2.1 The Algebraic and Topological Tensor Product
      2.2.2 (Infinite) Singular Value Decomposition and Hilbert-Schmidt Operators
   2.3 Tensor Node Arithmetic
      2.3.1 Three Fundamental Products and their Graphical Interpretation
      2.3.2 Matrix Spaces and their Graphical Interpretation
      2.3.3 Multisets and Singleton Domain Functions
      2.3.4 Singleton Node Arithmetic
      2.3.5 Label Set Node Arithmetic
      2.3.6 Extension of (Multi-)Linear Operators to Tensor Nodes
   2.4 Tensor Node Arithmetic for Function Spaces
      2.4.1 Isometries to Finite-Dimensional Hilbert Spaces
      2.4.2 Indexing and Restrictions
      2.4.3 Unfoldings
      2.4.4 Node Multiplication as Summation and Integration
      2.4.5 Partial Summation, Trace and Diagonal Operations
   2.5 Common Tensor Formats and Decompositions
      2.5.1 Tensor Train / Matrix Product States
      2.5.2 Tucker / Higher-Order Singular Value Decomposition
      2.5.3 Hierarchical Tucker
      2.5.4 Canonical Polyadic Decomposition
      2.5.5 Cyclic TT / MPS Format
      2.5.6 Projected Entangled Pair States
   2.6 Tensor Node Ranks and Decompositions
      2.6.1 Ranks of a Node


      2.6.2 Orthogonality of Nodes
      2.6.3 Node SVD and Node QR-Decomposition
   2.7 Nested Products
      2.7.1 Mode Label Renaming
      2.7.2 Flattening
   2.8 Operator Nodes and Complex Hilbert Spaces
      2.8.1 Linear and Bilinear Functions as Nodes
      2.8.2 Complex Hilbert Spaces

3 Tree Tensor Networks
   3.1 Graphs and Networks
      3.1.1 Corresponding Graph
      3.1.2 Definition of Tree Tensor Networks
   3.2 Orthogonality in Networks
      3.2.1 Transitivity Properties
      3.2.2 Orthogonalization of a Tree Tensor Network
   3.3 Tree Tensor Network Decompositions
      3.3.1 One-to-One Correspondence of Edges and Subsets of Legs in Trees
      3.3.2 Tree Decomposition and Tree SVD
      3.3.3 Minimal, Nested Subspaces
   3.4 Tree SVDs for TT and Tucker
      3.4.1 TT-Tree SVD / Canonical MPS
      3.4.2 Tucker Tree SVD / All-Orthogonality
   3.5 Subspace Projections and Truncations
      3.5.1 Nested Projections
      3.5.2 Root-to-Leaves Truncation and Decomposition
   3.6 Operations on Tree Tensor Networks
      3.6.1 Normalization of a Tree Tensor Network
      3.6.2 Truncation of Normalized Networks

II High-Dimensional Approximation

4 Low-Rank Tensor Minimization Problems and Alternating CG
   4.1 Preface
   4.2 Overview over Literature
      4.2.1 Large-Scale Problems
      4.2.2 Tensor Recovery and Completion
   4.3 ALS for Linear Equations under Low-Rank Constraints
      4.3.1 Arbitrary Tree Tensor Networks
      4.3.2 Branch-Wise Evaluations for Equally Structured Networks
   4.4 Alternating CG
      4.4.1 Comparison of Computational Costs
      4.4.2 Preconditioning
   4.5 ALS for Tensor Recovery and Completion
      4.5.1 Branch-Wise Evaluations for Equally Structured Networks
      4.5.2 Tensor Completion
   4.6 The Tensor Restricted Isometry Property
      4.6.1 Relation to Alternating CG
      4.6.2 The Internal Tensor Restricted Isometry Property


5 Stable ALS for Rank-Adaptive Tensor Approximation and Recovery
   5.1 Stability for Complexity Calibration in Iterative Optimization Methods
      5.1.1 Introduction through Matrix Completion
      5.1.2 The Importance of Stability for Iterative Fixed-Rank Methods
      5.1.3 Overview over Rank Adaption Strategies in Literature
   5.2 Stable Alternating Least Squares Micro-Steps for Matrix Completion
      5.2.1 Stability through Convolution
      5.2.2 Exemplary Considerations regarding Stability
      5.2.3 Variational Residual Function for Matrix Completion
      5.2.4 Minimizer of the Variational Residual Function for Matrices
   5.3 Stability and Rank Adaption
      5.3.1 Stability of Regularized Micro-Steps
      5.3.2 Algorithmic Aspects and Rank Adaption
   5.4 The Close Connection between Stabilization and Reweighted ℓ1-Minimization
      5.4.1 Reweighted ℓ1-Minimization
      5.4.2 SALSA as Scaled Alternating Reweighted Least Squares
      5.4.3 Fixed Points of Idealized Stable Alternating Least Squares for Matrices
   5.5 Stable Alternating Least Squares Tensor Recovery and Completion
      5.5.1 Restriction to Neighboring Singular Values and the Tensor Variational Residual Function
      5.5.2 About the Minimizer of the Tensor Variational Residual Function
      5.5.3 Simplifications for Tensor Completion
      5.5.4 Fixed Points of Idealized Stable Alternating Least Squares for Tensors
   5.6 The Stable ALS Method for Tree Tensor Networks
      5.6.1 Preconditioned, Coarse, Alternating CG
      5.6.2 Semi-Implicit Rank Adaption and Practical Aspects
      5.6.3 SALSA Sweep and Algorithm
   5.7 Numerical Experiments
      5.7.1 Data Acquisition and Implementational Details
      5.7.2 Numerical Results

6 Approximate Interpolation of High-Dimensional, Scattered Data
   6.1 Thin-Plate Splines
      6.1.1 Approximation under Low-Rank Constraints
      6.1.2 Decomposition of the Thin-Plate Regularizer
   6.2 Discretization and Alternating Least Squares
      6.2.1 Monovariate Kolmogorov Subspaces
      6.2.2 Discretized Operators and Problem Setting
      6.2.3 Rank Adaption and Practical Aspects
   6.3 Demonstration via Rolling Press Data
      6.3.1 Implementational Details and Preprocessing
      6.3.2 Numerical Results
   6.4 Comparison with Discrete Tensor Completion

III Feasibility of Tensor Singular Values

7 The Quantum Marginal and Tensor Feasibility Problem
   7.1 The Tensor Feasibility Problem (TFP)
      7.1.1 Introduction regarding Tree Tensor Formats and Quantum Physics
      7.1.2 Formal Definition of the Tensor Feasibility Problem
   7.2 The Quantum Marginal Problem (QMP)


      7.2.1 The Pure QMP
      7.2.2 Results for the Quantum Marginal Problem
   7.3 Independent Results for the Tensor Feasibility Problem
   7.4 Feasibility in Tree Tensor Networks
      7.4.1 Decoupling through the Tree SVD
      7.4.2 Iterative Algorithms to Construct Tensors with Prescribed Singular Values
      7.4.3 Feasibility of Largest Tucker Singular Values
      7.4.4 Direct Construction of Tensors Realizing Prescribed, Largest Tucker Singular Values

8 Honeycombs and Feasibility of Singular Values in the TT Format
   8.1 Decoupling for the Tensor Train Format
   8.2 Feasibility of Pairs
      8.2.1 Constructive, Diagonal Feasibility
      8.2.2 Weyl's Problem and the Horn Conjecture
   8.3 Honeycombs and Hives
      8.3.1 Honeycombs and Eigenvalues of Sums of Hermitian Matrices
      8.3.2 Hives and Feasibility of Pairs
      8.3.3 Hives are Polyhedral Cones
   8.4 Cones of Squared Feasible Values
      8.4.1 Necessary Inequalities
      8.4.2 Rates of Exponential Decay
      8.4.3 Vertex Description of F^2_{m,(m,m^2)}
      8.4.4 A Conjecture about F^2_{m,(m^2+m−2,m^2+m−2)}
   8.5 Practical Considerations
      8.5.1 Linear Programming Algorithm for Feasibility Based on Hives
      8.5.2 Recapitulation of Feasibility Methods

Conclusions

References


Abbreviations and Primary Use of Symbols

Algorithms

altproj  T = altproj(σ, m): iteratively constructs a tensor T with approximate singular values sv_j(T) = σ^(j), j = 1, ..., k, Alg. 14
coltsv  R_{σ,θ} = coltsv(σ^(1)_1, ..., σ^(d)_d, n): directly constructs the HT tree SVD R_{σ,θ} of a tensor with largest Tucker singular values σ^(1)_1, ..., σ^(d)_d and common mode sizes n, Alg. 15
feaslpc  [m, H] = feaslpc(γ, θ): returns the minimal number m ∈ N for which (γ, θ) is feasible and a corresponding (r, 2(m−1))-hive H, Alg. 16
hrtlt  N = hrtlt(T, J): root-to-leaves decomposition (truncation) of T into a tree tensor network N corresponding to the hierarchical family K = {J_i}_{i=1}^k, Alg. 4
linmin  N = linmin(A, N, T, c): vanilla ALS for ‖A N − T‖ → min, Alg. 8
ltrdec  [N, N_σ] = ltrdec(T, G, c): leaves-to-root decomposition of T into a tree tensor network N as well as the tree SVD N_σ corresponding to the tree G, Alg. 3
matrixsalsa  [X, Y] = matrixsalsa(P, M|_P): rank-adaptive Salsa for matrices for approximate recovery of M, Alg. 10
msalsasweep  [X, Y] = msalsasweep(X, Y, σ_min, ω): Salsa matrix sweep, Alg. 9
normal  [N, N_σ] = normal(N, c): returns the c-orthogonalized tree tensor network N and its tree SVD N_σ, operating only within the network, Alg. 7
ortho  N = ortho(N, c): c-orthogonalization of a tree tensor network N, Alg. 1
pathqr  N = pathqr(N, c_1, c_2): c_2-orthogonalization of a c_1-orthogonal tree tensor network N, Alg. 2
rtlgraph  [G, m, b] = rtlgraph(J): returns the graph G = (V, E, L) corresponding to the hierarchical family K = {J_i}_{i=1}^k with edge label map m and sets b, Alg. 5
rtltrunc  N = rtltrunc(T, G, c): root-to-leaves decomposition (truncation) of T into a tree tensor network N corresponding to the tree G, for a root c, Alg. 5
tensorsalsa  N = tensorsalsa(K, L, y): rank-adaptive Salsa for tensors for approximate recovery of M, y = LM, Alg. 13
tirls  A = tirls(K, y, L, A): tensor iterative reweighted least squares (non-alternating) for approximate recovery of M, y = LM, Alg. 11
tsalsasweep  [N, B] = tsalsasweep(N, L, B, c, o, σ_min, ω): Salsa tensor sweep, Alg. 12

Greek Letters

α  (outer) mode label set α = {α_1, ..., α_d}, Eq. (2.23) and Rem. 2.11 and 2.18
α_J  outer mode label set α_J = {α_j}_{j∈J}, Sec. 3.3.1
α^(h)  outer mode label set α^(h) = {α^(h)_1, ..., α^(h)_d}, d(α^(h)) = h_1 · ... · h_d, of the discretized version of the Hilbert space H_α, Sec. 6.2.2
β  (inner) mode label, Def. 2.31
β  (inner) mode label set, usually β = {β_1, ..., β_|E|}, Sec. 2.5 and 3.3.1
γ, δ, ...  mode labels, possibly elements of α, β


γ, δ, ...  mode label (multi)sets, possibly subsets of α, β
δ(·)  boundary map of honeycombs, Eq. (8.15)
δ_P  boundary map of a hive H ∈ HIVE_{n,M}, Def. 8.20
κ_2  condition number with respect to the Euclidean norm, Eq. (4.36)
(λ, µ, ν)  in Chap. 8: a triplet of eigenvalues of matrices (A, B, C), Def. 8.11
σ, (γ, θ)  singular value tuple; in Chap. 8: a pair of such
σ_e, σ^(J)  singular values σ_e = σ^(J) associated to the edge e = e_J = {v, w} ∈ E within a tree tensor network, obtained through the map sv_{α_J} = sv_{b_v(w)}, Sec. 3.3.1
Σ, (Γ, Θ)  diagonal matrix of singular values, e.g. Σ = diag(σ), Def. 2.36
τ  contraction or representation map with domain D, possibly specified through a label function m, a network N, or some rank r, Rem. 3.3 and Eq. (3.10)
φ  (different) isomorphisms / isometries, Eq. (2.6) and Sec. 2.4.1 and 6.2.2
ω  regularization parameter for Salsa, Chap. 5
Ω_γ  domain of the Hilbert space of functions H_γ ⊂ R^{Ω_γ}, Sec. 2.4
Ω_γ  domain of the tensor product Hilbert space of multivariate functions H_γ ⊂ R^{Ω_γ}, Sec. 2.4

Latin Letters

(b, B, B)  bilinear mapping b inducing an operator B on a larger tensor space; if B allows it (and is low-rank), represented by a tensor node (network) B, Sec. 2.8.1 and 6.2.2
b_v(w)  the outer mode labels assigned to legs within branch_v(w), Sec. 3.3.1
B^(·)_{v,w}  product of (specified) nodes within branch_v(w) of a (specified) network, Sec. 4.3.2 and 4.5.1
BDRY_n  in Chap. 8: boundary set of feasible triplets given by {δ(h) | h ∈ HONEY_n}, Eq. (8.17)
bil_0(V, W; U)  set of continuous, bilinear forms ϕ : V × W → U, with U = R by default
branch_c(v)  branch of v relative to a root node c (v and its accumulated descendants), Def. 3.6
D_r  data space in which representations of rank r are contained and domain of τ_r, Rem. 3.3 and Def. 5.1
d(γ)  dimension of the Hilbert space H_γ, Not. 2.14 and Sec. 2.4.1
d  dimension/order d ∈ N of tensors or multivariate functions
D  set of indices D = {1, ..., d}
diag_γ  partial diagonal-of-matrix to vector operation w.r.t. γ, Rem. 2.30
diag_γ  partial vector to diagonal-matrix operation w.r.t. γ, Rem. 2.30
D^∞_{≥0}  cone of weakly decreasing, nonnegative sequences with finitely many nonzero entries, Def. 7.1
e, e_J  edge e ∈ E; e_J corresponds to the subset J in a tree tensor network, Sec. 3.3.1
E  set of nonsingleton edges in a graph; elements e ∈ E are involved in joint contractions, Def. 3.2
e_i  the unit vector e_i ∈ R^k (for a k given by context) with one nonzero entry 1 at position i
f, g, h, ...  real valued functions; in the introductory chapters with codomain N_0, i.e. multisets, Sec. 2.3.3
F^2_{m,(r_1,r_2)}  in Chap. 8: set of squared, m-feasible pairs (γ, θ) ∈ D^{r_1}_{≥0} × D^{r_2}_{≥0}, Def. 8.9
G = (V, E, L)  graph G = (V, E) with legs L corresponding to a tensor network, Def. 3.2
G = (V, E, L)  graph corresponding to a tree SVD, Thm. 3.16
H_f, H_γ  tensor product Hilbert space for f : α → N_0, H_f := H_{γ_1}^{⊗f(γ_1)} ⊗ ... ⊗ H_{γ_d}^{⊗f(γ_d)}, or directly for γ = f, Eq. (2.22), Rem. 2.18, and Sec. 2.4.1
H  in Chap. 8: a collection of M ∈ N single n-honeycombs, denoted as a hive H ∈ HIVE_{n,M}, Sec. 8.3.2
HIVE_{n,M}  in Chap. 8: set of all (n, M)-hives with a structure ∼_S given by context, Def. 8.19
hom_0(V, W)  set of continuous, linear forms ϕ : V → W
HONEY_n  in Chap. 8: set of all n-honeycombs, Sec. 8.3.1


i, j  indices i ∈ R, multi-indices i ∈ R^d
I_n, I_γ  identity matrix I_n ∈ R^{n×n} or I_γ = I_γ(γ, γ) ∈ R^{d(γ)×d(γ)}
J, J_e  subset of a (hierarchical) family K; J_e if corresponding to the edge e in a tree tensor network, Sec. 3.3.1
K  (hierarchical) family of subsets in I = {1, ..., d − 1} or D = {1, ..., d}, Def. 3.14 and Sec. 7.1.2
K  univariate, nested Kolmogorov subspaces, Def. 6.2
(ℓ, L, L)  linear mapping ℓ inducing an operator L on a larger tensor space; if L is Hilbert-Schmidt (and low-rank), represented by a tensor node (network) L, Sec. 2.8.1 and 4.3
L  set of singleton edges (legs); associated to outer mode labels, Def. 3.2
L^2  subspace L^2(S) ⊂ R^S of square-integrable functions
m(N)  mode labels of a tensor node, N = (v, m(N)), Not. 2.14
m(e)  mode labels assigned to an edge or leg, Def. 3.4
[m]^{(j,ℓ)}_P  specific pairwise disjoint subsets of the sampling set P, Eq. (4.32)
M^(v)  micro-step M^(v) = M^(v)_r in an ALS algorithm that updates the node with index v ∈ V, Eq. (4.11)
n  mode sizes; in particular n = (n_1, ..., n_d), d(α) = n_1 · ... · n_d, d(α_µ) = n_µ, µ = 1, ..., d
N^(h)  node within the discretized space, Sec. 6.2.2
n_J  accumulated mode size n_J = ∏_{j∈J} n_j, Sec. 3.3
N, M, T  (tensor) nodes, Def. 2.12
N, M, T  tensor node networks N = {N_v}_{v∈V} and their contracted products N, Not. 3.1
N  nodes within a tree SVD, Thm. 3.16
N_γ  space of tensor nodes with mode labels in γ, Eq. (2.23)
node^−_c(v)  set of predecessors (parents) of v relative to a root c, Def. 3.6
node^+_c(v)  set of descendants (children) of v relative to a root c, Def. 3.6
N_σ  the tree SVD network, Thm. 3.16
P, P_val, P_test  training/sampling set P ⊂ Ω_α, validation set P_val (used indirectly) and test set P_test (treated as unknown), Def. 5.18 and Sec. 5.7.1
P_k  subsets of cardinality k, P_k(S) := {S' ⊂ S | |S'| = k}
r_e, r^(J), r_i  rank r_e = r^(J) associated to the edge e = e_J = {v, w} ∈ E within a tree tensor network, equals d(m(v, w)) and is obtained through the map rank_{b_v(w)}; or rank r_i as dimension d(β_i) of H_{β_i}, Eq. (3.8), Sec. 3.5.1, and Thm. 3.16
rank_γ(N)  rank of a tensor node w.r.t. γ, equals rank(N^{(γ),(m(N)\γ)}), Sec. 2.6.1 and Def. 2.31
R_TP  thin-plate regularization term, Eq. (6.1)
sum_γ  partial vector to summed-entries operation w.r.t. γ, Rem. 2.30
sv_J, sv_γ  singular values of the matricization w.r.t. J or, equivalently, w.r.t. γ = α_J, Def. 2.36
T_{r,K}(H_α)  set of tensors T ∈ H_α with rank(s) rank_{α_J}(T) = r^(J), J ∈ K; K and H_α may be omitted if the context allows, Def. 3.15
trace_J, trace_γ  partial trace with respect to the set J or labels γ = α_J; equals sum_γ diag_γ, Rem. 2.30
u, v, w  either: nodes v ∈ V; or: vectors in (tensor product) Hilbert spaces that may be composed of smaller ones, v = ⊗_j v^(j) = ⊗_{i,j} v^(j)_i, where the v^(j)_i for fixed j belong to the same space
U, σ, V^T  components of an SVD A = U Σ V^T, Σ = diag(σ)
vec(·)  columnwise vectorization of a matrix A ∈ R^{n×m}, vec(A) ∈ R^{nm}, such that for example vec(a11, a12; a21, a22) = (a11, a21, a12, a22)^T
V  set of nodes in a graph, or network, Def. 3.2
V^(outer)  (outer) nodes V^(outer) = {v^(outer)_1, ..., v^(outer)_d} of a network contained in legs, α_j = m(v^(outer)_j), Def. 4.1
W^{2,2}  Sobolev space W^{2,2}(S) ⊂ L^2(S) of square-integrable, twice weakly differentiable functions


(w, e, s)  in Chap. 8: components of the boundary map, that is, a triplet of functions mapping a hive to the constant coordinates of its west, east and south boundary rays, respectively, Eq. (8.15)
W_S  product of nodes W_S := N'_S L' L N_S for S ⊂ V (or other specified nodes); equals Z'_S Z_S, Eq. (4.14)
x, y, z  vectors in R^k, k ∈ N, with entries x_1, ..., x_k ∈ R
X, Y  components of a low-rank matrix decomposition A = XY, Sec. 5.1.1
Z_S  product of nodes Z_S := L N_S for S ⊂ V (or other specified nodes), Eq. (4.26)

Tensor Nodes

N = N(γ)  tensor node with mode labels γ = {γ_1, ..., γ_k}#, Not. 2.14
N = N(γ, γ)  matrix node with duplicate mode label set γ, m(N) = γ ⊎ γ, Sec. 2.3.2 and Eqs. (2.29) and (2.32)
‖N‖  norm of a node, ‖N‖ = ‖v‖_f, N = (v, f), where ‖·‖_f is the norm induced on H_f, Def. 2.21
N(γ = x)  restriction of the modes with labels γ_1 < ... < γ_k to the points x = (x_1, ..., x_k) ∈ Ω_γ, respectively, Def. 2.25
N(γ ∈ D)  restriction of the domain associated to γ to the new domain D ⊂ Ω_γ, Def. 2.25
N^{(γ),(δ)}  unfolding (matricization) of the node N w.r.t. the label sets γ, δ, N^{(γ),(δ)} ∈ R^{d(γ)×d(δ)}, Def. 2.26
N(γ ↦ γ')  renaming of mode labels γ into mode labels γ', Def. 2.38
N^T  transpose of a node, reverses the ordering of modes with equal labels, Def. 2.20
N_S  accumulation and product of the nodes N_s, s ∈ S, S ⊂ V

Symbols

∼_c  in Chap. 8: relation defining Weyl's problem, related to the TFP, Def. 8.11
tensor node product = KC (K: keep mode, C: contract mode), Def. 2.8, 2.13, 2.15 and 2.16
◦  composition of mappings, f_1 ◦ f_2(x) = f_1(f_2(x))
γ ↦ j  multiset {γ, ..., γ}# of cardinality j, or ⊎_{i=1}^j γ if γ is a set (of labels), Sec. 2.3.3
·^H  Hermitian (complex) transpose of a matrix, (A^H)_{i,j} = conj(A_{j,i})
<  ordering; abstract ordering of mode label sets, Not. 2.19
·_{:,j}  the j-th column of a matrix, Def. 5.9
·_{i,:}  the i-th row of a matrix, Def. 5.9
·#  distinguishes sets {·} from multisets {·}#, Sec. 2.3.3
‖·‖  norm, induced norm on a tensor product of Hilbert spaces, Sec. 2.2.1
‖·‖_F  Frobenius norm, or Hilbert-Schmidt norm, equals the induced norm on tensor products of Euclidean spaces, Sec. 2.2.2
·_γ  (induced) product H_γ × H_γ → H_γ, by default the Hadamard product (entrywise multiplication), Sec. 2.3.1
⊗_a  (algebraic) tensor product, Sec. 2.2.1
⊗  (topological) tensor product, for Hilbert spaces w.r.t. the induced norm on the algebraic tensor product, Sec. 2.2.1
⊗^i  i-fold tensor product, Eq. (2.18)
·^+  updated element in an iterative process, Eq. (4.7)
·^†  (Moore-Penrose) pseudoinverse of a matrix
·_+  positive part of a nonnegative sequence or vector, Def. 7.1
⟨·, ·⟩_γ  scalar product (induced) on H_γ, by default the Euclidean/ℓ2/L2 scalar product, Sec. 2.2.1 and 2.3.1
|·|  seminorm; or cardinality of a set or multiset, e.g. |{1, 2}| = 2, |{1, 1}#| = 2; or volume, |[a, b]^d| = (b − a)^d, Sec. 2.3.3
\  subtraction of elements, for multisets: (S_1 \ S_2)(e) = max(0, S_1(e) − S_2(e)), Sec. 2.3.3


(i), (J), (α_J)  (i), (J) are used as indices; (J) also denotes the matricization of a tensor with respect to a set J ⊂ D, or, equivalently, to the labels α_J for T = T(α), Def. 7.2 and 2.26 and Sec. 3.3.1
·^T  transpose of a matrix, (A^T)_{i,j} = A_{j,i}; for nodes, see the notations for tensor nodes
·_{≠v}  restriction to all entries but index v; accumulation (and product) of all nodes in V \ {v}, Sec. 4.3
⊎  additive union of two (multi)sets, (S_1 ⊎ S_2)(e) = S_1(e) + S_2(e), Sec. 2.3.3

Terms and Abbreviations

ALS  alternating least squares, Sec. 4.3
compatible  a realizable constellation of eigenvalues in the QMP, Def. 7.4
duplicate mode label  a mode label that appears more than once within one tensor node, such as N = N(γ, γ), Sec. 2.3.2
equivalent network  network which represents the same tensor
feasible  a realizable constellation of values, in particular singular values in the TFP, Def. 7.2
hive  in Chap. 8: collection of honeycombs, Sec. 8.3.2
honeycomb  in Chap. 8: (graphical interpretation of the) H-description of the cone of feasible triples λ µ ∼_c ν, Sec. 8.3.1
HT  hierarchical Tucker, Sec. 2.5.3
hypergraph  graph with an edge of cardinality other than two, Def. 3.6
mode labels  multiset such as γ denoting multiplicities of Hilbert spaces H_{γ_1}, ..., H_{γ_d}, Eq. (2.23)
multigraph  graph with duplicate edges such that E is a multiset, Def. 3.6
multiset  set with multiplicities S = {e_1, ..., e_k}#, interpretable as a function with codomain N_0, S(e) := |{i | e = e_i}|, Sec. 2.3.3
nested variable  variable in an algorithm that is global within a declared procedure and its subfunctions, in teal color
node SVD/QR  SVD/QR decomposition w.r.t. a set of mode labels, Sec. 2.6.3
normal form  normalized form of a network that is the tree SVD of the tensor it represents, Thm. 3.16
c-orthogonal  tensor network which is orthogonal w.r.t. a node c ∈ V, Def. 3.9
γ-orthogonal  (row-, column-)orthogonality constraint w.r.t. labels γ, e.g. N^{(m(N)\γ),(γ)} is column-orthogonal, Sec. 2.6.2 and Def. 2.32
orthogonal matrix  a matrix that has orthonormal columns (since this terminology is common in the literature)
QMP  (a version of the) quantum marginal problem, Sec. 7.2
redundant network  tree tensor network with the same expressiveness as a smaller, partially contracted network, Def. 3.8
representation  a network, with emphasis on its use as a tool to represent a tensor
RIP  restricted isometry property, Eqs. (5.14) and (5.15)
Rwals  reweighted alternating least squares, Eq. (5.37) and Sec. 5.7
Salsa  stable alternating least squares approximation, Sec. 5.3, 5.5.1 and 5.6.3
stable  in this thesis, a stable method in the sense of Def. 5.1
SVD  singular value decomposition, Thm. 2.4
tensor format  class of networks corresponding to a specific concept of graph, Sec. 2.5
tensor node  a tuple N = (v, f) of an ordinary tensor v ∈ H_f together with its mode labels f = m(N), Eq. (2.23)
TFP  tensor feasibility problem, Sec. 7.1
tree SVD  essentially unique tree tensor network as a close analogue to the ordinary matrix SVD, also known as canonical or normal form, Thm. 3.16
tree tensor network  network of nodes, without duplicate mode labels in each single node, corresponding to a tree graph, Def. 3.7
tree  a graph without cycles
TT  tensor train, Sec. 2.5.1 and 3.4.1


Graphs and Networks

The following pictorial node symbols are used in network diagrams:
γ-orthogonal node, γ ⊂ m(N)
singular value node σ = σ(β), with mode label β
matrix node R = R(γ, γ), with mode labels γ, γ
a rooted node to which a network is orthogonalized


Chapter 1

Introduction

Dear reader,

the short, first part of this introduction and overview is dedicated to all who are less familiar with tensors or mathematics in general. In Section 1.2, we briefly recapitulate this initial part in a formal manner, while the main mathematical introduction starts in Chapter 2. Section 1.3 outlines the organization of this thesis and includes statements of contribution.

1.1 A Less Formal Preface

A so-called tensor, or at least a large class of them, can be conceptualized as entries within a regular grid. While a matrix can be visualized as a table of rows and columns, a tensor generalizes this concept by adding further dimensions.


Figure 1.1: Each step adds one more characteristic: size, filling, shape and color. The different combinations of these yield, from left to right, a vector with 2 entries, a 2 × 2 matrix and a 2 × 2 × 2 tensor. Last but not least, we add a fourth dimension, at which point one may be unaccustomed to its visualization, and obtain a 2 × 2 × 2 × 2 tensor.

For each of the four dimensions of the tensor in Fig. 1.1, there is a specific size, filling, shape and color, respectively. So if we construct the tensor by this recipe, far fewer options remain, and this effect becomes more noticeable with increasing dimension. The tensor is in this sense less complex than its size may suggest, and it is called an elementary tensor. In general, a tensor may result from any categories, such as the parameters involved in the design of an experiment or device, and thereby summarize all possible outcomes. The number of elementary tensors that are needed to represent a tensor (by summation) is called its rank. The lower the rank of a tensor, the less complex it is.

In the simplest case above, by our recipe, we have disassembled the tensor into smaller objects that are much easier to handle. One does not usually draw tensors as elaborately as above. Instead, we depict them as (teal) circles with lines attached to them, one for each of their dimensions. The representation of the elementary tensor as a combination of choices within the four categories can then be visualized as in Fig. 1.2.

Such representations (as we introduce in Part I of this thesis), once we turn them into more complicated networks, are a basis for algorithms that are applied to problems depending on many parameters, or categories. At the same time, the levels of complexity of tensors become more distinguished, to a point at which they become the focus of attention themselves (as in Part III). These tensor networks happen to resemble the neural networks within human brains, and there is indeed a cross connection between tensor theory and artificial neural networks.


Figure 1.2: Schematic decomposition of a rank-one tensor. The symbol ⊗ here just reminds us of the recipe applied above. Usually, in each entry of the tensor, we would for example be allowed to reduce the color value if we increase its size by the same fraction. So the example should not be over-interpreted.

If we were to remove one or a few (but not too many) entries within the matrix, or in one of the two tensors above (Fig. 1.1), then we could intuitively guess the missing entries. This assumption of an underlying structure is a key fact related to nearly all tensor algorithms. Formally, the task of reconstructing such entries is called the completion problem (one of the main concerns in Part II), which among many other applications is used in movie recommendation systems, cancer prediction and seismographic analysis. Incidentally, similar forms of data compression techniques are also used for entertainment purposes, to imitate artwork which Van Gogh for example never drew or to show pictures of cats that do not exist1.
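As a toy illustration (an example of ours, not one from the thesis itself): if the 2 × 2 matrix with entries (1, 2; 3, ?) is assumed to have rank one, its rows must be multiples of each other, so the missing entry is forced to be ? = 6. Completion methods exploit exactly this kind of low-rank structure, only for much larger matrices and tensors.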

1.2 Tensors as Multivariate Functions

The 4-dimensional tensor as depicted in Fig. 1.1, once we replace its entries with real valued numbers2, is a classical interpretation as a multi-array, or discrete multivariate function,

t ∈ R^{n_1 × ... × n_d},   t_{i_1,...,i_d} ∈ R,   i_µ ∈ {1, ..., n_µ},   µ = 1, ..., d,

for d ∈ N and n ∈ N^d (where, above, d = 4 and n_µ = 2, µ = 1, ..., d). Due to the isomorphism by which the following two spaces can be identified with one another,

V := R^{n_1 × ... × n_d} ≅ R^{n_1} ⊗ ... ⊗ R^{n_d},

it gives a good impression of what the tensor product ⊗ represents. Interpreted as a Euclidean space, V becomes a Hilbert space, as it is equipped with the scalar product

⟨t, a⟩ := ∑_{i=(i_1,...,i_d) ∈ ×_{µ=1}^{d} {1,...,n_µ}} t_i · a_i,   t, a ∈ V,

1 Although we are not concerned with neural networks, the deepart and thiscatdoesnotexist websites are quite entertaining.

2 Within the hypercube, we have in fact visualized each of the 4 factors within the tensor product as geometric objects with 4 properties, and not the result itself.


which is the one induced by the scalar products on the univariate spaces R^{n_µ}, µ = 1, ..., d. The rank-1 decomposition of the elementary tensor as in Fig. 1.2 can be written as

t_i = x^{(1)}_{i_1} · ... · x^{(d)}_{i_d},   i_µ = 1, ..., n_µ,   µ = 1, ..., d,

for vectors x^{(µ)} ∈ R^{n_µ}, µ = 1, ..., d. Phrased as a tensor product, we may interpret this equation as

t = x^{(1)} ⊗ ... ⊗ x^{(d)}.
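As a small illustration of this construction, the following Matlab sketch (with assumed mode sizes n_µ = 2; not code from the toolbox described later in the thesis) forms an elementary tensor from its factor vectors, using the fact that its columnwise vectorization is the Kronecker product of the factors in reversed order:

    % Elementary (rank-one) tensor t = x^(1) (x) x^(2) (x) x^(3) (x) x^(4), assumed sizes n = [2 2 2 2].
    n = [2 2 2 2];
    x = {randn(n(1),1), randn(n(2),1), randn(n(3),1), randn(n(4),1)};
    % Column-major vectorization: vec(t) = kron(x{4}, kron(x{3}, kron(x{2}, x{1}))).
    t = reshape(kron(x{4}, kron(x{3}, kron(x{2}, x{1}))), n);
    % Entrywise check of t_i = x^(1)_(i1) * x^(2)_(i2) * x^(3)_(i3) * x^(4)_(i4):
    t(2,1,2,1) - x{1}(2)*x{2}(1)*x{3}(2)*x{4}(1)   % approximately zero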

The more elaborate tensor networks mentioned above are specifically structured sums of such elementary tensors. As a generalization of the low-rank decomposition of a matrix A = XY, X ∈ R^{n_1×r}, Y ∈ R^{r×n_2}, r = rank(A), the tensor t may be factorized into multiple lower-dimensional tensors (or matrices) y^{(i)}, i = 1, ..., k,

t = τ(y^{(1)}, ..., y^{(k)}),

for a multilinear map τ. Together with related concepts, this approach is introduced in Part I. Optimization algorithms can then operate on these components, as is our main point of attention in Part II. One particularly useful, large class of such decompositions is summarized under the term hierarchical Tucker format (or tree tensor network). Each specific instance is determined by a so-called hierarchical (as defined later) family of subsets K ⊂ {J | J ⊂ {1, ..., d}}. For each subset J ∈ K, one defines a simple reshaping of the tensor t into a matrix

t^{(J)} ∈ R^{n_J × n_{{1,...,d}\J}},   n_J := ∏_{j∈J} n_j.   (1.1)

Thereupon, the corresponding set of low-rank tensors is defined as

T_{r,K} := {t ∈ R^{n_1 × ... × n_d} | rank(t^{(J)}) = r_J, J ∈ K}.

Due to the hierarchy property of K, each tensor in this set can be decomposed into a corresponding tree network (which implies the map τ) of components whose sizes are determined by the ranks r_J. Thereby, the collection {r_J}_{J∈K} determines the overall complexity of such tensors. For a matrix, we have the simple family K = {{1}}, and the associated graph contains only two nodes X and Y, connected via a single edge. An analogy to the matrix singular value decomposition (perhaps more familiar under the name principal component analysis) exists as well, and it provides more distinguished levels of complexity based on the singular values of each of the matrices t^{(J)}, J ∈ K. Due to their dependence on the same tensor, these values are not independent of each other, as we discuss in Part III.
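To make the reshaping in Eq. (1.1) concrete, the following Matlab sketch (with assumed, generic mode sizes; again not code from the thesis toolbox) builds the matricization t^{(J)} for one subset J and evaluates the rank that enters the definition of T_{r,K}:

    % Matricization t^(J) of t in R^(n1 x ... x nd) with respect to a subset J of {1,...,d}, cf. Eq. (1.1).
    n = [3 4 2 5]; d = numel(n);
    t = randn(n);                           % generic example tensor
    J = [1 3];                              % example subset
    rest = setdiff(1:d, J);
    tJ = reshape(permute(t, [J rest]), prod(n(J)), prod(n(rest)));   % t^(J) of size nJ x n_{{1,...,d}\J}
    rank(tJ)                                % the rank constrained to r_J in T_{r,K}

For a generic random tensor this rank is maximal; it is precisely the requirement of small ranks r_J for all J ∈ K that defines the low-rank sets above.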

1.3 Overview and Statements of Contribution

Some parts, as specified below, are based on the previously published papers [42]GrKr19 and [68]Kr19. I am the corresponding author of [42]GrKr19, and all parts in this thesis that originate therefrom have been my contribution. Of the paper [68]Kr19, I am the sole author.

In Part I of this thesis, we present and analyze general concepts related to (tree) tensor networks.

The graphical depiction of such networks is formalized by a specific calculus on so-called tensor nodes, as we introduce in Chapter 2. Albeit developed independently as a generalization emerging from [42]GrKr19 and influenced early on by [56], the most similar approach is likely to be found in [30]. In Chapter 3, we analyze the behavior of tensor networks corresponding to tree graphs, and present a central result of this thesis, the so-called tree tensor singular value decomposition (based on the hierarchical Tucker decomposition [40, 48]), which incorporates multiple normal forms of different, specific tensor formats. This includes all-orthogonality for the Tucker format or HOSVD [21], the canonical form for the tensor train, or MPS, format [84, 100] and the gauge conditions presented in [40] for the hierarchical Tucker format. Different tensor decomposition methods are rewritten in the tensor node arithmetic, thereupon analyzed and compared to the literature, as well as outlined as corresponding implementations in Matlab.

In Part II, the previously introduced concepts are applied to the solution of linear systems and multivariate recovery problems.

Chapter 4 serves as an introduction to the subsequent sections and discusses how alternating least squares algorithms can utilize the tensor network structure, similar to, for example, [56]. The tensor node calculus allows us to complement these well known results by a generalized coarse alternating conjugate gradient scheme, related to a special case in [22] for the tensor train format. We further discuss its connection to the tensor restricted isometry property. While the algorithms in the prior section are for a fixed rank, Chapter 5 introduces a specific concept of stability that is concerned with the adaptivity of the model complexity in iterative methods. Based on this concept, we derive a rank-adaptive, so-called stable alternating least squares approximation (Salsa). In particular Sections 5.1 to 5.3 for the matrix case are based on the previously published paper [42]GrKr19. In Section 5.4, we further discuss a formerly unknown close connection of Salsa to reweighting in nuclear norm minimization. The subsequent Sections 5.5 and 5.6 generalize the techniques and results in [42]GrKr19 and the above-mentioned reweighting to tensors and arbitrary tree networks. Section 5.7 contains updated (due to the modifications presented in this thesis) results of the numerical experiments covered in [42]GrKr19. In Chapter 6, the algorithms introduced in the prior chapter are generalized to allow for an approximate interpolation of high-dimensional, scattered data, based on a regularization originating from thin-plate splines. At the same time as the presented theory was developed, the group of A. Nouy worked on the similar topic of learning with tree-based tensor formats, which has recently appeared as a preprint [44]. The chapter remains yet independent, not least because of differences in the overall approach. In Section 6.1.1, however, we refer to a more detailed discussion of the theoretical background for the approximation in tensor product Sobolev spaces [1]. Section 6.3 contains a demonstration of the presented algorithm for data provided by the institute of metal forming (IBF) at RWTH Aachen.

Part III is of a more theoretical and algebraic nature; in it, we are concerned with an analysis of the interrelation between singular values associated to tensors.

In Chapter 7, we show that this tensor feasibility problem is equivalent to a version of the quantum marginal problem. Sections 7.1 to 7.3 recall the corresponding sections of the previously published article [68]Kr19, but have been complemented with more details. Section 7.4.1 applies results related to the tree SVD in order to generalize the decoupling presented in [68]Kr19 for the tensor train format. In Section 7.4.3, we present an alternative proof for a feasibility result contained in the final publication of [23]3, which allows for a fast algorithm that constructs a tensor with prescribed, largest Tucker singular values. While Section 8.1 in Chapter 8 explains the context of the decoupling presented in the previous chapter, Sections 8.2 to 8.4 recall the corresponding sections in [68]Kr19, with the exception of Section 8.4.2, which discusses the relation between rates of exponential decay of singular values, and Section 8.4.4, which presents a conjecture reasoned upon the techniques and proofs discussed in these sections. The last Section 8.5 presents a summary of methods related to feasibility for the tensor train format.

3 Their result is accidentally not mentioned in the abstract of the paper.

I would here also like to thank Prof. Morten Mørup for mentioning reweighted least squares to me in the context of [42]GrKr19, as well as the anonymous reviewers of [68]Kr19 for leading me towards the quantum marginal problem literature.


Part I

Tensor Networks and the Tree SVD


Chapter 2

Calculus of Tensor Nodes and Networks

2.1 Introduction

The structure that is given by a collection of tensors, subject to certain interrelations between them, is often referred to as a tensor network. As these can typically be related to graphs, visualizations of such networks have been common. The articles [9, 27, 80] provide good surveys from the perspective of physics. Summaries of countless tensor methods and their applications within the mathematical community can be found in [4, 17, 20, 43, 79].

It turns out that a comprehensible formalization of general tensor networks requires a certain amount of work. As we discuss in the following, one often finds a large discrepancy between the simplicity of the diagrams and the intricacy of formal proofs.

2.1.1 Motivation of the Tensor Node Arithmetic

In this initial chapter, we develop a tensor node calculus that serves as a formalization of tensor networks and their graphical interpretation as below in Fig. 2.1. As we underline in Section 2.1.2, it emerges from two familiar concepts — global labeling and the Einstein notation — and yields an algebraic structure that is practicable and naturally connected to such graphs.

Figure 2.1: A tensor network diagram of four nodes N_1, ..., N_4, in which a partial contraction M = ⊙(N_1, ..., N_3) appears. Concealed by this notation is a summation equivalent to
M_{i_1,i_2,i_3} = Σ_{j_1,j_2} (N_1)_{i_1,j_1} · (N_2)_{i_2,j_1,j_2} · (N_3)_{i_3,j_1,j_2},   for all i_1, i_2, i_3.

There are two opposing approaches to tensor networks: the interaction between objects may be stated manually depending on each instance, or, on the contrary, be as a whole immutably defined through a given map. These two extreme cases demonstrate that a central issue is a balance between incomprehension but flexibility, and simplicity but fixation. Both ends have their applications, but seldom allow for an easy identification of underlying structures. In the first outlined extreme, assertions may be shallow due to their individuality, while in the second case, they are more difficult to generalize.

The tensor node arithmetic to be carried out naturally has deep roots in the vast collection of both approaches, which have long existed in the physics communities and more recently in the mathematical communities. Some of the presented concepts already emerge from [42]GrKr19 and have been subject to further generalizations with regard to tensor networks. Although developed independently, the most similar foundation is likely to be found in [30]. Also [56] stands out due to its early influence on this work, in particular the resemblance of illustrations. We aim to give a deeper understanding of tensor networks through an arithmetic strongly related to their graphical depiction and without an overuse of indices, while, most importantly, still being bound to strict mathematical foundations. At the same time, certain assertions such as the tree singular value decomposition seem otherwise impracticable to formulate precisely, let alone to prove formally. The arithmetic is further, in a literal way, realized by a toolbox as described in Section 2.1.3.

2.1.2 A Brief Impression

We here provide a brief impression of the tensor node arithmetic, which is introduced in formal detail from Section 2.3 on, and discuss its basic scheme. When dealing with multivariate functions, one often assigns an abstract label to each of the variables, as for example in the declaration of the function
f : R³ → R,   f = f(x, y, z) = x² + y² + z².

One can use these labels, for example, to specify partial derivatives or to assign a value to one of the variables. This has the advantage that expressions such as
∂_y(f|_{x=1}) = (∂_y f)|_{x=1}
hold true. The order of these two operations may hence be exchanged. This would not be the case if one would strictly reference by numbers, since in g : R² → R, g := f|_{x=1} = f(1, ·, ·), the positioning changes. While context here is yet sufficient to interpret these as commuting operations, this approach soon leads to complications in more complex situations. The notation above is moreover much easier to comprehend, in particular when x, y, z have intrinsic meaning, and we do not have to remember the order in which these labels appear.

We make use of this reference by labels also for tensors, which include multivariate functions as special cases (cf. Section 1.2). Given for example a tensor T ∈ R^{n_1×n_2×n_3}, we assign mode labels α_1, α_2, α_3 to it by writing T = T(α_1, α_2, α_3), or short T(α). We then call T a tensor node, usually using uppercase letters to emphasize this. We orient ourselves towards conventional notations, but here index with α_µ instead of the number µ ∈ {1, 2, 3}. Similar to notation in probability theory, a specific entry i = (i_1, i_2, i_3) of T is retrieved by
T_i = T(α = i) = T(α_1 = i_1, α_2 = i_2, α_3 = i_3) ∈ R.

There is no longer need to keep a fixed order of labels, although we keep the natural ordering α_1 < α_2 < α_3 in mind. For example, T(α = i) = T(α_2 = i_2, α_1 = i_1, α_3 = i_3) yields the same entry. A matricization, or unfolding (cf. Eq. (1.1), [40]), of the tensor T is written as
T^{(1,2)} = T^{(α_1,α_2)} ∈ R^{n_1·n_2 × n_3}.


The matrix T^{(α_1,α_2)} is then not treated as a tensor node anymore, unless we reassign new mode labels to it. For better readability, we include the remaining modes in the following. Thus, we write T^{(1,2),(3)} = T^{(α_1,α_2),(α_3)} := T^{(1,2)}. We also allow to restrict only a subset of indices, such that
T_{i_1,:,:} = T(α_1 = i_1)^{(α_2),(α_3)} ∈ R^{n_2×n_3}.

In the tensor node arithmetic, we essentially combine references by labels with a variant of the Einstein notation, in which equal indices imply a summation. In our context, these summations are generalized contractions, usually in larger networks of tensors. While we comprehensively introduce this arithmetic in subsequent sections in formal detail, we also give simpler examples for the more common, simpler uses. Further, in later, advanced chapters, we use less formal notation, such as the one used here.

The calculus that emerges from the above-mentioned two principles is carried out by the product for which we have chosen the symbol ⊙. We exemplify it through a decomposition of a three-dimensional tensor (in the tensor train, or MPS, format). Let
N_1 = N_1(α_1, β_1) ∈ R^{n_1×r_1},   N_2 = N_2(β_1, α_2, β_2) ∈ R^{r_1×n_2×r_2},   N_3 = N_3(β_2, α_3) ∈ R^{r_2×n_3},
be three tensor nodes. The simplest instance of a multiplication of two such nodes is given by the product
A = N_1 ⊙ N_2.

Then, since β_1 is a label appearing in both N_1 and N_2, this index is summed over and disappears, and we have A = A(α_1, α_2, β_2) ∈ R^{n_1×n_2×r_2}. The entries of A are given by
A(α_1 = i_1, α_2 = i_2, β_2 = ℓ) = Σ_{k=1}^{r_1} N_1(α_1 = i_1, β_1 = k) · N_2(β_1 = k, α_2 = i_2, β_2 = ℓ).
We can also write this multiplication as a matrix product,
A^{(α_1),(α_2,β_2)} = N_1^{(α_1),(β_1)} · N_2^{(β_1),(α_2,β_2)}.

The ⊙ product here allows to elegantly pull a restriction into single factors,
A(α_1 = 1, α_2 = 2) = N_1(α_1 = 1) ⊙ N_2(α_2 = 2).
The multiplication of all three nodes is associative and commutative, such that for every permutation π we have that
T = N_1 ⊙ N_2 ⊙ N_3 = N_{π(1)} ⊙ N_{π(2)} ⊙ N_{π(3)},
whereby the tensor T ∈ R^{n_1×n_2×n_3} is given through
T(α = i) = Σ_{k=1}^{r_1} Σ_{ℓ=1}^{r_2} N_1(α_1 = i_1, β_1 = k) · N_2(β_1 = k, α_2 = i_2, β_2 = ℓ) · N_3(β_2 = ℓ, α_3 = i_3).

This arithmetic is particularly convenient when various tensors appear in a larger network, such as in the common tensor formats discussed in Section 2.5. Since the mode labels define the interaction between multiple nodes, collections of such nodes automatically form networks, while subsets within retain the structure.
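As a minimal numerical illustration of the label-driven contraction above (a plain NumPy sketch, independent of the Matlab toolbox described in Section 2.1.3; the mode sizes and ranks are hypothetical), the shared labels β_1, β_2 simply become repeated einsum indices:

```python
import numpy as np

n1, n2, n3, r1, r2 = 4, 5, 6, 2, 3           # hypothetical mode sizes and ranks
N1 = np.random.rand(n1, r1)                  # N1(alpha1, beta1)
N2 = np.random.rand(r1, n2, r2)              # N2(beta1, alpha2, beta2)
N3 = np.random.rand(r2, n3)                  # N3(beta2, alpha3)

# shared labels (b = beta1, c = beta2) appear twice and are summed over,
# as in the Einstein-notation reading of the node product
T = np.einsum('ib,bjc,ck->ijk', N1, N2, N3)  # T(alpha1, alpha2, alpha3)

# the partial product A = N1 (product) N2 keeps the still-open label beta2
A = np.einsum('ib,bjc->ijc', N1, N2)
assert np.allclose(T, np.einsum('ijc,ck->ijk', A, N3))
print(T.shape)  # (4, 5, 6)
```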


2.1.3 Implementation of the Tensor Node Arithmetic

Algorithms as well as many assertions given in this work are virtually identical to their actual implementations and contained in a toolbox that is available online¹ under the name tensor-node-toolbox, or directly at

https://git.rwth-aachen.de/sebastian.kraemer1/tensor-node-toolbox.

It features a comprehensive introduction to its functionality through Matlab live scripts, which are also included in the form of precompiled pdf documents. A glance into this more practical introduction might as well give a good first impression. The toolbox was further used to implement the complex algorithms introduced in Chapters 5 and 6.

¹Note that therein, some terminology and algorithm names might yet differ.

2.2 Elementary Tensor Calculus on Real Hilbert Spaces

The introductory chapters of the excellent book

Tensor Spaces and Numerical Tensor Calculus, W. Hackbusch (2012) [46]

constitute a theoretical foundation for this thesis. The elementary tensor calculus contained therein, which we recall in the following, is our starting point.

2.2.1 The Algebraic and Topological Tensor Product

The algebraic tensor product ⊗_a between two vector spaces V and W over R can formally be defined as the quotient vector space
V ⊗_a W := F/∼,
where F is the free vector space over V × W (cf. [46, Section 3.2.1]). For each pair (v, w) ∈ F, an element in this quotient space is instead written as v ⊗ w. The equivalence relation ∼ is the one generated by the condition that this product becomes a bilinear map
⊗ : V × W → V ⊗_a W.
Since V ⊗_a W is by definition a vector space as well, we have that
V ⊗_a W = { Σ_{i=1}^{r} v_i ⊗ w_i | v_i ∈ V, w_i ∈ W, i = 1, ..., r, r ∈ N_0 }.

The tensor product v ⊗ w between two vectors v ∈ R^n and w ∈ R^m can for example be identified with vw^T ∈ R^{n×m} ≅ R^n ⊗_a R^m, while the scalar product is given by v^T w. An alternative definition of the tensor product is provided by the following universal property.

Proposition 2.1 (Universal property [46, Proposition 3.22]). The space of linear functions from V ⊗_a W to a vector space U is isomorphic to the space of bilinear forms from V × W onto U,
hom(V ⊗_a W, U) ≅ bil(V, W; U).
In particular, for any bilinear map φ : V × W → U, there is a unique linear map Φ : V ⊗_a W → U such that Φ(v ⊗ w) = φ(v, w) for all v ∈ V, w ∈ W.
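For finite-dimensional Euclidean spaces, the identification of v ⊗ w with the rank-one matrix vw^T mentioned above, together with the bilinearity that generates the equivalence relation ∼, can be checked numerically; the following lines are an illustration only, not part of the formal development:

```python
import numpy as np

v, v2, w = np.random.rand(3), np.random.rand(3), np.random.rand(4)
a = 2.5

vw = np.outer(v, w)                      # v (x) w identified with the matrix v w^T

# bilinearity: (a*v + v2) (x) w = a*(v (x) w) + v2 (x) w
assert np.allclose(np.outer(a * v + v2, w), a * np.outer(v, w) + np.outer(v2, w))

# scalar products of elementary tensors factorize: <v (x) w, v (x) w> = <v,v> * <w,w>
assert np.isclose(np.sum(vw * vw), np.dot(v, v) * np.dot(w, w))
```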



For vector spaces U_1 and U_2, the tensor product φ_1 ⊗ φ_2 of two linear functions, that is φ_1 ∈ hom(V, U_1) and φ_2 ∈ hom(W, U_2), has an interpretation as a linear function Φ ∈ hom(V ⊗_a W, U), U := U_1 ⊗_a U_2, induced by the bilinear form φ(v, w) := φ_1(v) ⊗ φ_2(w),
Φ(v ⊗ w) = φ_1(v) ⊗ φ_2(w),   v ∈ V, w ∈ W.
However, equality in hom(V, U_1) ⊗_a hom(W, U_2) ⊆ hom(V ⊗_a W, U) holds true if and only if V and W are finite-dimensional (cf. [46, Proposition 3.49]).

The algebraic tensor product of more than two spaces V_1, ..., V_d is defined analogously via multilinear maps, and for all subsets J ⊂ D = {1, ..., d}, we have (cf. [46, Section 3.2.5])
⊗^a_{µ∈D} V_µ ≅ (⊗^a_{i∈J} V_i) ⊗_a (⊗^a_{j∈D\J} V_j),
whereas the above procedures extend straightforwardly to this case.

The topological tensor product ⊗_{‖·‖} between Banach spaces is defined as the closure of the algebraic one with respect to a norm ‖·‖ (cf. [46, Section 3.2.1]),
V ⊗_{‖·‖} W := closure of V ⊗_a W with respect to ‖·‖,   ⊗^{‖·‖}_{µ∈D} V_µ := closure of ⊗^a_{µ∈D} V_µ with respect to ‖·‖.

So while the algebraic tensor product space contains all finite sums, the topological one also contains the infinite ones, so to speak. For two Hilbert spaces V and W, there is a unique induced scalar product on V ⊗_a W (cf. [46, Section 4.5.1]) subject to the condition
⟨v_1 ⊗ w_1, v_2 ⊗ w_2⟩ = ⟨v_1, v_2⟩ · ⟨w_1, w_2⟩,   ∀ v_1, v_2 ∈ V, w_1, w_2 ∈ W.   (2.1)

As this scalar product defines a norm ‖·‖, the short notation ⊗ denotes the topological tensor product with respect to this norm. As for the algebraic one, it further holds that
⊗_{µ∈D} V_µ ≅ (⊗_{i∈J} V_i) ⊗ (⊗_{j∈D\J} V_j),   (2.2)
for all subsets J ⊂ D. The topological tensor product is, despite its name, not actually a tensor product in the conventional sense, since it does not necessarily fulfill the universal property. Its dual space may not even be isomorphic to the space of all continuous² bilinear forms,
(V ⊗ W)* ≇ bil_0(V, W).   (2.3)
For example, φ = ⟨·, ·⟩ ∈ bil_0(V, V) is a continuous, bilinear form, but it induces the trace operation on V ⊗_a V, which is not bounded. However, if the induced mapping is in fact continuous, then the following lemma can be applied.

Lemma 2.2 (Linear extension theorem [46, Remark 4.1]). Let S be a (dense) subspace of a Hilbert space H and let Φ : S → H be a continuous, linear form. Then Φ can be extended to a (unique) continuous, linear form on the whole of H.

Analogously, the same is valid for bilinear forms.

Lemma 2.3 (Bilinear extension theorem [51, Corollary 1]). Let S_1, S_2 be (dense) subspaces of Hilbert spaces H_1, H_2, respectively, and let b : S_1 × S_2 → H be a continuous, bilinear form. Then b can be extended to a (unique) continuous, bilinear form on the whole of H_1 × H_2.

For dense subspaces, the uniqueness follows directly by continuity of the extended map.

²Continuity on the cross-product norm for multilinear forms is equivalent to the bound φ(v_1, ..., v_k) ≤ C‖v_1‖···‖v_k‖ for all v_i ∈ V_i, i = 1, ..., k, where C = ‖φ‖_{W←V_1×...×V_k} is the operator norm of φ.


2.2.2 (Infinite) Singular Value Decomposition and Hilbert-Schmidt Operators

In this section, let V and W be infinite-dimensional Hilbert spaces. We do not repeat the following assertions for finite dimensions, as they can easily be reformulated. The dual of a vector v ∈ V is given as v* : V → R, v* = ⟨v, ·⟩. A fundamental theorem for all subsequent sections is the infinite singular value decomposition (SVD).

Theorem 2.4 (Infinite singular value decomposition [46, Theorem 4.114]). Let Φ : W → V be a compact operator for infinite-dimensional Hilbert spaces V, W. Then there exist nonnegative values σ_1 ≥ σ_2 ≥ ... with σ_i → 0 as well as orthonormal systems {v_i ∈ V | i ∈ N} and {w_i ∈ W | i ∈ N} such that
Φ = Σ_{i=1}^{∞} σ_i v_i w_i*,   (2.4)
which converges with respect to the operator norm ‖·‖_{V←W}:
‖Φ − Φ^{(k)}‖_{V←W} = σ_{k+1} → 0,   Φ^{(k)} := Σ_{i=1}^{k} σ_i v_i w_i*.
Conversely, any Φ defined as in Eq. (2.4) with σ_k → 0 is a compact operator.

The values σ = σ(Φ) = (σ_1, σ_2, ...) are called singular values. We further denote v_i, w_i, i ∈ N, as left and right singular vectors, respectively. The number of nonzero singular values is called rank.

The above theorem hence yields an equivalent characterization of compact operators, and for every Φ as above, we have that (cf. [46, Section 4.4.3])
‖Φ‖_{SVD,∞} := σ_1(Φ) = ‖Φ‖_{V←W}.

Similarly, Hilbert-Schmidt operators are exactly those compact mappings for which
‖Φ‖²_{SVD,2} := Σ_{i=1}^{∞} σ_i²(Φ)   (2.5)

is finite. The set HS(V, W) of such is again a Hilbert space with scalar product defined as
⟨Φ, Ψ⟩_{HS} := Σ_{v∈S} ⟨Φ(v), Ψ(v)⟩,   Φ, Ψ ∈ HS(V, W),
for any orthonormal basis S of V. The definition does indeed not depend on the particular choice of this basis. The norm ‖·‖_{HS} induced by this scalar product thus equals the norm ‖·‖_{SVD,2}, and is also known as the Frobenius norm ‖·‖_F. The Hilbert space HS(V, W) is isometric to the tensor product space V ⊗ W via an isometry φ (cf. [46, Lemma 4.119]) that is uniquely defined through
φ : v ⊗ w ↦ v w*,   v ∈ V, w ∈ W.   (2.6)
For any tensors y_1, y_2 ∈ V ⊗ W, it thus holds true that
⟨y_1, y_2⟩ = ⟨φ(y_1), φ(y_2)⟩_{HS}.   (2.7)
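In finite dimensions, the norms ‖·‖_{SVD,∞} and ‖·‖_{SVD,2} can be read off a matrix SVD directly; a small NumPy check (purely illustrative, with an arbitrary test matrix):

```python
import numpy as np

Phi = np.random.rand(6, 4)                     # a finite-dimensional "operator"
sigma = np.linalg.svd(Phi, compute_uv=False)

# the largest singular value equals the operator (spectral) norm
assert np.isclose(sigma[0], np.linalg.norm(Phi, ord=2))

# the l2 norm of the singular values equals the Hilbert-Schmidt / Frobenius norm
assert np.isclose(np.sqrt(np.sum(sigma**2)), np.linalg.norm(Phi, 'fro'))

# truncating the SVD after k terms leaves an operator-norm error of sigma_{k+1}
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
k = 2
Phi_k = U[:, :k] * s[:k] @ Vt[:k, :]
assert np.isclose(np.linalg.norm(Phi - Phi_k, ord=2), s[k])
```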


Each tensor in V ⊗ W thereby also has an SVD as in Eq. (2.4) via
φ( Σ_{i=1}^{∞} σ_i v_i ⊗ w_i ) = Σ_{i=1}^{∞} σ_i v_i w_i*.   (2.8)

We are particularly interested in the interpretation of Hilbert-Schmidt operators as elements in the corresponding tensor product space, such as V ⊗ W, as is further discussed in Section 2.3.2. The singular value decomposition allows us to easily prove the following two lemmas.

Lemma 2.5. Let V_1, V_2 and H be Hilbert spaces. Further, let φ : V_1 → H be a continuous, linear form. Then there is a unique continuous, linear form
Φ : V_1 ⊗ V_2 → H ⊗ V_2,
induced by
Φ(v_1 ⊗ v_2) := φ(v_1) ⊗ v_2.
The operator norm of Φ is the same as of φ.

Proof. By the universal property, the linear form Φ is uniquely defined on the dense, algebraic tensor product subspace V_1 ⊗_a V_2. We now show that it is bounded. Let v ∈ V_1 ⊗_a V_2 be arbitrary. Then it has a finite singular value decomposition
v = Σ_{i=1}^{r} σ_i v_i^{(1)} ⊗ v_i^{(2)}
for orthonormal systems {v_i^{(1)}}_i and {v_i^{(2)}}_i. We hence have that
‖Φ(v)‖² = ‖ Σ_{i=1}^{r} σ_i φ(v_i^{(1)}) ⊗ v_i^{(2)} ‖² = Σ_{i,j=1}^{r} σ_i σ_j ⟨φ(v_i^{(1)}), φ(v_j^{(1)})⟩_H · ⟨v_i^{(2)}, v_j^{(2)}⟩_{V_2} = Σ_{i=1}^{r} σ_i² ⟨φ(v_i^{(1)}), φ(v_i^{(1)})⟩_H.
Due to ⟨φ(v_i^{(1)}), φ(v_i^{(1)})⟩_H = ‖φ(v_i^{(1)})‖² ≤ ‖φ‖² ‖v_i^{(1)}‖² = ‖φ‖², it follows that
‖Φ(v)‖² ≤ ‖φ‖² Σ_{i=1}^{r} σ_i² = ‖φ‖² ‖v‖².
Hence, ‖Φ‖ = ‖φ‖. Now, since Φ is continuous, it can uniquely be extended to V_1 ⊗ V_2 by Lemma 2.2.

Analogous statements hold true for more than two Hilbert spaces with regard to the isometries in Eq. (2.1), as well as for multiple φ, considering compositions of induced maps. The bilinear version is similar:

Lemma 2.6. Let V_1, V_2, W_1, W_2 and H be Hilbert spaces. Further, let φ : V_2 × W_1 → H be a continuous, bilinear form. Then there is a unique continuous, bilinear form
Φ : V_1 ⊗ V_2 × W_1 ⊗ W_2 → V_1 ⊗ H ⊗ W_2,
induced by
Φ(v_1 ⊗ v_2, w_1 ⊗ w_2) := v_1 ⊗ φ(v_2, w_1) ⊗ w_2.
The operator norm of Φ is the same as of φ.


Proof. By the universal property, the bilinear form Φ is uniquely defined on the dense, algebraic tensor product subspaces V_1 ⊗_a V_2 × W_1 ⊗_a W_2. We now show that it is bounded. Let v ∈ V_1 ⊗_a V_2 and w ∈ W_1 ⊗_a W_2 be arbitrary. Then they have finite singular value decompositions
v = Σ_{i=1}^{r_1} σ_i v_i^{(1)} ⊗ v_i^{(2)},   w = Σ_{j=1}^{r_2} θ_j w_j^{(1)} ⊗ w_j^{(2)},
for orthonormal systems {v_i^{(1)}}_i, {v_i^{(2)}}_i, {w_j^{(1)}}_j and {w_j^{(2)}}_j. We hence have that
‖Φ(v, w)‖² = ‖ Σ_{i,j=1}^{r_1,r_2} σ_i θ_j v_i^{(1)} ⊗ φ(v_i^{(2)}, w_j^{(1)}) ⊗ w_j^{(2)} ‖²
= Σ_{i,j=1}^{r_1,r_2} σ_i² θ_j² ⟨φ(v_i^{(2)}, w_j^{(1)}), φ(v_i^{(2)}, w_j^{(1)})⟩_H ≤ ‖φ‖² Σ_{i=1}^{r_1} σ_i² Σ_{j=1}^{r_2} θ_j² = ‖φ‖² ‖v‖² ‖w‖².
Hence, ‖Φ‖ = ‖φ‖. Now, since Φ is continuous, it can uniquely be extended to V_1 ⊗ V_2 × W_1 ⊗ W_2 by Lemma 2.3.

2.3 Tensor Node Arithmetic

In this section, we formally introduce the graphical interpretation behind certain elementary tensor calculus. Let H_α and H_β be Hilbert spaces³ over R. In the following, we depict vectors such as
v_i ∈ H_α,   w_i ∈ H_β,   i ∈ N,
as nodes, drawn as circles, with legs attached to them, drawn as line segments. To each leg, we assign a label in order to clarify to which Hilbert space it belongs: a node v_i carries a leg labeled α, and a node w_i carries a leg labeled β. These labels can likewise be thought of as the indices or variables of the vector if the corresponding Hilbert space is a function space H_α ⊂ R^{Ω_α} for some set Ω_α. Products between these vectors can also be visualized, as we discuss in the following section.

2.3.1 Three Fundamental Products and their Graphical Interpretation

A large fraction of operations related to so-called tensor networks consisting of elements in Hilbert spaces can be broken down into three elementary products:

• The bilinear (topological) tensor product as in Section 2.2:
⊗ : H_α × H_β → H_α ⊗ H_β =: H_{{α,β}}.
The legs attached to the vectors here remain independent: in the diagram, v_1 ⊗ w_1 is drawn as the two circles v_1 and w_1 with their legs α and β left open, such that each vector in the tensor product space has two legs, or alternatively, one leg with the set {α, β} assigned to it.

³We briefly consider complex Hilbert spaces in Section 2.8.2, as well as linear and bilinear forms on Hilbert spaces in Section 2.8.1.

• The (induced) scalar product on H_γ, for an index γ ∈ {α, β} or the set γ = {α, β}:
⟨·, ·⟩_γ : H_γ × H_γ → R.   (2.9)
This multiplication, also called contraction, may be visualized as an edge between two vectors u_1, u_2 ∈ H_γ which connects their two legs,
⟨u_1, u_2⟩_γ ∈ R,   (2.10)
drawn as the nodes u_1 and u_2 joined by a line labeled γ. For γ = {α, β}, the edge between u_1 and u_2 may also be depicted by two lines, corresponding to legs affiliated with H_α and H_β.

• Remaining is an optional product which we denote by ⊛_γ, for γ ∈ {α, β}:
⊛_γ : H_γ × H_γ → H_γ.   (2.11)
We assume ⊛_γ to be a continuous, commutative and associative, bilinear product, which additionally fulfills
⟨u_1 ⊛_γ u_2, u_3⟩_γ = ⟨u_1, u_2 ⊛_γ u_3⟩_γ   (2.12)
for all possible u_1, u_2, u_3 ∈ H_γ. Since the two vectors interact, but the leg labeled γ remains, its visualization merges the legs of u_1 and u_2 into a single leg that still carries the label γ,
u_1 ⊛_γ u_2 ∈ H_γ.   (2.13)
The Hadamard product on the Euclidean spaces V = R^n, n ∈ N, or V = ℓ²(R),
(x ⊛ y)_i := x_i · y_i,   ∀ i, x, y ∈ V,
is such a product and, if not otherwise mentioned, the default choice. On the other hand, the entrywise product of two square-integrable, unbounded functions need not necessarily be square-integrable. For simplicity, we will only consider cases in which the domain of ⊛_γ does not need to be restricted⁴. As for the scalar product, there is then also a notion of an induced product ⊛_{{α,β}} on H_α ⊗ H_β, to which we get back in Section 2.3.5.

2.3.2 Matrix Spaces and their Graphical Interpretation

In later context, a distinct role is taken by Hilbert-Schmidt operators from W = H_α to V = H_α (cf. Section 2.2.2), in particular for H_α = R^n, with regard to their interpretation as elements in H_α ⊗ H_α ≅ H_α ⊗ H_α*. Such mappings can be interpreted as (infinitely large) square matrices (with finite Hilbert-Schmidt/Frobenius norm). In order to define their action onto one another, it suffices by Lemma 2.6 to consider elementary tensors, or rank-one matrices. The ordinary multiplication of such matrices is given by
∗ : H_α ⊗ H_α × H_α ⊗ H_α → H_α ⊗ H_α,   (v_1 ⊗ v_2) ∗ (v_3 ⊗ v_4) = ⟨v_2, v_3⟩_α · (v_1 ⊗ v_4),

⁴Although, in fact, many algebraic properties that involve this product likewise hold true on that restricted domain.


resulting in a new elementary matrix, as visualized by connecting the second leg of v_1 ⊗ v_2 with the first leg of v_3 ⊗ v_4:
(v_1 ⊗ v_2) ∗ (v_3 ⊗ v_4) = ⟨v_2, v_3⟩_α · (v_1 ⊗ v_4).   (2.14)

The matrix-times-vector multiplication and vector-times-matrix multiplication are analogous and induced by
∗ : H_α ⊗ H_α × H_α → H_α,   ∗ : H_α × H_α ⊗ H_α → H_α,
(v_1 ⊗ v_2) ∗ v_3 = v_1 · ⟨v_2, v_3⟩_α,   v_1 ∗ (v_2 ⊗ v_3) = ⟨v_1, v_2⟩_α · v_3.
The first kind can be visualized as an edge connecting the second leg of v_1 ⊗ v_2 with the leg of v_3,
(v_1 ⊗ v_2) ∗ v_3 = ⟨v_2, v_3⟩_α · v_1.   (2.15)

As previously carried out for the scalar product, one may generalize the (Hadamard) product (if existent) to a matrix-times-vector multiplication
~ : H_α ⊗ H_α × H_α → H_α ⊗ H_α,   ~ : H_α × H_α ⊗ H_α → H_α ⊗ H_α,
(v_1 ⊗ v_2) ~ v_3 = v_1 ⊗ (v_2 ⊛_α v_3),   v_1 ~ (v_2 ⊗ v_3) = (v_1 ⊛_α v_2) ⊗ v_3,
which can also be depicted as a network in which the leg of v_3 merges with the second leg of v_1 ⊗ v_2 while that leg remains open,
(v_1 ⊗ v_2) ~ v_3 = v_1 ⊗ (v_2 ⊛_α v_3).   (2.16)
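For H_α = R^n, the action of elementary (rank-one) matrices described above reduces to ordinary linear algebra. The following NumPy lines (an illustration only, with arbitrary test vectors) verify the identities for ∗ and, with the Hadamard product as ⊛_α, for ~:

```python
import numpy as np

n = 5
v1, v2, v3, v4 = (np.random.rand(n) for _ in range(4))

A = np.outer(v1, v2)   # elementary matrix v1 (x) v2
B = np.outer(v3, v4)   # elementary matrix v3 (x) v4

# matrix-matrix product: (v1 (x) v2) * (v3 (x) v4) = <v2, v3> * (v1 (x) v4)
assert np.allclose(A @ B, np.dot(v2, v3) * np.outer(v1, v4))

# matrix-vector product: (v1 (x) v2) * v3 = <v2, v3> * v1
assert np.allclose(A @ v3, np.dot(v2, v3) * v1)

# Hadamard-type action ~ : (v1 (x) v2) ~ v3 = v1 (x) (v2 .* v3)
# (broadcasting multiplies each column of A entrywise by v3)
assert np.allclose(A * v3, np.outer(v1, v2 * v3))
```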

2.3.3 Multisets and Singleton Domain Functions

In multisets, elements are allowed to appear multiple times (as in Eq. (2.17) below). Each one can hence be interpreted as a function f : S → N_0 that maps each element in an ordinary set S to its multiplicity within the multiset. The union of multisets, their difference and the restriction to a set U ⊂ S are defined via
(f ⊎ g)(s) := f(s) + g(s),   (f \ g)(s) := max(0, f(s) − g(s)),
f|_U(s) := f(s) if s ∈ U, and 0 otherwise,   ∀ s ∈ S,


respectively. The cardinality of f is defined as |f| = Σ_{s∈S} f(s). By writing f = (α ↦ i), we denote a function with singleton domain, f : {α} → N_0, f(α) := i. It is often more convenient to write out f, based on the definition
{α, ..., α}_# = ⊎_{ℓ=1}^{i} {α}_# := (α ↦ i),
where the index # distinguishes multisets from ordinary sets. These notations can essentially be summarized through
(α ↦ 2) ⊎ (β ↦ 1) = {α, α, β}_#.   (2.17)
At any time, an ordinary set may be interpreted as a multiset, such that {α} ⊎ {α} = {α, α}_#.
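The multiset operations of this subsection correspond to standard counter arithmetic; a small Python sketch (using collections.Counter as a stand-in for the multiplicity functions f and g, with hypothetical label names):

```python
from collections import Counter

f = Counter({'alpha': 2, 'beta': 1})   # the multiset {alpha, alpha, beta}#
g = Counter({'alpha': 1})              # the multiset {alpha}#

union = f + g                          # (f ⊎ g)(s) = f(s) + g(s)
difference = f - g                     # (f \ g)(s) = max(0, f(s) - g(s))
restriction = Counter({s: f[s] for s in f if s in {'alpha'}})  # f restricted to U = {alpha}

print(union)            # Counter({'alpha': 3, 'beta': 1})
print(difference)       # Counter({'alpha': 1, 'beta': 1})
print(sum(f.values()))  # cardinality |f| = 3
```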

2.3.4 Singleton Node Arithmetic

We define the i-fold tensor product Hilbert space
H_α^{⊗i} := H_α ⊗ ... ⊗ H_α (i times),   H_α^{⊗0} := R.   (2.18)

There is a pattern emerging from the fundamental operations on vector and matrix spaces in Sections 2.3.1 and 2.3.2. In the following definition, the previously carried out calculus is merged together using different indices to represent different multiplications. These correspond as follows:
⊙^∅_α ≡ ⟨·, ·⟩_α (contract),   ⊙^α_∅ ≡ ⊛_α (keep),   ⊙^∅_∅ ≡ ⊗.
As indicated in Sections 2.1.1 and 2.1.2, we will not only use α as an index of the Hilbert space H_α, but formally assign it as a mode label. For a (real) Hilbert space H_α, we introduce the univariate tensor node space with mode label α as
N_α := N(H_α) := ⋃_{i∈N_0} H_α^{⊗i} × {α ↦ i}.
To each i-fold Hilbert space, we hence attach the multiset (α ↦ i) = {α, ..., α}_# of cardinality i. This pairing is a bookkeeping trick, such that when context is provided, and in particular in Sections 2.5 and 3.4 and Chapters 4 to 8, we no longer distinguish between H_α^{⊗i} × {α ↦ i} and H_α^{⊗i}⁵. We will further only make direct use of these spaces for i ∈ {0, 1, 2} (scalars, vectors and matrices), whereas tensors have mixed mode labels.

Remark 2.7. For spaces for which no ⊛-product is defined, the set k as in Definition 2.8 below is assumed to be empty. The same is assumed for all forthcoming definitions without further mention.

Definition 2.8 (Singleton node product). Let c and k be two sets for which c ⊎ k ⊂ {α}_#⁶. We define
⊙^k_c : N_α × N_α → N_α
via pairs of elementary tensors and label sets
(v_1 ⊗ ... ⊗ v_{f(α)}, f) ⊙^k_c (v_{f(α)+1} ⊗ ... ⊗ v_{f(α)+g(α)}, g) := (u, h),

⁵Just as one seldom writes out the pair (V, ‖·‖_V) to describe a normed vector space.
⁶So either k = ∅, c = ∅, or k = {α}, c = ∅, or k = ∅, c = {α}.


for v_i ∈ H_α, i = 1, ..., f(α) + g(α), subject to the following rules:

• If f(α) = 0 or g(α) = 0, then u is the usual product of a scalar and a tensor, with (unmodified) labels h := (α ↦ f(α) + g(α)).
• Otherwise,
u := ⊗_{s=1}^{f(α)−1} v_s ⊗ Φ(v_{f(α)}, v_{f(α)+1}) ⊗ ⊗_{s=f(α)+2}^{f(α)+g(α)} v_s
and h := (α ↦ f(α) + g(α) − s) for
(Φ, s) = (⟨·, ·⟩_α, 2) if α ∈ c,   (⊛_α, 1) if α ∈ k,   (⊗, 0) otherwise.

Within each Hilbert space H_α^{⊗i}, the product ⊙^k_c on the whole space is induced through multilinearity and continuity (cf. Lemma 2.6).

We have seen several examples for this function in Sections 2.3.1 and 2.3.2. The multiplication ⊙^∅_α for

• f = {α}_# and g = {α}_# is shown in Eq. (2.10) and results in u = ⟨v_1, v_2⟩_α, h = ∅;
• f = {α, α}_# and g = {α, α}_# is shown in Eq. (2.14) and results in u = ⟨v_2, v_3⟩ (v_1 ⊗ v_4), h = {α, α}_#;
• f = {α, α}_# and g = {α}_# is shown in Eq. (2.15) and results in u = v_1 · ⟨v_2, v_3⟩, h = {α}_#.

The multiplication ⊙^α_∅ for

• f = {α}_# and g = {α}_# is shown in Eq. (2.13) and results in u = v_1 ⊛_α v_2, h = {α}_#;
• f = {α, α}_# and g = {α}_# is shown in Eq. (2.16) and results in u = v_1 ⊗ (v_2 ⊛_α v_3), h = {α, α}_#.
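For H_α = R^n with the Hadamard product as ⊛_α, the three possible outcomes of the singleton node product for two vector-valued nodes (contract, keep, or plain tensor product) can be sketched in a few lines of Python; the function name and its mode argument are hypothetical, purely for illustration:

```python
import numpy as np

def singleton_product(v, w, mode):
    """Vector-vector case of the singleton node product (illustrative sketch).

    mode = 'contract' : alpha in c, scalar product  -> scalar  <v, w>
    mode = 'keep'     : alpha in k, Hadamard        -> vector  v .* w
    mode = 'tensor'   : alpha kept twice            -> matrix  v (x) w
    """
    if mode == 'contract':
        return np.dot(v, w)
    if mode == 'keep':
        return v * w
    return np.outer(v, w)

v, w = np.random.rand(4), np.random.rand(4)
print(singleton_product(v, w, 'contract'))        # scalar: multiplicity of alpha drops to 0
print(singleton_product(v, w, 'keep').shape)      # (4,):  multiplicity stays 1
print(singleton_product(v, w, 'tensor').shape)    # (4, 4): multiplicity becomes 2
```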

The product allows for comprehensive rules regarding commutativity (ordering) and associativity (change of parentheses):

Lemma 2.9 (Commutativity). Let N_1, N_2 ∈ N_α, for N_i = (v_i, f_i), i = 1, 2. If f_1(α) ≤ 1 and f_2(α) ≤ 1, then
N_1 ⊙^k_c N_2 = N_2 ⊙^k_c N_1,
given k ⊎ c = {α}⁷.

Proof. Let f_1(α) = 1 and f_2(α) = 1; otherwise, there is nothing to show. If k = {α}, c = ∅, then
N_1 ⊙^k_c N_2 = (v_1 ⊛_α v_2, {α}_#) = (v_2 ⊛_α v_1, {α}_#) = N_2 ⊙^k_c N_1   (by Def. 2.8),
since ⊛_α is by assumption commutative (cf. Section 2.3.1). If k = ∅, c = {α}, then
N_1 ⊙^k_c N_2 = (⟨v_1, v_2⟩_α, ∅) = (⟨v_2, v_1⟩_α, ∅) = N_2 ⊙^k_c N_1   (by Def. 2.8),
since ⟨·, ·⟩_α is a scalar product, hence symmetric.

⁷So either k = {α}, c = ∅ or k = ∅, c = {α}.


Associativity can be more complicated, but some rules take simpler forms, as in the later Proposition 2.17. The visualization becomes very helpful here, as a product is associative if and only if it can consistently be drawn as a network like in Section 2.3.2. A product is commutative, on the other hand, if the order of nodes in the network is irrelevant (cf. Fig. 2.2).

Lemma 2.10 (Associativity). Let N_1, N_2, N_3 ∈ N_α, for N_i = (v^{(i)}, f_i) with f_i(α) > 0, i = 1, 2, 3.

• If f_2(α) ≥ 2, then for all k ⊎ c, h ⊎ e ⊂ {α} it holds that
(N_1 ⊙^k_c N_2) ⊙^h_e N_3 = N_1 ⊙^k_c (N_2 ⊙^h_e N_3).   (2.19)
• If f_2(α) = 1, then
(N_1 ⊙^α_∅ N_2) ⊙^α_∅ N_3 = N_1 ⊙^α_∅ (N_2 ⊙^α_∅ N_3),   (2.20)
(N_1 ⊙^α_∅ N_2) ⊙^∅_α N_3 = N_1 ⊙^∅_α (N_2 ⊙^α_∅ N_3).   (2.21)

Proof. Let v^{(i)} = v^{(i)}_1 ⊗ ... ⊗ v^{(i)}_{f_i(α)}, i = 1, 2, 3. The first case, Eq. (2.19), follows by definition, since the corresponding parts of the elementary tensors do not interact,
(N_1 ⊙^k_c N_2) ⊙^h_e N_3 = ( ⊗_{i=1}^{f_1(α)−1} v^{(1)}_i ⊗ Φ_1(v^{(1)}_{f_1(α)}, v^{(2)}_1) ⊗ ⊗_{i=2}^{f_2(α)−1} v^{(2)}_i ⊗ Φ_2(v^{(2)}_{f_2(α)}, v^{(3)}_1) ⊗ ⊗_{i=2}^{f_3(α)} v^{(3)}_i,  {α, ..., α}_# ) = N_1 ⊙^k_c (N_2 ⊙^h_e N_3),
where Φ_1 and Φ_2 depend on k, c and h, e, respectively. The second one, Eq. (2.20), follows by associativity of ⊛_α (cf. Section 2.3.1),
(N_1 ⊙^α_∅ N_2) ⊙^α_∅ N_3 = ( ⊗_{i=1}^{f_1(α)−1} v^{(1)}_i ⊗ (v^{(1)}_{f_1(α)} ⊛_α v^{(2)}_1 ⊛_α v^{(3)}_1) ⊗ ⊗_{i=2}^{f_3(α)} v^{(3)}_i,  {α, ..., α}_# ) = N_1 ⊙^α_∅ (N_2 ⊙^α_∅ N_3),
where we used that (v^{(1)}_{f_1(α)} ⊛_α v^{(2)}_1) ⊛_α v^{(3)}_1 = v^{(1)}_{f_1(α)} ⊛_α (v^{(2)}_1 ⊛_α v^{(3)}_1). The last one, Eq. (2.21), is analogous to the previous assertion, and relies on the assumption Eq. (2.12), which here yields ⟨v^{(1)}_{f_1(α)} ⊛_α v^{(2)}_1, v^{(3)}_1⟩_α = ⟨v^{(1)}_{f_1(α)}, v^{(2)}_1 ⊛_α v^{(3)}_1⟩_α.

Note that there are similar, easy rules in case of f_i(α) = 0 for one i ∈ {1, 2, 3}, since then one of the factors is a scalar.

2.3.5 Label Set Node Arithmetic

The key idea of the tensor node arithmetic is to let the multiplication act independently on tensor products of different Hilbert spaces, which here means that each such space is labeled with a different mode label.

Remark 2.11. From now on, the previous singleton α usually denotes an ordered label set α = {α_1, ..., α_d}, with an abstract ordering α_1 < ... < α_d. Further, each H_γ, γ ∈ α, is a (real) Hilbert space with a pair (⟨·, ·⟩_γ, ⊛_γ) that fulfills the assumptions carried out in Section 2.3.1.


What has previously been carried out for the singleton α is now generalized to the mode label set α. We use further Greek letters, in particular β and γ, for other mode labels, some of which often carry particular roles. For any multiset of mode labels f : α → N_0, we define
H_f := H_{α_1}^{⊗f(α_1)} ⊗ ... ⊗ H_{α_d}^{⊗f(α_d)}.   (2.22)
So, for example, it is H_{{α_1,α_1,α_2}_#} = H_{α_1}^{⊗2} ⊗ H_{α_2}. The tensor node space as defined below is the foundation for tensor networks, which in this context are collections of tensor nodes that are interlinked through the action of the node product.

Definition 2.12 (Tensor node space). We define the space of tensor nodes (with mode label(s) α) as
N_α := N(H_{α_1}, ..., H_{α_d}) := ⋃_{f ∈ N_0^α} H_f × {f}.   (2.23)
The space N_0^α is the set of all functions (multisets) f : {α_1, ..., α_d} → N_0.

The summation of two elements in N_α with equal mode labels is straightforward,
a(v_1, f) + b(v_2, f) := (a v_1 + b v_2, f) ∈ N_α,   (2.24)
for a, b ∈ R. Thus, N_α is a collection of Hilbert spaces H_f × {f}⁸.

Definition 2.13 (Tensor node product). Let C ⊎ K ⊂ α. We define
⊙^K_C : N_α × N_α → N_α,
for (semi-)elementary tensors
(v^{(1)} ⊗ ... ⊗ v^{(d)}, f) ⊙^K_C (w^{(1)} ⊗ ... ⊗ w^{(d)}, g) := (u^{(1)} ⊗ ... ⊗ u^{(d)}, h),
(u^{(i)}, h|_{α_i}) := (v^{(i)}, f|_{α_i}) ⊙^{K|_{α_i}}_{C|_{α_i}} (w^{(i)}, g|_{α_i}),
for all v^{(i)} ∈ H_{α_i}^{⊗f(α_i)}, w^{(i)} ∈ H_{α_i}^{⊗g(α_i)} and u^{(i)} ∈ H_{α_i}^{⊗h(α_i)}, i = 1, ..., d, and f, g, h ∈ N_0^α.

Since for fixed f and g, the product ⊙^K_C is a composition of functions as in Lemma 2.6, it induces a unique, continuous, multilinear function on the whole of H_f × H_g. Thereby, it is unique and well-defined on the whole of N_α × N_α. This product is related to partial scalar products as in [46, Section 4.5.4]⁹. Let v_1, w_1 ∈ H_{α_1} and v_2 ∈ H_{α_2} as well as y_2 ∈ H_{α_2}^{⊗2}. The following two examples, each depicted as a diagram in the thesis, demonstrate Definition 2.13:
(v_1 ⊗ v_2) ⊙^{α_1}_{α_2} (w_1 ⊗ y_2) = (v_1 ⊛_{α_1} w_1) ⊗ (v_2 ∗ y_2),
(v_1 ⊗ v_2) ⊙^{α_2}_{α_1} (w_1 ⊗ y_2) = ⟨v_1, w_1⟩ · (v_2 ~ y_2).

The second component in N_α solely keeps track of the multiplicities of the Hilbert spaces and is, as mentioned above, a bookkeeping trick. In particular in later sections, we hold on to the following notations:

Notation 2.14 (Mode labels and node value). Let N = (v, f) ∈ N_α. We define
m(N) := f

⁸From an implementational perspective, it is interesting that one may define a free vector space over N_α modulo the addition as in Eq. (2.24). However, we are not going to make use of it.
⁹Therein, Corollary 4.131 in fact implicitly makes use of globally assigned labels contained in D.


as the mode labels of the tensor node N and v as the value of N. When context is clear, or v ∈ R, we identify N = v as well as H_f × {f} ⊂ N_α with H_f. Furthermore, we may define the mode labels of a node and the space it belongs to implicitly by writing N = N(f) ∈ H_f. For any mode label (or set of mode labels) f,
d(f) := dim(H_f) ∈ N_0 ∪ {∞}
denotes the dimension of the Hilbert space H_f.

In the finite-dimensional case, it thus holds that d(α) = Π_{µ=1}^{d} d(α_µ).

For example, by writing N = N(α_1, α_1, α_2) ∈ R^{n_1×n_1×n_2}, we denote N ∈ N(H_{α_1}, H_{α_2}) for H_{α_1} = R^{n_1} and H_{α_2} = R^{n_2}, such that d(α_1) = n_1, d(α_2) = n_2 ∈ N. Then further, m(N) = f = {α_1, α_1, α_2}_#, that is m(N)(α_1) = f(α_1) = 2 and m(N)(α_2) = f(α_2) = 1.

In particular, nodes N = N(γ) with a single mode label, or nodes N = N(γ, γ) with one duplicate label, γ ∈ α, can be treated as an ordinary vector or matrix, respectively, as long as there is no ambiguity about their modes.
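The bookkeeping of Notation 2.14 amounts to storing, next to the value of a node, the multiset of its mode labels. A minimal Python sketch of this pairing (the class and attribute names are hypothetical and unrelated to the toolbox of Section 2.1.3):

```python
import numpy as np
from collections import Counter

class Node:
    """A tensor value paired with its mode-label multiset m(N)."""
    def __init__(self, value, labels):
        self.value = np.asarray(value)
        self.labels = Counter(labels)   # m(N), e.g. {alpha1, alpha1, alpha2}#

n1, n2 = 3, 4
N = Node(np.random.rand(n1, n1, n2), ['alpha1', 'alpha1', 'alpha2'])

print(N.labels['alpha1'])   # multiplicity m(N)(alpha1) = 2
print(N.labels['alpha2'])   # multiplicity m(N)(alpha2) = 1
print(N.value.shape)        # dimensions of the underlying Hilbert space H_f
```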

The previous commutativity and associativity rules are, as ⊙ acts independently on differently labeled Hilbert spaces, directly transferred to the space N_α. For example, for N_1 = N_1(α_1, α_2), N_2 = N_2(α_1, α_2, α_2) and N_3 = N_3(α_1), we have
(N_1 ⊙^{α_1}_{α_2} N_2) ⊙_{{α_1,α_2}} N_3 = N_1 ⊙^∅_{{α_1,α_2}} (N_2 ⊙^{α_1}_{α_2} N_3).

Figure 2.2: Visualization of B = ⊙^∅_α(N_1, N_2, N_3) = (N_1 ⊙^{α_2}_{α_1} N_2) ⊙^∅_{{α_1,α_2}} N_3 = N_1 ⊙^∅_{{α_1,α_2}} (N_2 ⊙^{α_2}_{α_1} N_3), B = B(α_2, α_2), for N_1 = N_1(α_1, α_2, α_2), N_2 = N_2(α_1, α_1, α_2) and N_3 = N_3(α_1, α_2, α_2). The upper gray layer illustrates the independent multiplication for α_1, the lower one for α_2.

In the following definition, we extend the product to s-tuples of nodes as in the exemplary Fig. 2.2. The rules of this multiplication are in principle simple, and remain isolated regarding different mode labels. All common mode labels lead to one contraction, if not separated through a node with duplicate mode labels (as by N_2 regarding α_1 in Fig. 2.2, or N_1, N_3 regarding α_2), analogously to how matrices act, as described in Section 2.3.2. The formal definition is slightly technical:

Definition 2.15 (Tensor node products on tuples). Let C ⊎ K ⊂ α. We define
⊙^K_C : N_α × ... × N_α → N_α
for N_i = (v_i, f_i), i = 1, ..., s, via
⊙^K_C(N_1, ..., N_s) = ( ... ( N_1 ⊙^{k_1∪K}_{C\k_1} N_2 ) ⊙^{k_2∪K}_{C\k_2} ... ) ⊙^{k_{s−1}∪K}_{C\k_{s−1}} N_s,
where k_i = ∅ if i = s − 1 and otherwise
k_i(α_µ) = 0 if f_1(α_µ) = ... = f_i(α_µ) = 0,  0 if f_{i+1}(α_µ) ≠ 1,  0 if f_{i+2}(α_µ) = ... = f_s(α_µ) = 0,  and 1 otherwise,   (2.25)


for µ = 1, ..., d and i = 1, ..., s − 2.¹⁰

For a pair of nodes, s = 2, the product remains the same as its earlier definition: N_1 ⊙^K_C N_2 = ⊙^K_C(N_1, N_2), since then k_1 = ∅. In the example in Fig. 2.2, we have that k_1 = {α_2} and k_2 = ∅. Mostly, all modes common between two nodes are supposed to be contracted. We define this as the default case:

Definition 2.16 (Default cases). Let N = (N_1, ..., N_s) ∈ N_α^s be an s-tuple of tensor nodes. We define
⊙(N) := ⊙^∅_M(N),   ⊙_C(N) := ⊙^∅_C(N),   ⊙^K(N) := ⊙^K_{M\K}(N),
⊙_{\P}(N) := ⊙^∅_{M\P}(N),   ⊙^K_{\P}(N) := ⊙^K_{M\(K∪P)}(N),
for
M(α_µ) = 1 if Σ_{i=1}^{s} m(N_i)(α_µ) ≥ 2, and 0 otherwise.

In principle, the definition of ⊙ also allows to just set M := α, yielding the same result in all of the above cases. We arrive at one of the main motivations for the tensor node arithmetic — the case we face in Chapter 3 when discussing tree tensor networks. In Section 2.7.2, we further show that any product can be traced back to this situation.

Proposition 2.17. Let N_1, ..., N_s ∈ N_α be tensor nodes and K ⊂ α. If m(N_i)(γ) ≤ 1 for all γ ∈ α and further Σ_{i=1}^{s} m(N_i)(γ) ≤ 2 for all γ ∉ K, i = 1, ..., s, then
⊙^K(N_1, ..., N_s) = N_{π(1)} ⊙^K ... ⊙^K N_{π(s)}
for all permutations π, which means that the product is fully associative and commutative.

Proof. Let N_i = (v_i, f_i), f_i = m(N_i), for i = 1, ..., s. First, we have k_i = ∅ for each such set appearing in Definition 2.15, since not all three conditions for k_i(α_µ) = 0 in Eq. (2.25) can be false (as α_µ only appears in at most two nodes). Thereby, by definition,
⊙^K(N_1, ..., N_s) = ( ( ⊙^K(N_1, ..., N_{s−2}) ) ⊙^K N_{s−1} ) ⊙^K N_s = ( ... ( N_1 ⊙^K N_2 ) ⊙^K ... ) ⊙^K N_s.   (2.26)
We define the partial products H_1 := ⊙^K(N_{τ(1)}, ..., N_{τ(i)}), H_2 := ⊙^K(N_{τ(i+1)}, ..., N_{τ(j)}) and H_3 := ⊙^K(N_{τ(j+1)}, ..., N_{τ(ℓ)}) for 1 ≤ i < j < ℓ ≤ s and an arbitrary permutation τ. For each single µ = 1, ..., d, at least one of m(H_1), m(H_2) and m(H_3) does not contain the mode label α_µ. Since thereby (trivial, mode label-wise) associativity rules apply, we have
(H_1 ⊙^K H_2) ⊙^K H_3 = H_1 ⊙^K (H_2 ⊙^K H_3),
and all brackets in Eq. (2.26) can be rearranged. Further, any two N_i and N_j fulfill the conditions for Lemma 2.9 in each mode label (considering the default case multiplication in Definition 2.16). Thereby,
N_i ⊙^K N_j = N_j ⊙^K N_i,   ∀ 1 ≤ i ≠ j ≤ s.
The order of nodes can hence likewise be rearranged. This was to be shown.

¹⁰The sets k_i are not uniquely determined, considering that for example (a, ∅) ⊙_{α_1} (v, {α_1}_#) = (a, ∅) ⊙^{α_1} (v, {α_1}_#), for a ∈ R, v ∈ H_{α_1}.


There is a reason why we have used the same symbol for both the element, the singleton and the set α:

Remark 2.18 (Division of mode labels). Given the single H_{α_1}, ..., H_{α_d}, the tensor product space H_α = H_{α_1} ⊗ ... ⊗ H_{α_d} is again a Hilbert space with an induced scalar product and ⊛_α product. This gives us a simple embedding N(H_α) → N_α = N(H_{α_1}, ..., H_{α_d}). We hence do not necessarily need to distinguish whether α is one label or a set of labels, in the sense that the embedding commutes with the node products: applying ⊙^k_c in N(H_α) × N(H_α) and then embedding the result into N_α yields the same node as first embedding both factors into N_α × N_α and then applying ⊙^k_c there, for k ⊎ c ⊂ {α}_#.

This assertion can consistently be visualized as a subdivision of legs: the node N = N(α), drawn with a single leg α, equals N = N(α_1, ..., α_d), drawn with d separate legs α_1, ..., α_d.

So far, we have considered one superset of mode labels α = {α_1, ..., α_d}. Other, distinct mode label sets such as β are handled as follows.

Notation 2.19 (Distinct mode labels). When we write α < β, then this implies the abstract ordering α_i < β for all i. Further, once β = {β_1, ..., β_k} is treated as a set, then this means α_i < β_j < β_{j+1} for all i, j and in particular that the two mode label (set)s are distinct, |α ∪ β| = d + k. If not otherwise implied, we assume different (Greek) letters to denote formally different mode labels.

Since the embeddings N_α, N_β → N_{α∪β} are trivial, we will skip this formal step when introducing or multiplying two corresponding nodes¹¹. When we are only interested in the structure of a multiplication, we do not necessarily have to specify the involved Hilbert spaces, since in that case, only the mode labels are relevant.

Let λ be a further distinct mode label, β < λ. The default multiplication of two nodes N_1 = N_1(α, β) = N_1(α_1, ..., α_d, β_1, ..., β_s) and N_2 = N_2(β, λ), since α, β and λ are pairwise disjoint, then gives a node T := N_1 ⊙ N_2 = N_1 ⊙_β N_2, with T = T(α, λ). In diagrams, such as Fig. 2.3, we use shaded areas to visualize (partial) products, where the multiplication is uniquely defined through the legs attached to the nodes, or in other words, through the corresponding graph (formalized in Section 3.1). Given such a graph, we may not even have to specify mode labels.

Figure 2.3: The graph corresponding to a product of two nodes N_1 and N_2, connected via the shared legs β, with the outer legs α and λ remaining open.

The matrix transpose is generalized to nodes with duplicate mode labels (i.e. if m(N)(γ) > 1 for some γ ∈ α).

11Just like we multiply polynomials in x and y.


Definition 2.20 (Node transpose). We define the transpose of a node N = (v, f) ∈ N_α, v = ⊗_{µ=1}^{d} ⊗_{s=1}^{f(α_µ)} v^{(µ)}_s, via
N^T := ( ⊗_{µ=1}^{d} ⊗_{s=1}^{f(α_µ)} v^{(µ)}_{f(α_µ)+1−s},  f ) ∈ N_α.

We have the usual rule (N ⊙ M)^T = M^T ⊙ N^T for all N, M ∈ N_α. Here, the transpose is just a reversion of order, acting separately for each mode label (the ordering of the mode labels itself is not changed)¹².

Definition 2.21 (Norm of a node). We define the norm of a node N = (v, f) ∈ N_α via ‖N‖ = ‖v‖_f, where ‖·‖_f is the induced norm in H_f = H_{α_1}^{⊗f(α_1)} ⊗ ... ⊗ H_{α_d}^{⊗f(α_d)}.

A meaningful consequence is the simple relation between norm and node product:

Lemma 2.22. Let N ∈ N_α with m(N) ⊂ α. Then ‖N‖² = N ⊙ N.

Proof. Let N = (v, f). Then the assertion follows directly by N ⊙ N = (⟨v, v⟩_f, ∅), where ⟨·, ·⟩_f is the induced scalar product on H_f (cf. Remark 2.18) (note that we treat nodes with empty mode label sets as scalars).

Writing ‖N‖² = N^T ⊙ N may be more familiar. If, as in the above case, m(N) ⊂ α, then this holds true due to N^T = N.

2.3.6 Extension of (Multi-)Linear Operators to Tensor Nodes

Maps that act with respect to (a multiple of) only a single Hilbert space H_{α_µ} can be extended to N_α in a straightforward way:

Lemma 2.23 (Linear extension to tensor nodes). Let i, j ∈ N_0, µ ∈ {1, ..., d} and let ℓ_{α_µ} be a continuous, linear operator,
ℓ_{α_µ} : S → H_{α_µ}^{⊗j}
for a (dense) subspace S ⊆ H_{α_µ}^{⊗i}. Then this operator can (uniquely) be extended to a mapping
L_{α_µ} : {N ∈ N_α | m(N)(α_µ) ≥ i} → {N ∈ N_α | m(N)(α_µ) ≥ j}
such that it acts continuously and multilinearly on nodes with equal mode labels and it holds that
L_{α_µ}( (v, α_µ ↦ i) ⊙_∅ M ) := (ℓ_{α_µ}(v), α_µ ↦ j) ⊙_∅ M
for all (v, α_µ ↦ i), M ∈ N_α.

Proof. Follows directly by Lemma 2.5 with regard to the isomorphism Eq. (2.2).

For any N with α_µ ∉ m(N), we further set L_{α_µ}(N) := N. If the operator has analogous counterparts for other mode labels, then for any γ = {γ_1, ..., γ_k} ⊂ α, we can define
L_γ := L_{γ_1} ∘ ... ∘ L_{γ_k}.

¹²In the generalizing Section 2.8.1, this transpose would act as the adjoint on the functions discussed therein.


The operator L_α is then essentially the one induced through ℓ_α := ⊗_{µ=1}^{d} ℓ_{α_µ}, in the sense that for a node N = (v, f), for conforming mode label sets f and f̃, we have
L_α(N ⊙_∅ M) = (ℓ_α(v), f̃) ⊙_∅ M.
We may omit the index γ in L_γ if we mean to apply it with respect to all mode labels within γ which appear with the required multiplicity. In Section 2.4.5, we use the above construction to define the partial summation, trace and diagonal operations. With Lemma 2.6, similar extensions are possible for multilinear forms, but we do not carry this out explicitly. It is an obvious step at this point to start assigning mode labels also to functions which are not Hilbert-Schmidt operators¹³ and thereby to generalize the so far carried out arithmetic. We consider this briefly in Section 2.8.1.

2.4 Tensor Node Arithmetic for Function Spaces

An important class of Hilbert spaces are function spaces H_{α_µ} ⊂ R^{Ω_{α_µ}}, where Ω_{α_µ} ⊂ R denotes the domain associated to α_µ, µ = 1, ..., d. The Euclidean spaces H_{α_µ} = R^{n_µ}, n_µ ∈ N, with domain Ω_{α_µ} = {1, ..., n_µ}, µ = 1, ..., d, take a particular role considering isometries as in Section 2.4.1. The tensor product space can be interpreted as
H_α = H_{α_1} ⊗ ... ⊗ H_{α_d} ⊂ R^{Ω_α},   Ω_α = Ω_{α_1} × ... × Ω_{α_d},
such that for elementary tensor products of functions, we have that
(f^{(1)} ⊗ ... ⊗ f^{(d)})(x_1, ..., x_d) = f^{(1)}(x_1) · ... · f^{(d)}(x_d).
In particular, as introduced in Section 1.2,
R^{n_1} ⊗ ... ⊗ R^{n_d} ≅ R^{n_1·...·n_d}.
In these cases, the mode label α_µ can also be interpreted as an index or variable, to which we can assign a value within the domain Ω_{α_µ}, as carried out in Section 2.4.2.

2.4.1 Isometries to Finite-Dimensional Hilbert Spaces

Let H_{α_µ} be Hilbert spaces of finite dimensions n_µ = d(α_µ) ∈ N, µ = 1, ..., d. For each one, there is an isometry,
φ_µ : R^{n_µ} → H_{α_µ},   ⟨φ_µ(x), φ_µ(y)⟩_{α_µ} = ⟨x, y⟩_{R^{n_µ}} = x^T y,
for all x, y ∈ R^{n_µ}. The ⊛_{α_µ} product, if defined on H_{α_µ}, consequently yields a new product ⊛_{R^{n_µ}} on R^{n_µ}, given by
⊛_{R^{n_µ}} : R^{n_µ} × R^{n_µ} → R^{n_µ},   x ⊛_{R^{n_µ}} y := φ_µ^{−1}( φ_µ(x) ⊛_{α_µ} φ_µ(y) ).
This new product is not necessarily the Hadamard product, but fulfills the required assumption Eq. (2.12) also on R^{n_µ}. However, we will only consider situations in which ⊛ (in either R^{n_µ} or H_{α_µ}) is indeed the entrywise multiplication.

13Hence, those which cannot be identified as tensor nodes.


The single isometries φ_µ in turn form isometries φ_J := ⊗_{µ∈J} φ_µ, J ⊂ D := {1, ..., d}, on tensor products of these Hilbert spaces. In particular for φ := φ_D, we have
φ : R^{n_1×...×n_d} → H_{α_1} ⊗ ... ⊗ H_{α_d},   φ(t) = Σ_{i_1=1}^{n_1} ... Σ_{i_d=1}^{n_d} t_{i_1,...,i_d} · φ_1(e_{i_1}) ⊗ ... ⊗ φ_d(e_{i_d}).

The tensor (or hypermatrix) t can then again be assigned a mode label, or mode labels, respectively. We will further discuss such isometries under more explicit use in Section 6.2.2, in which we then assign mode labels to the discretization space R^{n_1×...×n_d} as well.

2.4.2 Indexing and Restrictions

We have already indicated in the introductory Section 2.1.2 that we access entries or parts of nodes using mode labels, as we formally define in Definition 2.25 below. If for example N = (t, {α_1, α_1, α_2}_#), t ∈ R^{n_1×n_1×n_2} and x ∈ {1, ..., n_1}, y ∈ {1, ..., n_1}, z ∈ {1, ..., n_2}, we have
N(α_1 = x, α_2 = z) = (t̃, {α_1}_#),   (2.27)
where t̃_y = t_{x,y,z}, t̃ ∈ R^{n_1}. The duplicate mode label α_1 is accessed in order from left to right,
N(α_1 = x, α_1 = y) = (t̃, {α_2}_#),
where t̃_z = t_{x,y,z}, t̃ ∈ R^{n_2}. When restricting a mode label to a subset S ⊂ Ω_{α_µ}, the Hilbert space associated to α_µ would have to switch from a space over Ω_{α_µ} to one over S. In order to avoid this problem, we make use of the following convention.

Notation 2.24 (Embedding of restricted domains). Let µ ∈ {1, ..., d} be fixed. For S ⊂ Ω_{α_µ}, we treat R^S as a subset of R^{Ω_{α_µ}} subject to the trivial embedding
e_µ = e_{µ,S} : R^S → R^{Ω_{α_µ}},   e_µ(h)(x) = h(x) if x ∈ S, and 0 otherwise.
For example, given f ∈ R^S and g ∈ H_{α_µ} = R^{Ω_{α_µ}}, we may write ⟨f, g⟩_{α_µ} = ⟨e_µ(f), g⟩_{α_µ}. Note that the restriction ·|_S is the left-inverse of e_µ, that is e_µ(h)|_S = h for all h ∈ R^S. For S_1, S_2 ⊂ Ω_{α_1}, |S_1|, |S_2| ≠ 1, this allows us to write
N(α_1 ⊂ S_1, α_1 ⊂ S_2) = (t̃, {α_1, α_1, α_2}_#) ∈ N_α,
where t̃ = t|_{S_1×S_2×Ω_{α_2}} ∈ R^{S_1×S_2} ⊗ H_{α_2}.

Definition 2.25 (Restrictions). Let N = (v, f) ∈ N_α. For fixed µ ∈ {1, ..., d}, let further S_1, ..., S_k ⊂ Ω_{α_µ} be subsets for which the restrictions ·|_{S_i}, i = 1, ..., k, are well-defined. We define
N(α_µ ⊂ S_1, ..., α_µ ⊂ S_k) := (ṽ, f̃)
for elementary tensors v = ⊗_{ν=1}^{d} ⊗_{i=1}^{f(α_ν)} v^{(ν)}_i via
ṽ = ⊗_{ν=1}^{µ−1} ⊗_{i=1}^{f(α_ν)} v^{(ν)}_i ⊗ ⊗_{i=1}^{min(k, f(α_µ))} v^{(µ)}_i|_{S_i} ⊗ ⊗_{i=k+1}^{f(α_µ)} v^{(µ)}_i ⊗ ⊗_{ν=µ+1}^{d} ⊗_{i=1}^{f(α_ν)} v^{(ν)}_i,
f̃ = f \ ⊎_{i ∈ {1,...,k}, |S_i|=1} {α_µ}.


The map on the whole space is again induced through multilinearity, since the single restrictions are linear. For singletons S_µ = {x_µ}, we also write
N(α_µ = x_µ) := N(α_µ ∈ {x_µ}).
Note that restrictions to singleton sets, according to the definition above, reduce the multiplicity of the mode label α_µ.

Since restrictions on disjoint mode label sets commute, we use abbreviated forms as in the initial example Eq. (2.27). For example, for γ = {α_1, α_2},
N(γ = (x, y)) := N(α_1 = x, α_2 = y) := N(α_1 = x)(α_2 = y).   (2.28)
Note that later, when we use mode labels other than α, the expression N(γ = (x, y)) in Eq. (2.28) is still well-defined, since we assume mode labels to be ordered. Restricting a mode that does not appear in N is allowed for convenience and does, per definition, not have any effect, e.g. for m(N) = {α_1, α_1, α_2}_# as above, N(α_3 ⊂ S) = N.

The situation is in general simple when a node does not have duplicate mode labels. Let N = N(α) = N(α_1, ..., α_d), x ∈ R^{1×k}, y ∈ R^{1×(d−k)} and γ = {γ_1, ..., γ_k} ⊂ α, γ_1 < ... < γ_k, k = |γ|. Then
a := Ñ(α \ γ = y) = N(γ = x)(α \ γ = y) = N(α \ γ = y)(γ = x),
where Ñ = N(γ = x) = N(γ_1 = x_1, ..., γ_k = x_k) = N(γ_1 = x_1) ... (γ_k = x_k).

For N = N(γ, γ), γ ∈ α, we also abbreviate
N((γ, γ) = (x, y)) := N(γ = x, γ = y),
N((γ, γ) ∈ S_1 × S_2) := N(γ ∈ S_1, γ ∈ S_2).   (2.29)
Note that N(γ = x, γ = y) = (N(γ = x))(γ = y) ≠ N(γ = y, γ = x) and N(γ ∈ S_1)(γ ∈ S_2) ≠ N(γ ∈ S_1, γ ∈ S_2).

2.4.3 Unfoldings

A d-dimensional tensor t ∈ R^{n_1×...×n_d}, n_1, ..., n_d ∈ N, as in Fig. 2.4,

Figure 2.4: ([68]Kr19) Visualization of a 4-dimensional (rank 1) tensor t with mode sizes n_1 = ... = n_4 = 2.

can be reshaped into matrices
t^{(J)} ∈ R^{n_J × n_{D\J}}   (2.30)
for subsets J ⊂ D := {1, ..., d}, where n_J := Π_{j∈J} n_j.


Figure 2.5: ([68]Kr19) Matricizations (or reshapings, or unfoldings) of the 4th-order tensor t into the matrices t^{({1})} ∈ R^{n_1 × n_2 n_3 n_4}, t^{({1,3})} ∈ R^{n_1 n_3 × n_2 n_4} and t^{({1,2,4})} ∈ R^{n_1 n_2 n_4 × n_3}.

We use the same notation for unfoldings, or reshapings, as in [40], but alternatively use subsets of mode labels instead of subsets of {1, ..., d}. So once the tensor is interpreted as a node, which we here still explicitly distinguish from the tensor by means of T = T(α_1, ..., α_d) = (t, α), then simply
t^{(J)} = T^{(α_J)},   α_J := {α_j}_{j∈J}.   (2.31)
This is more convenient if the mode labels of the object to be reshaped may vary. A nice consequence is Lemma 2.27, which shows the relation between unfoldings and the tensor node product ⊙. For arbitrary Hilbert spaces H_f, f = γ^{(1)} ⊎ ... ⊎ γ^{(s)}, the unfolding operation can be defined as an isometry from the space of nodes with mode labels f to an ordinary tensor product¹⁴,
·^{(γ^{(1)}),...,(γ^{(s)})} : H_f × {f} → H_{γ^{(1)}} ⊗ ... ⊗ H_{γ^{(s)}},
but we only require it for Euclidean spaces as follows. The above, more general definition for s = 2 corresponds to the matricization operator in [46, Definition 5.3].

Definition 2.26 (Unfoldings). Let H_{α_µ} = R^{n_µ}, µ = 1, ..., d. Further, let N ∈ N_α and ⊎_{i=1}^{s} γ^{(i)} = m(N), γ^{(i)} ⊂ α. We define the unfolding of N with respect to (γ^{(1)}, ..., γ^{(s)}) as the conventional tensor¹⁵
t = N^{(γ^{(1)}),...,(γ^{(s)})} ∈ R^{d(γ^{(1)}) × ... × d(γ^{(s)})},
t_{τ_1(j^{(1)}),...,τ_s(j^{(s)})} := N(γ^{(1)} = j^{(1)}) ... (γ^{(s)} = j^{(s)}),
where d(γ^{(i)}) = Π_{ℓ=1}^{k_i} d(γ^{(i)}_ℓ), γ^{(i)} = {γ^{(i)}_1, ..., γ^{(i)}_{k_i}}, γ^{(i)}_1 < ... < γ^{(i)}_{k_i}, and each τ_i : ×_{ℓ=1}^{k_i} {1, ..., d(γ^{(i)}_ℓ)} → {1, ..., d(γ^{(i)})} is the bijection given by co-lexicographic ordering.

We used again that each γ^{(i)} is ordered as a subset of α. As γ^{(s)} = m(N) \ ⊎_{i=1}^{s−1} γ^{(i)}, this last set need not be mentioned. Duplicate mode labels in any of the γ^{(i)} are not allowed.

The case when N = N(α_µ, α_µ), H_{α_µ} = R^{n_µ}, requires particular attention. With regard to Notation 2.24, we set
N((α_µ, α_µ) ⊂ {1, ..., ñ_µ} × {1, ..., n_µ})^{(α_µ),(α_µ)} ∈ R^{ñ_µ × n_µ}   (2.32)
to not be a square matrix, if ñ_µ < n_µ.

¹⁴Which thus only makes a difference as long as it is not again interpreted as a node space by means of Notation 2.14.
¹⁵Tensor, matrix, vector or scalar.


Lemma 2.27 (Relation of unfoldings and node product). Let H_{α_µ} = R^{n_µ}, µ = 1, ..., d. Further, let N_1 = N_1(δ^{(1)}, γ), N_2 = N_2(δ^{(2)}, γ) ∈ N_α for δ^{(1)}, δ^{(2)}, γ ⊂ α. Then
(N_1 ⊙ N_2)^{(δ^{(1)}),(δ^{(2)})} = N_1^{(δ^{(1)}),(γ)} · N_2^{(γ),(δ^{(2)})} ∈ R^{d(δ^{(1)}) × d(δ^{(2)})}.

Proof. By default, N_1 ⊙ N_2 = N_1 ⊙^∅_γ N_2. The rest follows directly by the definitions.
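For Euclidean spaces, Lemma 2.27 can be checked numerically: contracting two nodes over their shared labels γ and then unfolding gives the same matrix as multiplying the two matricizations. A NumPy sketch with hypothetical labels (a, b for δ^{(1)}, c for δ^{(2)}, g, h for γ) and hypothetical shapes; note that NumPy reshapes are row-major rather than co-lexicographic, which does not affect the identity as long as both sides are flattened consistently:

```python
import numpy as np

N1 = np.random.rand(2, 3, 4, 5)   # N1(a, b, g, h) with delta1 = {a, b}, gamma = {g, h}
N2 = np.random.rand(6, 4, 5)      # N2(c, g, h)    with delta2 = {c}

# node product: contract all shared labels g, h
M = np.einsum('abgh,cgh->abc', N1, N2)

# Lemma 2.27: the (delta1),(delta2) unfolding of M equals the product of matricizations
lhs = M.reshape(2 * 3, 6)                                  # M^{(a,b),(c)}
rhs = N1.reshape(2 * 3, 4 * 5) @ N2.reshape(6, 4 * 5).T    # N1^{(a,b),(g,h)} . N2^{(g,h),(c)}
assert np.allclose(lhs, rhs)
```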

2.4.4 Node Multiplication as Summation and Integration

The multiplication of more than two nodes takes a compact form using restrictions, though it is inefficient from a computational perspective. Note that restrictions regarding mode labels which do not appear in a node do not have any effect.

Lemma 2.28 (Compact multiplication). Let H_{α_µ} = R^{n_µ}, µ = 1, ..., d. Further, let the nodes N_1, ..., N_s ∈ N_α fulfill m(N_i) ⊂ α, i = 1, ..., s, and let K ⊂ α, as well as
M := ⊙^K(N_1, ..., N_s).
Then for δ = {δ_1, ..., δ_u} = m(M) ⊂ α, δ_1 < ... < δ_u, and γ = {γ_1, ..., γ_k} = α \ δ, γ_1 < ... < γ_k, we have
M(δ = x) = Σ_{y_1=1}^{d(γ_1)} ... Σ_{y_k=1}^{d(γ_k)} Π_{i=1}^{s} N_i(δ = x, γ = y)
for all x = (x_1, ..., x_u) ∈ ×_{i=1}^{u} {1, ..., d(δ_i)}.

Proof. Without loss of generality, we can assume K = δ. Due to multilinearity, it suffices to consider elementary tensors. For each i = 1, ..., k, let J^{(i)} = {j^{(i)}_1, j^{(i)}_2, ..., j^{(i)}_{c_i}} = {k ∈ {1, ..., s} | γ_i ∈ m(N_k)} be a sorted set of indices. Due to the given assumptions, we can write
N_i = W_i ⊙_∅ V_i,   W_i := ( ⊗_{µ ∈ {1,...,u}: δ_µ ∈ m(N_i)} w^{(i)}_µ,  δ ∩ m(N_i) ),   V_i := ( ⊗_{µ ∈ {1,...,k}: γ_µ ∈ m(N_i)} v^{(i)}_µ,  γ ∩ m(N_i) ),
for w^{(i)}_µ ∈ H_{δ_µ} and v^{(i)}_µ ∈ H_{γ_µ}. Then, for M = (v_M, f_M),
v_M = a · p := Π_{ℓ=1}^{k} ⟨ v^{(j^{(ℓ)}_1)}_ℓ ⊛ ... ⊛ v^{(j^{(ℓ)}_{c_ℓ−1})}_ℓ,  v^{(j^{(ℓ)}_{c_ℓ})}_ℓ ⟩_{γ_ℓ} · ⊗_{µ=1}^{u} ⊛_{i ∈ {1,...,s}: δ_µ ∈ m(N_i)} w^{(i)}_µ.
Since
⟨ v^{(j^{(ℓ)}_1)}_ℓ ⊛ ... ⊛ v^{(j^{(ℓ)}_{c_ℓ−1})}_ℓ,  v^{(j^{(ℓ)}_{c_ℓ})}_ℓ ⟩_{γ_ℓ} = Σ_{y_ℓ=1}^{d(γ_ℓ)} v^{(j^{(ℓ)}_1)}_ℓ|_{y_ℓ} · ... · v^{(j^{(ℓ)}_{c_ℓ})}_ℓ|_{y_ℓ}
and
Π_{i=1}^{s} V_i(γ = y) = Π_{i=1}^{s} Π_{ℓ: γ_ℓ ∈ m(N_i)} v^{(i)}_ℓ|_{y_ℓ} ∈ R,
it follows that
a = Σ_{y_1=1}^{d(γ_1)} ... Σ_{y_k=1}^{d(γ_k)} Π_{i=1}^{s} V_i(γ = y) ∈ R.
Further,
p|_x = Π_{µ=1}^{u} Π_{i: δ_µ ∈ m(N_i)} w^{(i)}_µ|_{x_µ} = Π_{i=1}^{s} W_i(δ = x).
With N_i(δ = x) = W_i(δ = x) ⊙_∅ V_i and
M(δ = x) = a · p|_x = Σ_{y_1=1}^{d(γ_1)} ... Σ_{y_k=1}^{d(γ_k)} Π_{i=1}^{s} V_i(γ = y) · W_i(δ = x),
we arrive at the given statement.

Lemma 2.28 can easily be adapted to L² functions (square-integrable real-valued functions):

Corollary 2.29. In the situation of Lemma 2.28, let instead H_{α_µ} = L²(Ω_{α_µ}) be the space of square-integrable functions, µ = 1, ..., d. Then, if the following integrals converge, we have
M(δ = x) = ∫_{Ω_{γ_1}} ... ∫_{Ω_{γ_k}} Π_{i=1}^{s} N_i(δ = x, γ = y) dy_k ... dy_1
for (almost) all x = (x_1, ..., x_u) ∈ Ω_{δ_1} × ... × Ω_{δ_u}.

One has to be careful here, since the restriction to a point N_i(δ = x, γ = y) is only meaningful within an integral, or with respect to a set of variables x with nonzero measure.
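In the discrete setting of Lemma 2.28, the compact entrywise formula is exactly what a single einsum call evaluates, also when a label is shared by more than two nodes; a short NumPy illustration with hypothetical labels a, b, c (kept) and g (shared by all three nodes, hence summed over):

```python
import numpy as np

N1 = np.random.rand(2, 5)   # N1(a, g)
N2 = np.random.rand(5, 3)   # N2(g, b)
N3 = np.random.rand(5, 4)   # N3(g, c)

# g appears in more than one node and is therefore contracted; einsum treats the
# three-fold occurrence as one joint summation over g
M = np.einsum('ag,gb,gc->abc', N1, N2, N3)

# Lemma 2.28 read entrywise: M(a=x1, b=x2, c=x3) = sum_g N1[x1,g] * N2[g,x2] * N3[g,x3]
x = (1, 2, 3)
entry = sum(N1[x[0], g] * N2[g, x[1]] * N3[g, x[2]] for g in range(5))
assert np.isclose(M[x], entry)
```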

2.4.5 Partial Summation, Trace and Diagonal Operations

Maps such as summation, traces or diagonals defined with respect to only a single Hilbert space can, as discussed in Section 2.3.6, be extended to partial operators acting on the whole of N_α. For function spaces, these particular maps can be formulated explicitly. The summation over an index is defined as
sum_{α_µ} : R^{n_µ} → R,   sum_{α_µ}(x) = x_1 + ... + x_{n_µ}.

The construction of a diagonal matrix from a vector is defined as
diag_{α_µ} : R^{n_µ} → R^{n_µ×n_µ},   diag_{α_µ}(x) = diag(x_1, ..., x_{n_µ}).

Conversely, taking the diagonal of a matrix is the above operation's left inverse and is denoted as
diag_{α_µ} : R^{n_µ×n_µ} → R^{n_µ},   diag_{α_µ}(A) = (A_{11}, ..., A_{n_µ n_µ})^T.

The trace is defined as
trace_{α_µ} : R^{n_µ×n_µ} → R,   trace_{α_µ}(A) = A_{11} + ... + A_{n_µ n_µ},
which equals the composition trace_{α_µ} = sum_{α_µ} ∘ diag_{α_µ}. These operations can also be applied to ℓ²(R), yet the summation and trace do not necessarily converge. Extended to tensor node spaces N(H_α), H_α ⊂ R^{Ω_α}, these operations can also be written explicitly as in the following Remark 2.30, and have graphical interpretations as in Fig. 2.6. We therefore recall that for a node N, the value m(N)(γ_i) is the multiplicity of γ_i in the label multiset f = m(N) (cf. the comments below Notation 2.14).


Remark 2.30 (Partial summation, diagonal and trace). Let $H_{\alpha_\mu} = \mathbb{R}^{n_\mu}$ or $H_{\alpha_\mu} = \ell^2(\mathbb{R})$, $\mu = 1,\dots,d$, as well as $N \in \mathcal{N}_\alpha$ and $\gamma = \{\gamma_1,\dots,\gamma_k\} \subset \alpha$.

• Provided $m(N)(\gamma_i) \ge 1$, $i = 1,\dots,k$, we have that
$$\operatorname{sum}_\gamma(N) = \sum_{x\in\Omega_\gamma} N(\gamma = x)$$
and $m(\operatorname{sum}_\gamma(N)) = m(N)\setminus\gamma$.

• For $m(N)(\gamma_i) \ge 2$, $i = 1,\dots,k$, it is
$$\operatorname{diag}_\gamma(N)(\gamma = x) = N((\gamma,\gamma) = (x,x)), \quad \forall x \in \Omega_\gamma,$$
and $m(\operatorname{diag}_\gamma(N)) = m(N)\setminus\gamma$.

• Provided $m(N)(\gamma_i) \ge 1$, $i = 1,\dots,k$, we have that
$$\operatorname{diag}_\gamma(N)((\gamma,\gamma) = (x,x)) = N(\gamma = x), \quad \forall x \in \Omega_\gamma,$$
$$\operatorname{diag}_\gamma(N)((\gamma,\gamma) = (x,y)) = 0, \quad \forall x \ne y \in \Omega_\gamma,$$
with $m(\operatorname{diag}_\gamma(N)) := m(N) \uplus \gamma$.

• For $m(N)(\gamma_i) \ge 2$, $i = 1,\dots,k$, it is
$$\operatorname{trace}_\gamma(N) = \sum_{x\in\Omega_\gamma} N((\gamma,\gamma) = (x,x)),$$
and $m(\operatorname{trace}_\gamma(N)) = m(N)\setminus\{\gamma,\gamma\}$.

These operations are interrelated with one another. For example, for $N_1 = N_1(\alpha,\beta)$ and $N_2 = N_2(\alpha,\beta)$, we have that
$$\operatorname{diag}_\alpha(N_1 \odot_{\setminus\alpha} N_2) = N_1 \odot^\alpha N_2 = \operatorname{diag}_\alpha(N_1) \odot N_2 = \operatorname{trace}_\beta(N_1 \odot^\alpha_\emptyset N_2), \qquad (2.33)$$
$$\operatorname{sum}_\alpha(N_1 \odot^\alpha N_2) = \operatorname{trace}_\alpha(N_1 \odot_{\setminus\alpha} N_2) = N_1 \odot N_2. \qquad (2.34)$$
The four definitions further have consistent visualizations: [diagrams of the two $\operatorname{diag}_\alpha$ operations, $\operatorname{sum}_\alpha$ and $\operatorname{trace}_\alpha$, each acting on the mode line $\alpha$ of a node]. Equations (2.33) and (2.34) are depicted in Fig. 2.6, in which the different identities are merely subject to different partitions and orderings of the connecting lines.

Figure 2.6: Left: visualization of $\operatorname{diag}_\alpha(N_1 \odot_{\setminus\alpha} N_2)$ (cf. Eq. (2.33)). Right: visualization of $\operatorname{sum}_\alpha(N_1 \odot^\alpha N_2)$ (cf. Eq. (2.34)).
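A minimal NumPy check of Eq. (2.34), with arbitrarily chosen mode sizes (the einsum index letters stand in for the labels $\alpha$ and $\beta$; this is only an illustrative sketch, not the node arithmetic itself):

```python
import numpy as np

# Toy check of Eq. (2.34) for N1 = N1(alpha, beta), N2 = N2(alpha, beta):
# sum_alpha(N1 (.)^alpha N2) = trace_alpha(N1 (.)_{\alpha-excluded} N2) = N1 (.) N2.
rng = np.random.default_rng(1)
N1 = rng.standard_normal((3, 4))
N2 = rng.standard_normal((3, 4))

full = np.einsum("ab,ab->", N1, N2)              # full contraction N1 (.) N2, a scalar
hadamard_alpha = np.einsum("ab,ab->a", N1, N2)   # label alpha survives (Hadamard in alpha)
no_alpha = np.einsum("ab,cb->ac", N1, N2)        # contract everything except alpha

assert np.isclose(hadamard_alpha.sum(), full)    # sum_alpha(...)
assert np.isclose(np.trace(no_alpha), full)      # trace_alpha(...)
```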


2.5 Common Tensor Formats and Decompositions

Throughout this section, we make use of Notation 2.14 in order to assign mode labels. Thus, for $d \in \mathbb{N}$ and Hilbert spaces $H_{\alpha_\mu}$, $\mu = 1,\dots,d$,
$$T = T(\alpha_1,\dots,\alpha_d) \in H_{\alpha_1} \otimes \dots \otimes H_{\alpha_d}$$
denotes a $d$-dimensional tensor (node) with mode labels $m(T) = \{\alpha_1,\dots,\alpha_d\}$, while formally $T \in \mathcal{N}_\alpha = \mathcal{N}(H_{\alpha_1},\dots,H_{\alpha_d})$. We further set $H_{\beta_i} = \mathbb{R}^{r_i}$, $r_i = d(\beta_i) \in \mathbb{N}$, $i \in \mathbb{N}$. We do not necessarily insist on an ordering between $\alpha$, $\beta$ and other mode label sets as long as the context allows it, but within this section the ordering $\beta_{\mu-1} < \alpha_\mu < \beta_\mu < \zeta$, for all possible $\mu$, is most natural. All following diagrams are for dimension $d = 4$.

2.5.1 Tensor Train / Matrix Product States

The tensor train (TT) decomposition [84] or matrix product states (MPS) format [100] can be written as
$$T = G_1 \odot \dots \odot G_d,$$
[diagram: a chain of nodes $G_1,\dots,G_4$ with legs $\alpha_1,\dots,\alpha_4$ and connecting edges $\beta_1,\beta_2,\beta_3$]
where $G_1 = G_1(\alpha_1,\beta_1) \in H_{\alpha_1}\otimes\mathbb{R}^{r_1}$, $G_d = G_d(\beta_{d-1},\alpha_d) \in \mathbb{R}^{r_{d-1}}\otimes H_{\alpha_d}$ and otherwise
$$G_\mu = G_\mu(\beta_{\mu-1},\alpha_\mu,\beta_\mu) \in \mathbb{R}^{r_{\mu-1}}\otimes H_{\alpha_\mu}\otimes\mathbb{R}^{r_\mu}, \qquad \mu = 2,\dots,d-1.$$
In the TT and MPS literature, a common notation involving $T$ and $G$ is $T(i_1,\dots,i_d) = G_1(i_1)\cdot\ldots\cdot G_d(i_d)$, where, translated to our setting, $T(i_1,\dots,i_d) := T(\alpha_1 = i_1,\dots,\alpha_d = i_d)$ and $G_\mu(i_\mu) := G_\mu(\alpha_\mu = i_\mu)^{(\beta_{\mu-1}),(\beta_\mu)} \in \mathbb{R}^{r_{\mu-1}\times r_\mu}$, assuming $H_{\alpha_\mu} \subset \mathbb{K}^{\Omega_{\alpha_\mu}}$. In that sense, each entry of $T$ is a matrix product (state), hence the name MPS [100]. The name tensor train [84] originates from the vivid interpretation of the edges in the above graph as couplings between train wagons. The single $G_\mu$ are often referred to as cores.
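A minimal NumPy sketch of such a TT/MPS representation (mode sizes, ranks and index letters are arbitrary illustrative choices; this is not the thesis' Matlab tensor node arithmetic), checking the entrywise matrix product formula:

```python
import numpy as np

# Random TT/MPS cores for d = 4 with mode sizes n and TT ranks r
rng = np.random.default_rng(2)
n, r = [3, 4, 5, 3], [2, 3, 2]
G1 = rng.standard_normal((n[0], r[0]))          # G1(alpha1, beta1)
G2 = rng.standard_normal((r[0], n[1], r[1]))    # G2(beta1, alpha2, beta2)
G3 = rng.standard_normal((r[1], n[2], r[2]))    # G3(beta2, alpha3, beta3)
G4 = rng.standard_normal((r[2], n[3]))          # G4(beta3, alpha4)

# Full contraction T = G1 (.) G2 (.) G3 (.) G4 over the shared beta labels
T = np.einsum("ap,pbq,qcs,sd->abcd", G1, G2, G3, G4)

# Entry T(i1,...,i4) as the matrix product G1(i1) * ... * G4(i4)
i = (1, 2, 0, 2)
entry = G1[i[0], :] @ G2[:, i[1], :] @ G3[:, i[2], :] @ G4[:, i[3]]
assert np.isclose(T[i], entry)
```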

2.5.2 Tucker / Higher-Order Singular Value Decomposition

The Tucker [95] or higher-order singular value decomposition (HOSVD) [21] can be written as
$$T = C \odot U_1 \odot \dots \odot U_d, \qquad (2.35)$$
[diagram: the core $C$ connected to $U_1,\dots,U_4$ via the edges $\beta_1,\dots,\beta_4$, with legs $\alpha_1,\dots,\alpha_4$]
where $U_\mu = U_\mu(\alpha_\mu,\beta_\mu) \in H_{\alpha_\mu}\otimes\mathbb{R}^{r_\mu}$, $\mu = 1,\dots,d$, and $C = C(\beta_1,\dots,\beta_d) \in \mathbb{R}^{r_1\times\dots\times r_d}$.

The HOSVD involves additional gauge conditions (cf. Section 3.4.2, [21]). Here, only the node $C$ is commonly referred to as core, while the $U_\mu$ are, in [40], also called mode frames. A common notation specifically for the HOSVD is $A = S \times_1 U^{(1)} \times_2 U^{(2)} \dots \times_d U^{(d)}$ [21], where here $A := T^{(\alpha_1),\dots,(\alpha_d)}$, $S := C^{(\beta_1),\dots,(\beta_d)}$ and $U^{(\mu)} := U_\mu^{(\alpha_\mu),(\beta_\mu)}$, $\mu = 1,\dots,d$, assuming $H_{\alpha_\mu} \subset \mathbb{K}^{\Omega_{\alpha_\mu}}$. In that article, the unfolding with respect to $\mu \in \{1,\dots,d\}$ is also denoted as $A_{(\mu)} := T^{(\alpha_\mu)}$.
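A minimal NumPy sketch of such a Tucker representation for $d = 4$ (mode sizes and ranks are arbitrary illustrative choices; the sketch shows a general Tucker format, not the HOSVD gauge itself):

```python
import numpy as np

# T = C x_1 U1 x_2 U2 x_3 U3 x_4 U4 with core C and mode frames U_mu
rng = np.random.default_rng(3)
n, r = [4, 3, 5, 4], [2, 2, 3, 2]
C = rng.standard_normal(r)                                   # C(beta1,...,beta4)
U = [rng.standard_normal((n[m], r[m])) for m in range(4)]    # U_mu(alpha_mu, beta_mu)

T = np.einsum("pqrs,ap,bq,cr,ds->abcd", C, *U)

# The mode-1 unfolding then factors as U1 @ C^(beta1) @ kron(U2, U3, U4)^T
T1 = T.reshape(n[0], -1)
C1 = C.reshape(r[0], -1)
K = np.kron(np.kron(U[1], U[2]), U[3])   # matches the row-major (alpha2, alpha3, alpha4) ordering
assert np.allclose(T1, U[0] @ C1 @ K.T)
```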


2.5.3 Hierarchical Tucker

The hierarchical Tucker (HT) decomposition [40,48] for a balanced, binary tree and a tensor of dimension 4 can be written as
$$T = B_{12} \odot B_{34} \odot U_1 \odot U_2 \odot U_3 \odot U_4 \qquad (2.36)$$
[diagram: the transfer tensors $B_{12}$ and $B_{34}$ joined by $\beta_5$, with $U_1,U_2$ attached to $B_{12}$ via $\beta_1,\beta_2$ and $U_3,U_4$ attached to $B_{34}$ via $\beta_3,\beta_4$; legs $\alpha_1,\dots,\alpha_4$]
for $U_\mu = U_\mu(\alpha_\mu,\beta_\mu) \in H_{\alpha_\mu}\otimes\mathbb{R}^{r_\mu}$, $\mu = 1,\dots,4$, and $B_{12} = B_{12}(\beta_1,\beta_2,\beta_5) \in \mathbb{R}^{r_1\times r_2\times r_5}$ as well as $B_{34} = B_{34}(\beta_3,\beta_4,\beta_5) \in \mathbb{R}^{r_3\times r_4\times r_5}$. If $H_{\alpha_\mu} \subset \mathbb{R}^{\Omega_{\alpha_\mu}}$, $\mu = 1,\dots,d$, such as in Section 2.4.2, then each single entry of $T$ is given by
$$T(\alpha = x) = B_{12} \odot B_{34} \odot U_1(\alpha_1 = x_1) \odot U_2(\alpha_2 = x_2) \odot U_3(\alpha_3 = x_3) \odot U_4(\alpha_4 = x_4)$$
for $x \in \Omega_\alpha$. In larger HT networks, the contraction of all inner nodes such as $B_{12}$ and $B_{34}$ yields the core $C$ of the above Tucker decomposition. Likewise, any contraction of a subset of inner nodes, as well as the separate contractions of the thereby formed, remaining branches, yields a Tucker decomposition on a different level within the implied hierarchy.

In the literature, these inner nodes are usually called transfer tensors. Note that we removed the (in later context redundant) so-called root transfer tensor $B_{1234} = B_{1234}(\beta_5,\beta_5)$ and obtain an undirected tree graph without root¹⁶. The conventional definition is instead based on a rooted tree. On each of its hierarchical levels, it utilizes the Tucker decomposition and similar notations, where [40] uses $\circ_\mu$ instead of $\times_\mu$. Our notation for matricizations stems from that article, albeit we reference with labels instead of numbers.
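A minimal NumPy sketch of the balanced binary HT format of Eq. (2.36) for $d = 4$ (mode sizes and ranks are arbitrary illustrative choices), also checking that contracting the two transfer tensors yields a Tucker core:

```python
import numpy as np

rng = np.random.default_rng(4)
n = [3, 4, 3, 5]
r = {1: 2, 2: 2, 3: 3, 4: 2, 5: 2}
U = [rng.standard_normal((n[m], r[m + 1])) for m in range(4)]     # U_mu(alpha_mu, beta_mu)
B12 = rng.standard_normal((r[1], r[2], r[5]))                     # B12(beta1, beta2, beta5)
B34 = rng.standard_normal((r[3], r[4], r[5]))                     # B34(beta3, beta4, beta5)

# Full contraction T = B12 (.) B34 (.) U1 (.) ... (.) U4
T = np.einsum("pqt,rst,ap,bq,cr,ds->abcd", B12, B34, *U)

# Contracting the transfer tensors over beta5 yields the Tucker core C(beta1,...,beta4)
C = np.einsum("pqt,rst->pqrs", B12, B34)
T_tucker = np.einsum("pqrs,ap,bq,cr,ds->abcd", C, *U)
assert np.allclose(T, T_tucker)
```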

2.5.4 Canonical Polyadic Decomposition

The canonical polyadic (CP) decomposition [55] (which also goes by several other synonymous names) can be written as
$$T = \odot(\Phi_1,\dots,\Phi_d), \qquad (2.37)$$
[diagram: the nodes $\Phi_1,\dots,\Phi_4$ with legs $\alpha_1,\dots,\alpha_4$, all joined by the single (hyper)edge $\zeta$]
where $\Phi_\mu = \Phi_\mu(\alpha_\mu,\zeta) \in H_{\alpha_\mu}\otimes\mathbb{R}^m$, $\mu = 1,\dots,d$, for $m \in \mathbb{N}$. The (singleton) mode label $\zeta$ here takes the special role of a summation,
$$T = \sum_{i=1}^{m} \Phi_1(\zeta = i) \odot \dots \odot \Phi_d(\zeta = i),$$
of elementary (rank-one) terms. The number $m$ is hence also called the rank of $T$. Although the graph depicted above is not a tree (but a hypertree), this format is of high practical relevance, in particular as the decomposition is unique under mild conditions. We however focus on tree tensor formats (cf. Chapter 3).

¹⁶ This difference will be emphasized when we encounter tree networks and hierarchies.
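A minimal NumPy sketch of the CP format of Section 2.5.4 (mode sizes, rank and index letters are arbitrary illustrative choices), verifying the sum-of-rank-one-terms reading of the shared label $\zeta$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = [3, 4, 2, 5], 3
Phi = [rng.standard_normal((n[mu], m)) for mu in range(4)]   # Phi_mu(alpha_mu, zeta)

# All factors share the single label zeta, which is summed over exactly once
T = np.einsum("az,bz,cz,dz->abcd", *Phi)

# Equivalent sum of m elementary (rank-one) terms
T_sum = sum(np.einsum("a,b,c,d->abcd", *(P[:, i] for P in Phi)) for i in range(m))
assert np.allclose(T, T_sum)
```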


2.5.5 Cyclic TT / MPS Format

The cyclic tensor train format introduces an additional edge between $G_1$ and $G_d$ and aims to symmetrize the decomposition. With $G_2,\dots,G_{d-1}$ as in Section 2.5.1, the first and last node are changed to $G_1 = G_1(\beta_d,\alpha_1,\beta_1) \in \mathbb{R}^{r_d}\otimes H_{\alpha_1}\otimes\mathbb{R}^{r_1}$ and $G_d = G_d(\beta_{d-1},\alpha_d,\beta_d) \in \mathbb{R}^{r_{d-1}}\otimes H_{\alpha_d}\otimes\mathbb{R}^{r_d}$ for $r_d \in \mathbb{N}$,
$$T = G_1 \odot \dots \odot G_d.$$
[diagram: the nodes $G_1,\dots,G_4$ arranged in a cycle with edges $\beta_1,\dots,\beta_4$ and legs $\alpha_1,\dots,\alpha_4$]
Valuable properties of the tensor train are however lost here, in particular the interpretation of $r_\mu$, $\mu = 1,\dots,d$, as unique ranks (cf. Section 2.6.1). Moreover, the set of tensors represented through networks that contain cycles, as opposed to trees (cf. Chapter 3), is in fact not closed for fixed $r$ (cf. Section 3.3.1), as shown in [71].

2.5.6 Projected Entangled Pair States

The projected entangled pair states (PEPS) format is another generalization of the TT/MPS format, and is rather familiar from physics. Here, the nodes are arranged on a two-dimensional grid (as opposed to the one-dimensional chain for MPS). Since the format for $d = 4$ is equivalent to the cyclic MPS format, we here consider a $d^2$-dimensional tensor $T$ with mode labels $m(T) = \{\alpha_{i,j}\}_{i,j=1}^{d}$. We further introduce additional mode labels $\{\beta_{i,j}\}_{i,j=1}^{d}$ and $\{\gamma_{i,j}\}_{i,j=1}^{d}$. Then, with nodes $G_{i,j}$, $i,j = 1,\dots,d$, as in the diagram for $d = 3$, we have
$$T = G_{1,1} \odot G_{1,2} \odot \dots \odot G_{d,d}.$$
[diagram for $d = 3$: the nodes $G_{i,j}$ on a $3\times 3$ grid with legs $\alpha_{i,j}$, horizontal edges labeled $\beta_{i,j}$ and vertical edges labeled $\gamma_{i,j}$]
PEPS serves as a good example of networks that are well known for their applications, but which are even harder to handle from a theoretical perspective than the simple cyclic network in Section 2.5.5 or the CP format, Section 2.5.4.

2.6 Tensor Node Ranks and Decompositions

Conventional concepts known from matrices, in particular decompositions, are easily transferred to tensor nodes, but are labeled with single mode labels or sets thereof.


2.6.1 Ranks of a Node

Definition 2.31 (Ranks of a tensor node). Let $N \in \mathcal{N}_\alpha$ and $\gamma \subset m(N)$. We define $r = \operatorname{rank}_\gamma(N) \in \mathbb{N}_0 \cup \{\infty\}$ as the minimal required number of summands
$$S_i = X_i \odot_\emptyset Y_i, \qquad m(X_i) = \gamma, \quad m(Y_i) = m(N)\setminus\gamma, \qquad i = 1,\dots,r,$$
for which $N = \sum_{i=1}^{r} S_i$.

The rank of a node demonstrates the singular role that the Euclidean spaces $\mathbb{R}^n$, $n \in \mathbb{N}$, hold. For a finite¹⁷ rank $r$, we define $X = X(\gamma,\beta)$ and $Y = Y(m(N)\setminus\gamma,\beta)$ for $H_\beta = \mathbb{R}^r$ via $X(\beta = i) = X_i$, $Y(\beta = i) = Y_i$, $i = 1,\dots,r$. This yields the decomposition $N = X \odot_\beta Y$:
[diagram: the node $N$ with mode lines $\gamma$ and $m(N)\setminus\gamma$ rewritten as $X \odot_\beta Y$, i.e. as the sum of the $r$ elementary products $X_1 \odot_\emptyset Y_1 + \dots + X_r \odot_\emptyset Y_r$]
For $H_{\alpha_\mu} = \mathbb{R}^{n_\mu}$, $\mu = 1,\dots,d$, each of these ranks is the ordinary matrix rank of an unfolding,
$$\operatorname{rank}_\gamma(N) = \operatorname{rank}\big(N^{(\gamma),(m(N)\setminus\gamma)}\big).$$
It further holds that
$$N^{(\gamma),(m(N)\setminus\gamma)} = X^{(\gamma),(\beta)} \cdot Y^{(\beta),(m(N)\setminus\gamma)}, \qquad X^{(\gamma),(\beta)} \in \mathbb{R}^{d(\gamma)\times r}, \quad Y^{(\beta),(m(N)\setminus\gamma)} \in \mathbb{R}^{r\times d(m(N)\setminus\gamma)},$$
which is well known as a low-rank decomposition.
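A minimal NumPy sketch (labels a, b, c, e and all shapes are arbitrary illustrative choices): the rank with respect to a label subset is the matrix rank of the corresponding unfolding, and an SVD of that unfolding yields a decomposition $N = X \odot_\beta Y$:

```python
import numpy as np

rng = np.random.default_rng(6)
N = rng.standard_normal((3, 4, 2, 5))                    # labels (a, b, c, e)

unf = N.transpose(0, 2, 1, 3).reshape(3 * 2, 4 * 5)      # rows (a, c) = gamma, columns (b, e)
r = np.linalg.matrix_rank(unf)                           # rank_gamma(N)

U, s, Vt = np.linalg.svd(unf, full_matrices=False)
X = (U[:, :r] * s[:r]).reshape(3, 2, r)                  # X(a, c, beta)
Y = Vt[:r].reshape(r, 4, 5)                              # Y(beta, b, e)
assert np.allclose(N, np.einsum("acp,pbe->abce", X, Y))
```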

2.6.2 Orthogonality of Nodes

Orthogonality of nodes is an important concept which we encounter in particular for tree tensor networks, which we introduce in Chapter 3. As graphical notation, we draw arrows in our diagrams, and surround roots of these implicitly directed graphs with black circles, as it emphasizes our perspective.

Definition 2.32 (Orthogonality of nodes). Let $N \in \mathcal{N}_\alpha$ and $\gamma \subset \alpha$ such that $m(N)\setminus\gamma \subset \alpha$. We say $N$ is $\gamma$-(column)-orthogonal if $\|N \odot M\| = \|M\|$ for all $M = M(\gamma)$. Likewise, $N$ is $\gamma$-(row)-orthogonal if $\|M \odot N\| = \|M\|$ for all $M = M(\gamma)$.

These conditions are equivalent to $\|N \odot_\gamma M\| = \|M\|$ (and $\|M \odot_\gamma N\| = \|M\|$, respectively) for all $M \in \mathcal{N}_\alpha$. Further, $N$ is $\gamma$-column-orthogonal iff $N^T$ is $\gamma$-row-orthogonal, since $N \odot M = M \odot N^T$ for all $M = M(\gamma)$. If the identity matrix $I_\gamma = I_\gamma(\gamma,\gamma)$, with $I_\gamma \odot_\gamma M = M \odot_\gamma I_\gamma = M$ for all $M \in \mathcal{N}_\alpha$, exists¹⁸, then $\gamma$-column-orthogonality is equivalent to $N^T \odot_{m(N)\setminus\gamma} N = I_\gamma$ and $\gamma$-row-orthogonality is equivalent to $N \odot_{m(N)\setminus\gamma} N^T = I_\gamma$.

Figure 2.7: The arrow indicates that the node is orthogonal with respect to $\gamma$.

If $m(N)(\gamma_\ell) \le 1$ for $\ell = 1,\dots,k$, then there is no difference between column- and row-orthogonality, and $N$ is simply called $\gamma$-orthogonal. Then further, $N \odot_{\setminus\gamma} N = I_\gamma$.

For $H_{\alpha_\mu} = \mathbb{R}^{n_\mu}$, $\mu = 1,\dots,d$, a node $N$ is $\gamma$-column(row)-orthogonal iff $N^{(\gamma),(m(N)\setminus\gamma)}$ is a column(row)-orthogonal matrix.

¹⁷ For $r = \infty$, it may not be possible to decompose $N$ into smaller nodes, such that function spaces as carried out in Section 2.8.1 would be required to do so.
¹⁸ For example for a finite-dimensional Hilbert space $H_\gamma$.


Theorem 2.33 (Orthogonality in products). Let $N_1, N_2 \in \mathcal{N}_\alpha$ with $m(N_1), m(N_2) \subset \alpha$. Further, let $\psi = m(N_1)\cap m(N_2)$ as well as $\gamma \subset m(N_2)\setminus\psi$ and $\delta \subset m(N_1)\setminus\psi$. Then each two of the following properties imply the remaining one:

• $N_1 \odot N_2$ is $(\delta,\gamma)$-orthogonal,

• $N_1$ is $(\delta,\psi)$-orthogonal,

• $N_2$ is $\gamma$-orthogonal.

Figure 2.8: The situation in Theorem 2.33: the nodes $N_1$ and $N_2$ share the labels $\psi$, while a test node $M$ is attached via $\delta$ and $\gamma$.

Proof. Let $M = M(\gamma,\delta)$. Then, due to the associativity rules, $(N_1 \odot N_2)\odot M = (N_1 \odot_\psi N_2)\odot_{\delta,\gamma} M = N_1 \odot_{\psi,\delta}(N_2 \odot_\gamma M)$. If all three constraints hold true, then $\|M\| = \|(N_1\odot N_2)\odot M\| = \|N_1 \odot_{\psi,\delta}(N_2\odot_\gamma M)\| = \|N_2\odot_\gamma M\| = \|M\|$, where the three equalities correspond to the three properties. Since $M$ was arbitrary and the outermost terms coincide, it is not possible to violate just one of the equalities, which was to be shown.

The two situations in which either $\delta = \emptyset$ or $\psi = \emptyset$ are the most relevant ones appearing in later sections.

Lemma 2.34 (Orthogonality in Hadamard products). Let $N \in \mathcal{N}_\alpha$, $m(N) \subset \alpha$, be $\gamma$-orthogonal, $\gamma \subset m(N)$, as well as $\delta \subset \gamma$. Further, let $H_\delta = \mathbb{R}^{d(\delta)}$ be a Euclidean space. Then $\|N \odot^\delta_{\gamma\setminus\delta} M\| = \|M\|$ for all $M = M(\gamma)$.

Proof. Through use of the partial diagonal operation, we obtain $\|N \odot^\delta_{\gamma\setminus\delta} M\| = \|N \odot_\gamma \operatorname{diag}_\delta(M)\| = \|\operatorname{diag}_\delta(M)\| = \|M\|$.

Corollary 2.35 (Orthogonality in chained Hadamard products). Let $\gamma \subset \alpha$ be fixed and let $H_\gamma = \mathbb{R}^{d(\gamma)}$ be a Euclidean space. Further, let $N_1,\dots,N_s \in \mathcal{N}_\alpha$, with $\gamma \subset m(N_i) \subset \alpha$, be $\gamma$-orthogonal and $m(N_i)\cap m(N_j) = \gamma$ for $i \ne j$, $i,j = 1,\dots,s$. Then $N_1 \odot^\gamma \dots \odot^\gamma N_s$ is $\gamma$-orthogonal.

Proof. We have to show that $\|(N_1 \odot^\gamma \dots \odot^\gamma N_s)\odot_\gamma M\| = \|M\|$ for all $M = M(\gamma)$. By associativity (see Lemma 2.10), we have that $\|(N_1 \odot^\gamma \dots \odot^\gamma N_s)\odot_\gamma M\| = \|N_1 \odot_\gamma (N_2 \odot^\gamma \dots \odot^\gamma N_s \odot^\gamma M)\| = \|N_2 \odot^\gamma \dots \odot^\gamma N_s \odot^\gamma M\|$, where the last step follows by the orthogonality of $N_1$. By Lemma 2.34 (for $\delta = \gamma$), it remains to show that $N_2 \odot^\gamma \dots \odot^\gamma N_s$ is $\gamma$-orthogonal as well. This in turn follows by inductive reasoning.
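A minimal numerical illustration of Theorem 2.33 for the case $\delta = \emptyset$ (shapes and index roles are arbitrary illustrative choices): if $N_1(b,p)$ is $p$-orthogonal and $N_2(p,c)$ is $c$-orthogonal, then the product over the shared label $p$ is $c$-orthogonal, i.e. has orthonormal columns in the corresponding unfolding.

```python
import numpy as np

rng = np.random.default_rng(7)
N1, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # 6x3 with orthonormal columns (label p)
N2, _ = np.linalg.qr(rng.standard_normal((3, 2)))   # 3x2 with orthonormal columns (label c)

P = N1 @ N2                                         # contraction over the shared label p
assert np.allclose(P.T @ P, np.eye(2))              # c-orthogonality of the product
```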

2.6.3 Node SVD and Node QR-Decomposition

Definition 2.36 (Node SVD). Let $N \in \mathcal{N}_\alpha$ and $\gamma \subset \alpha$ such that $m(N)\setminus\gamma \subset \alpha$, as well as $H_\beta = \mathbb{R}^r$, $r = \operatorname{rank}_\gamma(N) \in \mathbb{N}$ (where $\beta$ is allowed to be contained in $\alpha$). Further, let $U = U(\gamma,\beta)$ be $\beta$-column-orthogonal, let $V_t = V_t(\beta, m(N)\setminus\gamma)$ be $\beta$-row-orthogonal and $\sigma = \sigma(\beta)$ with $\sigma_i \ge \sigma_{i+1} > 0$¹⁹ for $i = 1,\dots,r-1$. Then
$$(U,\sigma,V_t), \qquad N = \odot_\beta(U,\sigma,V_t),$$
is called a (compact) node SVD of $N$ with respect to $\gamma$. We further set $\Sigma = \operatorname{diag}_\beta(\sigma)$. The entries of $\sigma$ are the positive singular values of $N$ with respect to $\gamma$ and we define
$$\operatorname{sv}_\gamma(N) := \sigma.$$

Note that in Chapter 7 it is more convenient to treat $\sigma$ as an infinite sequence, and we instead use $\sigma^+$ for the positive part of the singular values.

¹⁹ Since $\sigma$ only has one mode label, we can use the shorter notation $\sigma_i = \sigma(\beta = i)$.

Figure 2.9: A node SVD of $N$ into $U$, $\sigma$, $V_t$, equivalently written as $U \odot \Sigma \odot V_t$. Singular value nodes such as $\sigma$ and $\Sigma$ are usually drawn in magenta.

Restricting $r = d(\beta)$ to a value lower than $\operatorname{rank}_\gamma(N)$ yields a truncated SVD. In the case $H_{\alpha_\mu} = \mathbb{R}^{n_\mu}$, $\mu = 1,\dots,d$, the node SVD corresponds to the ordinary (compact) matrix SVD by means of
$$N^{(\gamma),(m(N)\setminus\gamma)} = U^{(\gamma),(\beta)} \cdot \Sigma^{(\beta),(\beta)} \cdot V_t^{(\beta),(m(N)\setminus\gamma)},$$
where $U^{(\gamma),(\beta)}$ is a column-orthogonal and $V_t^{(\beta),(m(N)\setminus\gamma)}$ is a row-orthogonal matrix, and it thus always exists for any finite-dimensional Hilbert spaces. For infinite-dimensional ones, the existence of truncated SVDs is provided by Theorem 2.4.

If $|m(N)\setminus\gamma| = 1$ and $\beta := \delta := m(N)\setminus\gamma$, then we obtain the decomposition $U = U(\gamma,\delta)$, $\sigma = \sigma(\delta)$ and $V_t = V_t(\delta,\delta)$. If we now wish to subsequently truncate the SVD, then a slight resulting issue can be resolved by falling back on Notation 2.24. As in Eq. (2.32), $V_t((\delta,\delta) \in \{1,\dots,\tilde r\}\times\{1,\dots,r\})^{(\delta),(\delta)} \in \mathbb{R}^{\tilde r\times r}$ is defined as a nonsquare matrix, where $\tilde r < r$ is the mode size obtained through truncation. One proceeds similarly with $U$ and $\sigma$.
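A minimal NumPy sketch of a node SVD via the ordinary SVD of an unfolding (labels a, b, c and the shapes are arbitrary illustrative choices), including a subsequent truncation:

```python
import numpy as np

rng = np.random.default_rng(8)
N = rng.standard_normal((4, 6, 3))                       # labels (a, b, c), gamma = {a, c}

unf = N.transpose(0, 2, 1).reshape(4 * 3, 6)             # rows (a, c) = gamma, columns b
U, s, Vt = np.linalg.svd(unf, full_matrices=False)
r = int(np.sum(s > 1e-12))                               # rank_gamma(N)
U, s, Vt = U[:, :r], s[:r], Vt[:r]                       # compact node SVD (U, sigma, Vt)

U_node = U.reshape(4, 3, r)                              # U(a, c, beta)
assert np.allclose(N, np.einsum("acp,p,pb->abc", U_node, s, Vt))

# A truncated SVD simply restricts the mode size of beta further
rt = min(r, 2)
N_trunc = np.einsum("acp,p,pb->abc", U_node[:, :, :rt], s[:rt], Vt[:rt])
```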

Definition 2.37 (Node QR-decomposition). Let $N \in \mathcal{N}_\alpha$ and $\gamma \subset \alpha$ such that $m(N)\setminus\gamma \subset \alpha$. Then
$$N = Q \odot R, \qquad Q = Q(m(N)\setminus\gamma,\gamma), \quad R = R(\gamma,\gamma),$$
where $Q$ is $\gamma$-column-orthogonal, is called a node QR-decomposition of $N$ with respect to $\gamma$. When $H_\gamma \subset \mathbb{R}^{\Omega_\gamma}$, then we additionally demand $R((\gamma,\gamma) = (x,y)) = 0$ for all $x > y$ (with respect to the co-lexicographic ordering on $\Omega_\gamma$).

Figure 2.10: A node QR-decomposition of $N$ into $Q$ and $R$.

In the case $H_{\alpha_\mu} = \mathbb{R}^{n_\mu}$, $\mu = 1,\dots,d$, the node QR again corresponds to the ordinary matrix version by means of
$$N^{(m(N)\setminus\gamma),(\gamma)} = Q^{(m(N)\setminus\gamma),(\gamma)} \cdot R^{(\gamma),(\gamma)},$$
where $Q^{(m(N)\setminus\gamma),(\gamma)}$ is a column-orthogonal matrix and $R^{(\gamma),(\gamma)}$ is upper triangular. An issue related to the one with truncated SVDs arises when $N^{(m(N)\setminus\gamma),(\gamma)}$ does not have full rank (for example if it is wider than tall). When $|\gamma| = 1$, $\gamma = \{\alpha_\mu\}$, these cases again fall back on Notation 2.24 by means of $R_{\mathrm{unf}} = R((\alpha_\mu,\alpha_\mu) \in \{1,\dots,\tilde n_\mu\}\times\{1,\dots,n_\mu\})^{(\alpha_\mu),(\alpha_\mu)} \in \mathbb{R}^{\tilde n_\mu\times n_\mu}$. Hence the resulting matrix is only generalized upper triangular and
$$Q(\alpha_\mu \in \{1,\dots,\tilde n_\mu\})^{(m(N)\setminus\alpha_\mu),(\alpha_\mu)} = N^{(m(N)\setminus\alpha_\mu),(\alpha_\mu)} \cdot R_{\mathrm{unf}}^\dagger$$
is column-orthogonal, but only as it is restricted to a sufficiently small mode size $\tilde n_\mu$.
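A minimal NumPy sketch of a node QR-decomposition via the reduced QR of an unfolding (labels a, b, c and the shapes are arbitrary illustrative choices with a tall unfolding, so the rank-deficiency issue discussed above does not occur):

```python
import numpy as np

rng = np.random.default_rng(9)
N = rng.standard_normal((3, 4, 5))                    # labels (a, b, c), gamma = {c}

unf = N.reshape(3 * 4, 5)                             # rows (a, b) = m(N) \ gamma, columns c
Q, R = np.linalg.qr(unf, mode="reduced")              # Q: 12x5 column-orthogonal, R: 5x5

Q_node = Q.reshape(3, 4, 5)                           # Q(a, b, gamma)
assert np.allclose(N, np.einsum("abp,pc->abc", Q_node, R))
assert np.allclose(np.einsum("abp,abq->pq", Q_node, Q_node), np.eye(5))
```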


2.7 Nested Products

Let $T = T(\alpha) = G_1 \odot \dots \odot G_d = \odot(G_1,\dots,G_d)$ be a tensor train representation as in Section 2.5.1. The norm of $T$ can be expressed as the nested product
$$\|T\|^2 = \big(\odot(G_1,\dots,G_d)\big) \odot \big(\odot(G_1,\dots,G_d)\big). \qquad (2.38)$$
If we want to simplify this computation, we encounter the problem that components in this representation share identical labels $\beta_1,\dots,\beta_{d-1}$. In order to provide a remedy, we define copies of such with modified labels $\beta'_1,\dots,\beta'_{d-1}$. We denote this renaming via
$$G'_1 := G_1(\beta_1 \mapsto \beta'_1), \qquad G'_\mu := G_\mu(\beta_{\mu-1}\mapsto\beta'_{\mu-1},\ \beta_\mu\mapsto\beta'_\mu),\ \mu = 2,\dots,d-1, \qquad G'_d := G_d(\beta_{d-1}\mapsto\beta'_{d-1}),$$
such that $m(G'_1) = \{\alpha_1,\beta'_1\}$, $m(G'_\mu) = \{\beta'_{\mu-1},\alpha_\mu,\beta'_\mu\}$, $\mu = 2,\dots,d-1$, and $m(G'_d) = \{\beta'_{d-1},\alpha_d\}$. In short, we also write $G'_\mu = G_\mu(\beta\mapsto\beta')$, $\mu = 1,\dots,d$. Such modified versions are assumed to equal their counterparts at all times, only with modified mode labels.

Using $G'$, Eq. (2.38) can be reformulated²⁰ as
$$\|T\|^2 = \big(\odot(G'_1,\dots,G'_d)\big) \odot \big(\odot(G_1,\dots,G_d)\big) = \odot(G'_1,\dots,G'_d,G_1,\dots,G_d) = (G'_1\odot G_1) \odot \dots \odot (G'_d\odot G_d),$$
as depicted below:
[diagram: two copies of the TT network $G_1,\dots,G_4$ joined along the legs $\alpha_1,\dots,\alpha_4$; on the right, the second copy carries the renamed labels $\beta'_1,\beta'_2,\beta'_3$]
Contracting in the order given in the right-hand picture, for example, avoids computing the possibly large tensor $T$. In general, finding an optimal contraction order can be computationally exhaustive. In [91], several approaches to this problem are discussed.

²⁰ The Matlab implementation of the arithmetic does so automatically in order to effectively evaluate nested products, and recalls the determined order of contractions for subsequent, equivalent cases.
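A minimal NumPy sketch of Eq. (2.38) (mode sizes and ranks are arbitrary illustrative choices; it does not use the thesis' Matlab arithmetic): $\|T\|^2$ is computed by contracting $(G'_\mu \odot G_\mu)$ core by core, without ever forming $T$ itself.

```python
import numpy as np

rng = np.random.default_rng(10)
n, r = [3, 4, 5, 3], [2, 3, 2]
G = [rng.standard_normal((n[0], r[0])),
     rng.standard_normal((r[0], n[1], r[1])),
     rng.standard_normal((r[1], n[2], r[2])),
     rng.standard_normal((r[2], n[3]))]

# Ladder contraction: E carries the pair of labels (beta_mu, beta'_mu)
E = np.einsum("ap,aq->pq", G[0], G[0])            # G_1 (.) G'_1
E = np.einsum("pq,pbs,qbt->st", E, G[1], G[1])    # absorb G_2 (.) G'_2
E = np.einsum("pq,pbs,qbt->st", E, G[2], G[2])    # absorb G_3 (.) G'_3
norm_sq = np.einsum("pq,pd,qd->", E, G[3], G[3])  # close with G_4 (.) G'_4

T = np.einsum("ap,pbq,qcs,sd->abcd", *G)          # reference: the full tensor
assert np.isclose(norm_sq, np.sum(T ** 2))
```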

The process of renaming modes demonstrated above can be carried out for any nested product in order to transform it into a single product in which each contraction has been assigned a unique mode label. In the following two sections, we formalize these techniques²¹.

²¹ Essentially, this is the same as renaming summation indices, for example in $\sum_{s=1}^{4} a_s \cdot \sum_{s=1}^{4} b_s = \sum_{i=1}^{4}\sum_{j=1}^{4} a_i b_j$.

2.7.1 Mode Label Renaming

For function spaces, there is a comparatively simple way to define renaming, as in the following Definition 2.38.

Definition 2.38 (Renaming in function spaces). Let $H_{\alpha_\mu} = \mathbb{R}^{\Omega_{\alpha_\mu}}$, $\mu = 1,\dots,d$, and $N \in \mathcal{N}_\alpha$. Further, let $g^{(1)} = (g^{(1)}_1,\dots,g^{(1)}_k)$, $m(N) = \{g^{(1)}_1,\dots,g^{(1)}_k\}$, as well as $g^{(2)} = (g^{(2)}_1,\dots,g^{(2)}_k) \in \alpha^k$ be two lists of mode labels for which $H_{g^{(2)}_i} = H_{g^{(1)}_i}$, $i = 1,\dots,k$. We define
$$N' = N\big(g^{(1)}_1 \mapsto g^{(2)}_1,\ \dots,\ g^{(1)}_k \mapsto g^{(2)}_k\big)$$
via $m(N') := \{g^{(2)}_1,\dots,g^{(2)}_k\}$ and
$$N'\big(g^{(2)}_1 = x_1,\dots,g^{(2)}_k = x_k\big) := N\big(g^{(1)}_1 = x_1,\dots,g^{(1)}_k = x_k\big)$$
for all $x \in \Omega_{m(N)} = \Omega_{m(N')}$.

We also abbreviate $N' = N(g^{(1)} \mapsto g^{(2)})$. For incomplete lists of mode labels $\tilde g^{(u)} = (g^{(u)}_1,\dots,g^{(u)}_s)$, $s < k = |m(N)|$, $u = 1,2$, such that $m(N) = \{g^{(1)}_1,\dots,g^{(1)}_k\}$, $g^{(1)}_i \le g^{(1)}_{i+1}$, $i = s+1,\dots,k-1$, we set $g^{(2)}_i := g^{(1)}_i$, $i = s+1,\dots,k$. We thereby define
$$N\big(\tilde g^{(1)} \mapsto \tilde g^{(2)}\big) := N\big(g^{(1)} \mapsto g^{(2)}\big).$$
We ignore mode labels $g^{(1)}_i$ which are not included in those of $N$, as well as their counterparts $g^{(2)}_i$, in such a renaming. The lists $g^{(1)}$ and $g^{(2)}$ are further allowed to be ordered (mode label) sets. For $H_{\alpha_\mu} = \mathbb{R}^{n_\mu}$, $\mu = 1,\dots,d$, we have
$$N\big(g^{(1)} \mapsto g^{(2)}\big)^{(g^{(2)}_1),\dots,(g^{(2)}_s)} = N^{(g^{(1)}_1),\dots,(g^{(1)}_s)}.$$
Mode labels $g^{(2)}_i \notin \alpha$ are considered to implicitly declare new Hilbert spaces $H_{g^{(2)}_i} := H_{g^{(1)}_i}$, provided the ordering of mode labels remains clear. For example, for $N = N(\alpha_1,\alpha_2) \in \mathcal{N}_\alpha$, we interpret $N(\alpha_2 \mapsto \alpha'_2) \in \mathcal{N}_{\alpha\cup\alpha'}$ such that $m(N(\alpha_2 \mapsto \alpha'_2)) = \{\alpha_1,\alpha'_2\}$.

For later chapters, it is convenient to define the following short notation. If not otherwise mentioned, the operation $(\cdot)'$ renames duplicate appearances of mode labels, from right to left:

Notation 2.39 (Default renaming). Let $N = N(\gamma,\gamma,\beta)$ be a tensor node. If not otherwise specified, then
$$N' = N'(\gamma',\gamma,\beta) := N\big((\gamma,\gamma,\beta) \mapsto (\gamma',\gamma,\beta)\big),$$
where $\gamma$ and $\beta$ may be (ordered) sets of mode labels.

For example, for $N = N(\alpha_1,\alpha_1,\beta_1)$, we have $N' = N'(\alpha'_1,\alpha_1,\beta_1)$. The general situation is rather technical, and we state it only for reasons of completeness:

Definition 2.40 (Renaming). Let $H_{\alpha_\mu}$ be arbitrary Hilbert spaces, $\mu = 1,\dots,d$. Then, for $g^{(1)}$ and $g^{(2)}$ as in Definition 2.38, we define $N'$ for elementary tensors $N = \big(w_1\otimes\dots\otimes w_k,\ \{g^{(1)}_1,\dots,g^{(1)}_k\}\big) \in \mathcal{N}_\alpha$, $w_i \in H_{g^{(1)}_i}$, via
$$N' := \big(v_1\otimes\dots\otimes v_k,\ \{g^{(2)}_1,\dots,g^{(2)}_k\}\big), \qquad v_{\pi^{(2)}_j} := w_{\pi^{(1)}_j},\quad j = 1,\dots,k,$$
where $\pi^{(u)}$, $u = 1,2$, are the two unique permutations for which
$$\pi^{(u)}_i < \pi^{(u)}_j \iff \big(g^{(u)}_i = g^{(u)}_j \wedge i < j\big) \vee g^{(u)}_i < g^{(u)}_j.$$
This induces the renaming operation on the entirety of $\mathcal{N}_\alpha$.

For example, for $N = (w_1\otimes w_2\otimes w_3\otimes w_4,\ \{\alpha_1,\alpha_2,\alpha_3,\alpha_3\})$, we have $N(\alpha_3\mapsto\alpha_1) = (w_3\otimes w_1\otimes w_2\otimes w_4,\ \{\alpha_1,\alpha_1,\alpha_2,\alpha_3\})$, as well as $N((\alpha_3,\alpha_3)\mapsto(\alpha_3,\alpha_1)) = (w_4\otimes w_1\otimes w_2\otimes w_3,\ \{\alpha_1,\alpha_1,\alpha_2,\alpha_3\})$ and $N((\alpha_1,\alpha_3)\mapsto(\alpha_1,\alpha_1)) = (w_1\otimes w_3\otimes w_2\otimes w_4,\ \{\alpha_1,\alpha_1,\alpha_2,\alpha_3\})$.


2.7.2 Flattening

Nested products such as Eq. (2.38) require sequences of evaluations that rely on earlier ones (if processed in the given order). The general situation involving nodes $N_1,\dots,N_s \in \mathcal{N}_\alpha$ can be expressed as follows. For each $\ell = 1,\dots,p$, let
$$\bigcup_{i=1}^{k_\ell} S^{(\ell)}_i = \{1,\dots,k_{\ell-1}\}$$
be a partition, with $k_0 = s$, $k_\ell \in \mathbb{N}$, $\ell = 1,\dots,p-1$, and $k_p = 1$. For every $i = 1,\dots,k_\ell$, $\ell = 1,\dots,p$, let $K^{(\ell)}_i \uplus C^{(\ell)}_i \subset \alpha$ and
$$N^{(\ell)}_i = \odot^{C^{(\ell)}_i}_{K^{(\ell)}_i}\big(N^{(\ell-1)}_{j_1},\dots,N^{(\ell-1)}_{j_u}\big), \qquad \{j_1,\dots,j_u\} = S^{(\ell)}_i,\ j_1 < \dots < j_u,$$
for $N^{(0)}_i = N_i$, $i = 1,\dots,s$. Then $M = N^{(p)}_1$ is the result of this nested product.

For example, in the earlier Eq. (2.38), we have $S^{(1)}_1 = \{1,2,3\}$, $S^{(1)}_2 = \{4,5,6\}$ and $S^{(2)}_1 = \{1,2\}$. As before, we can rename until each contraction is assigned a unique mode label. For that matter, we use a set of derived mode labels
$$A := \{\alpha'_\mu, \alpha''_\mu, \dots \mid \mu = 1,\dots,d\}$$
with ordering $\gamma' < \gamma$ for all $\gamma \in A\cup\alpha$. Then there exist mode label sets $\lambda^{(i)} \subset A\cup\alpha$, $i = 1,\dots,s$, as well as $K' \subset A\cup\alpha$, such that
$$N'_i := N_i\big(m(N_i)\mapsto\lambda^{(i)}\big) \in \mathcal{N}_{A\cup\alpha}, \qquad M' := \odot_{K'}(N'_1,\dots,N'_s)$$
fulfills

(1): $m(N'_i) \subset A\cup\alpha$ for $i = 1,\dots,s$,

(2): $M = M'\big((\dots,\alpha''_1,\alpha'_1,\alpha_1)\mapsto(\dots,\alpha_1,\alpha_1)\big)\dots\big((\dots,\alpha''_d,\alpha'_d,\alpha_d)\mapsto(\dots,\alpha_d,\alpha_d)\big)$.

Due to property (1), we have $M' = \odot_{K'}(N'_{\pi(1)},\dots,N'_{\pi(s)})$ for any permutation $\pi$. Furthermore, the applied renaming only depends on the given mode labels $m(N_i)$. In Chapter 3, we cover networks of tensor nodes and often assume that properties similar to (1) hold a priori.

2.8 Operator Nodes and Complex Hilbert Spaces

As carried out in Section 2.2.2, Hilbert-Schmidt operators can be interpreted as elements of a topological tensor product space. For other operators, an analogous label-based arithmetic as for tensor nodes can be defined, similar to Section 2.3.6. Due to its seldom use in later chapters, we only briefly consider it in Section 2.8.1 below. We have also so far limited ourselves to the field $\mathbb{K} = \mathbb{R}$, and discuss the field $\mathbb{K} = \mathbb{C}$ shortly in Section 2.8.2.

2.8.1 Linear and Bilinear Functions as Nodes

As we only use the following arithmetic in the initial sections of Chapter 6, we carry out several examples instead of a further, strict formalization. Some of these go beyond our requirements for the subsequent sections and remain guiding side notes²².

²² This allows to turn the infinite SVD Eq. (2.4) into an object-based decomposition $\Phi = \Phi(\alpha,\delta) = v \odot \Sigma \odot w$, for $v = v(\alpha,\beta) : H_\beta = \ell^2(\mathbb{R}) \to H_\alpha$ with $v(e_i) = v_i$ and $w = w(\delta,\beta) : H_\beta = \ell^2(\mathbb{R}) \to H_\delta$ with $w(e_i) = w_i$, $i \in \mathbb{N}$, as well as the compact operator $\Sigma = \Sigma(\beta,\beta) = \operatorname{diag}_\beta(\sigma) : \ell^2(\mathbb{R}) \to \ell^2(\mathbb{R})$. The nodes $v$ and $w$ are then $\beta$-orthogonal.


As previously carried out for elements in Hilbert spaces, we here assign mode labels to continuous, linear operators that act on these Hilbert spaces, such as
$$\ell \in \hom_0(H_\beta,H_\alpha), \qquad \ell : H_\beta \to H_\alpha.$$
We then write $L = L(\alpha,\beta) = (\ell,\{\alpha,\beta\})$. Let
$$\phi : H_\alpha\otimes H_\beta \overset{\cong}{\longrightarrow} \mathrm{HS}(H_\beta,H_\alpha)$$
be the isometry defined in Eq. (2.6). For all Hilbert-Schmidt operators $\ell$, the interaction via $\odot$ with regular nodes $N \in \mathcal{N}_{\{\alpha,\beta\}}$ is equivalent to the action of its counterpart,
$$L \odot_\gamma N := \big(\phi^{-1}(\ell),\{\alpha,\beta\}\big) \odot_\gamma N, \qquad \gamma \subset \{\alpha,\beta\}.$$
This, in an informal way, induces the behavior of $L$ in the general case, as briefly described in the following.

Let $N_1 = (v_1,\{\beta\}) \in \mathcal{N}(H_\beta)$ and $N_2 = (v_2,\{\alpha\}) \in \mathcal{N}(H_\alpha)$ be two tensor nodes. We define
$$N_1 \odot_\beta L = L \odot_\beta N_1 := (\ell(v_1),\{\alpha\})$$
as well as
$$L \odot_\alpha N_2 = N_2 \odot_\alpha L := (\ell^*(v_2),\{\beta\}),$$
where $\ell^*$ is the adjoint of $\ell$. For $\ell : H_\beta \to H_\beta$, we define
$$L \odot_\beta N_1 := (\ell(v_1),\{\beta\}), \qquad N_1 \odot_\beta L := (\ell^*(v_1),\{\beta\}), \qquad L \odot_\beta L := (\ell\circ\ell,\{\beta,\beta\}).$$
Also two different nodes given through linear maps $\ell_1,\ell_2 : H_\beta \to H_\alpha$ can be contracted,
$$L_1 \odot_\beta L_2 = (\ell_1\circ\ell_2^*,\{\alpha,\alpha\}), \qquad L_1 \odot_\alpha L_2 = (\ell_1^*\circ\ell_2,\{\beta,\beta\}),$$
which remains conform with Hilbert-Schmidt operators. The same principles are applied to continuous, bilinear forms $b : H_\alpha\times H_\beta \to \mathbb{R}$, where we write $B = B(\alpha,\beta) = (b,\{\alpha,\beta\})$. We define the continuous, linear functionals with codomain $\mathbb{R}$,
$$B \odot_\beta N_1 := (b(\cdot,v_1),\{\alpha\}), \qquad N_2 \odot_\alpha B := (b(v_2,\cdot),\{\beta\}).$$
Since the dual is isometric to the Hilbert space itself, these functionals can in turn be interpreted as ordinary nodes. Accordingly,
$$N_2 \odot_\alpha B \odot_\beta N_1 := b(v_2,v_1) \in \mathbb{R}.$$
The contraction between the bilinear form $B$ and the linear operator $L$ is given by
$$B \odot_\beta L := \big(b(\cdot,\ell^*(\cdot)),\{\alpha,\alpha\}\big), \qquad L \odot_\alpha B := \big(b(\ell(\cdot),\cdot),\{\beta,\beta\}\big).$$
These operations also conform with the Lax-Milgram setting, that is, when the bilinear form is turned into a linear map by means of $b(v,w) = \langle\ell_b(v),w\rangle$. Multiplications with nodes with disjoint mode labels are just tensor products, but we treat the results as elements in algebraic tensor product spaces. Hence, for example for $\ell : H_\beta \to H_\alpha$ and $N_3 = (V_3,\{\delta\}) \in \mathcal{N}(H_\delta)$, we have
$$L \odot N_3 = (\ell\otimes V_3,\{\alpha,\beta,\delta\}) \in \big(\hom_0(H_\beta,H_\alpha)\otimes_a H_\delta\big)\times\{\alpha,\beta,\delta\},$$
assuming $\alpha < \beta < \delta$. All remaining cases are induced with regard to Lemmas 2.5 and 2.6. Accordingly, the multiplication with $N_1$ yields
$$(L \odot N_3) \odot N_1 = (\ell(v_1)\otimes V_3,\{\alpha,\delta\}) \in (H_\alpha\otimes H_\delta)\times\{\alpha,\delta\},$$
and follows the same scheme as in Lemma 2.23, where we merely did not assign mode labels to the operators. As previously done for ordinary tensor nodes, we rid ourselves of bookkeeping and later identify $L$ with its value $\ell$, and $B$ with $b$.


2.8.2 Complex Hilbert Spaces

We only make use of the complex Hilbert space $\mathbb{C}^{n_\mu}$ in order to generalize Theorem 3.16 as we require it for Chapter 7 and for the initial sections of Chapter 8. The discussion below is slightly more general.

Let $H_{\alpha_\mu}$, $\mu = 1,\dots,d$, be complex Hilbert spaces with complex scalar products $\langle\cdot,\cdot\rangle_{\mathbb{C},\alpha_\mu}$. We assume that there exist functions $\overline{\,\cdot\,} : H_{\alpha_\mu} \to H_{\alpha_\mu}$ as well as real scalar products $\langle\cdot,\cdot\rangle_{\alpha_\mu}$ such that
$$\langle v_1,v_2\rangle_{\mathbb{C},\alpha_\mu} = \langle v_1,\overline{v_2}\rangle_{\alpha_\mu}$$
for all $v_1,v_2 \in H_{\alpha_\mu}$, $\mu = 1,\dots,d$, which induces analogous functions $\overline{\,\cdot\,} : H_\gamma \to H_\gamma$ for all $\gamma \subset \alpha$. The previous definitions and assertions are applicable to such spaces as well, with minor modifications as follows. The norm of a node $N = (v,\alpha) \in \mathcal{N}(H_\alpha)$ is given by
$$\|N\|^2 = N \odot_\alpha \overline N = (v,\alpha)\odot_\alpha(\overline v,\alpha) = \langle v,\overline v\rangle_\alpha = \langle v,v\rangle_{\mathbb{C},\alpha}.$$
We call complex-valued nodes unitary if they are orthogonal in the sense of Definition 2.32. For $\gamma$ and a node $N$ as in Definition 2.32, we have that $N$ is $\gamma$-column-unitary iff $N^H := \overline N^{\,T}$ is $\gamma$-row-unitary. If the identity matrix $I_\gamma = I_\gamma(\gamma,\gamma)$ exists, then $\gamma$-column-unitarity is equivalent to $N^H \odot_{m(N)\setminus\gamma} N = I_\gamma$ and $\gamma$-row-unitarity is equivalent to $N \odot_{m(N)\setminus\gamma} N^H = I_\gamma$. For $H_{\alpha_\mu} = \mathbb{C}^{n_\mu}$, $\mu = 1,\dots,d$, a node $N$ is further $\gamma$-column(row)-unitary iff $N^{(\gamma),(m(N)\setminus\gamma)}$ is a column(row)-unitary matrix.


Chapter 3

Tree Tensor Networks

Tree tensor networks constitute an important class of networks and accompany us throughout the subsequent chapters. They are also summarized under the name hierarchical Tucker decomposition as in Section 2.5.3 (with minor modifications compared to [40]), including the well-known tensor train or MPS format, Section 2.5.1, and the Tucker format or HOSVD, Section 2.5.2. The canonical polyadic decomposition, Section 2.5.4, is on the contrary not a tree tensor network and shares only some properties with such. We first discuss properties of general tensor networks and their relation to graphs, and subsequently define tree tensor networks in Section 3.1.2.

Notation 3.1 (Tensor node networks). Let $V = \{w_1,\dots,w_s\}$ be an ordered set, $w_i < w_{i+1}$, $i = 1,\dots,s-1$, and let $\psi$ be an (ordered) mode label superset. We call a set
$$\mathbf{N} = \{N_{w_i}\}_{i=1}^{s} \subset \mathcal{N}_\psi$$
a tensor node network. Further, let $K \subset \psi$. We define
$$\bigodot_{K,\,v\in V} N_v := \odot_K(N_{w_1},\dots,N_{w_s}). \qquad (3.1)$$
If the product is commutative, then $V$ need not necessarily be ordered (cf. Proposition 2.17).

In the subsequent section, we formalize the graphical notation which we have already frequently used to visualize products of tensor nodes.

3.1 Graphs and Networks

In this section, we are particularly interested in the structure behind products such as in Eq. (3.1). We have discussed in Section 2.7.2 how any multiplication of tensor nodes may be transformed into this scheme.

3.1.1 Corresponding Graph

Here and in the following, we often write $m(N) \subset \psi$ in order to indicate that a node $N$ does not have duplicate mode labels.

Definition 3.2 (Corresponding graph). Let $\mathbf{N} = \{N_i\}_{i=1}^{s} \subset \mathcal{N}_\psi$ be a tensor node network with
$$m(N_i) \subset \psi, \quad i = 1,\dots,s.$$
Further, let $K \subset \psi$ and $M = \odot_K(N_1,\dots,N_s)$. We define the graph corresponding to the pair $(\mathbf{N},K)$ as the (hyper-, multi-)graph $G = (V,E)$ given by
$$V := \{1,\dots,s\}, \qquad E := \{C_\gamma \mid \gamma \in \psi\setminus m(M)\}, \qquad (3.2)$$
where $C_\gamma := \{v \in V \mid \gamma \in m(N_v)\}$. Additionally, we define the legs $L$ of $G$ as
$$L := \{C_\gamma \mid \gamma \in m(M)\}.$$
We extend the map $m$ to the set $V$,
$$m(v) := m(N_v), \quad v \in V.$$
We further define the graph corresponding to the network $\mathbf{N}$ as the one corresponding to $(\mathbf{N},\emptyset)$.

The mode labels $m(M)$ are referred to as outer mode labels, whereas all others are called inner mode labels, i.e. those which are contracted and disappear in the product.

Vice versa, given a graph $\tilde G$, we say it corresponds to the network if $\tilde G$ is equivalent to $G$ (including the legs and the function $m$ restricted to $V$).

Since the structure of the multiplication only depends on the mode labels $m(N_i)$, $i = 1,\dots,s$, the corresponding graph solely depends on the function $m$ and the set $K$. Before we carry out examples (cf. Eq. (3.3)), we discuss two further, important concepts.

Remark 3.3 (Representation / contraction map). For fixed labels $m|_V : V \to \psi$, the situation in Definition 3.2 yields a map $\tau_K = \tau_{K,m}$ defined via¹
$$\tau_K : \mathcal{D} \to H_{m(M)}, \qquad \mathcal{D} := \bigtimes_{v\in V} H_{m(v)}, \qquad \tau_K(\mathbf{N}) := \bigodot_{K,\,v\in V} N_v,$$
which is continuous and multilinear. Instead of $\tau_\emptyset$, we simply write $\tau$.

¹ We here identify $H_\gamma$ with $H_\gamma \times \{\gamma\}$.

This map is intrinsic to each network, $\tau(\mathbf{N}) := \odot_{v\in V} N_v$, and thus determines the terminology to contract a collection of nodes. The central Theorem 3.16 clarifies the range of this map (cf. Eq. (3.10)) for tree tensor networks. Further investigations into properties of this map with respect to alternating optimization are carried out in Chapter 5.

Definition 3.4 (Edge label map). Let $G$ be a graph corresponding to a network as in Definition 3.2. To any subset $S \subset V$, we assign the set
$$m(S) := \{\gamma \in \psi \mid S = C_\gamma\}.$$
We call this injective function $m$ the edge label map.

For every $S \subset V$ with $S \notin E\cup L$, we hence have $m(S) = \emptyset$. For better readability, we write $m(v) := m(\{v\})$ and $m(v,w) := m(\{v,w\})$, $v,w \in V$. In particular for $\{v,w\} \in E$, this will constitute the most frequent use of the edge label map. $G$ is a multigraph iff $|m(S)| > 1$ for at least one $S \in E\cup L$. For each set $S$ with $|m(S)| = 1$, we also treat $m(S)$ as the single value it incorporates.

The tree graph $G$ corresponding to the HT representation, Eq. (2.36), for $N_v := U_v$, $v = 1,\dots,4$, and further $N_5 := B_{12}$ as well as $N_6 := B_{34}$, is given by
$$V = \{1,\dots,6\}, \qquad E = \big\{\{1,5\},\{2,5\},\{5,6\},\{6,3\},\{6,4\}\big\}, \qquad L = \big\{\{1\},\{2\},\{3\},\{4\}\big\}, \qquad (3.3)$$
with $m(\{v\}) = \alpha_v$, $v = 1,\dots,4$, $m(1,5) = \beta_1$, $m(2,5) = \beta_2$, $m(3,6) = \beta_3$, $m(4,6) = \beta_4$ and $m(5,6) = \beta_5$. This graph becomes easier to comprehend if we for a moment identify the set $V$ with the original spelling of the tensor nodes:
$$V = \{B_{12},B_{34},U_1,\dots,U_4\},$$
$$E = \big\{\{U_1,B_{12}\},\{U_2,B_{12}\},\{B_{12},B_{34}\},\{B_{34},U_3\},\{B_{34},U_4\}\big\},$$
$$L = \big\{\{U_1\},\{U_2\},\{U_3\},\{U_4\}\big\}.$$

The representation map $\tau$ for values $r = (r_1,\dots,r_5) \in \mathbb{N}^5$ as in Section 2.5.3 is given by
$$\tau : \mathcal{D} \to H_\alpha, \qquad \mathcal{D} = \bigtimes_{\mu=1}^{4}\big(H_{\alpha_\mu}\otimes\mathbb{R}^{r_\mu}\big)\times\mathbb{R}^{r_1\times r_2\times r_5}\times\mathbb{R}^{r_3\times r_4\times r_5}.$$
The graph corresponding to the CP decomposition, Eq. (2.37), on the other hand is a hypergraph with only one edge $\{\Phi_1,\dots,\Phi_4\}$.

Given a graph $G = (V,E)$, legs $L$ and the function $m$, we can reconstruct the mode labels within a network $\mathbf{N} = \{N_v\}_{v\in V}$ via
$$m(N_v) = m(v) = \bigcup\,\{m(e) \mid e \in E\cup L,\ v \in e\}, \quad v \in V, \qquad K = \bigcup\,\{m(\ell) \mid \ell \in L,\ |\ell| > 1\},$$
such that the graph $G$ corresponds to the pair $(\mathbf{N},K)$ in the sense of Definition 3.2. In the graphical depiction of nodes, we continue to denote the mode labels $m(S)$ above the corresponding legs and edges. Further, $\{m(e) \mid e \in L\cup E\}$ is a partition of the set $\bigcup_{i=1}^{s} m(N_i)$.

Remark 3.5 (Graphs and compatible labels). A graph $G$ itself does not include mode labels. However, the node label map $m|_V$ and the edge label map $m$ on subsets $S \subset V$, as above, can both be used to construct the mode labels $m(N_v)$ of a network as well as the set $K$. This in turn defines a corresponding graph $\tilde G$. If a graph $G$ together with an edge (or node) label map is given, we hence assume that these are compatible in the sense that $G = \tilde G$.

The sets $m(v)$ and $m(S)$, however, do not necessarily need to be specified and can be viewed as underlying a fixed graph $G$.

Definition 3.6 (Graph notation). We say a (hyper)graph is a (hyper)tree if for each two nodes there exists a unique path between them.

For any graph $G = (V,E)$ and node $v \in V$, we define the sets of predecessors (parents) and descendants (children) relative to a root node $c \in V$ as
$$\operatorname{node}^-_c(v) := \{w \in V\setminus\{v\} \mid \text{there exists a path } p = (c,\dots,w,v)\},$$
$$\operatorname{node}^+_c(v) := \{w \in V\setminus\{v\} \mid \text{there exists a path } p = (c,\dots,v,w)\}.$$
Further, we let $\operatorname{node}^+_v(v) := \operatorname{neighbor}(v)$ and $\operatorname{node}^-_v(v) := \emptyset$. As the branch of a node relative to a root node $c \in V$ we define
$$\operatorname{branch}_c(v) := \{v\} \cup \{w \in V \mid \text{there exists a path } p = (c,\dots,v,\dots,w)\}.$$
We say $v$ is a leaf (relative to the root $c \in V$) if $\operatorname{node}^+_c(v) = \emptyset$.

A graph is a (hyper)tree iff $|\operatorname{node}^-_c(v)| \le 1$ for all $c \ne v \in V$. In that case, we also treat $\operatorname{node}^-_c(v)$ as the single value it contains. In particular, in a tree we further have
$$\operatorname{node}^-_v(w) = v, \qquad \operatorname{node}^+_v(w) = \operatorname{neighbor}(w)\setminus\{v\}, \qquad \operatorname{branch}_v(w) = \{w\}\cup\bigcup_{h\in\operatorname{node}^+_v(w)}\operatorname{branch}_w(h),$$
for each edge $\{v,w\} \in E$.
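A minimal Python sketch of this graph notation for an undirected tree, here instantiated with the HT example graph of Eq. (3.3) (the helper names and the adjacency-set representation are illustrative choices, not part of the thesis' formalism):

```python
# V = {1,...,6} with edges as in Eq. (3.3)
edges = [(1, 5), (2, 5), (5, 6), (6, 3), (6, 4)]
adj = {v: set() for v in range(1, 7)}
for v, w in edges:
    adj[v].add(w)
    adj[w].add(v)

def node_minus(c, v):
    """Parent set of v relative to the root c (empty if v == c)."""
    if v == c:
        return set()
    parent, stack = {c: None}, [c]
    while stack:                       # DFS from c recording predecessors
        cur = stack.pop()
        for h in adj[cur]:
            if h not in parent:
                parent[h] = cur
                stack.append(h)
    return {parent[v]}

def node_plus(c, v):
    """Children of v relative to the root c (all neighbors if v == c)."""
    return adj[v] - node_minus(c, v)

def branch(c, v):
    """branch_c(v): v together with everything behind v as seen from c."""
    out, todo = {v}, [(v, node_minus(c, v))]
    while todo:
        cur, par = todo.pop()
        for h in adj[cur] - par - out:
            out.add(h)
            todo.append((h, {cur}))
    return out

assert node_minus(5, 6) == {5} and node_plus(5, 6) == {3, 4}
assert branch(5, 6) == {3, 4, 6}
```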


3.1.2 Definition of Tree Tensor Networks

In tree tensor networks, as defined below, we additionally prohibit duplicate mode names from appearing in any one node. They hence in particular fulfill the conditions of Proposition 2.17, through which the multiplication of the contained nodes is commutative and associative.

Definition 3.7 (Tree tensor networks). Let $\mathbf{N} = \{N_i\}_{i=1}^{s} \subset \mathcal{N}_\psi$ with
$$m(N_i) \subset \psi, \quad i = 1,\dots,s,$$
and let $G = (V,E)$ be the graph corresponding to $\mathbf{N}$. We say $\mathbf{N}$ is a (hyper)tree tensor network if $G$ is a (hyper)tree graph (and not a multigraph).

The Tucker, tensor train and hierarchical Tucker formats, as introduced in Section 2.5, constitute tree tensor networks. The canonical polyadic (CP) decomposition however is a hypertree tensor network (by our definition). The cyclic tensor train format and PEPS, Sections 2.5.5 and 2.5.6, are neither tree nor hypertree tensor networks.

For tree tensor networks, the edge label function $m$ is particularly simple, since for every $\{v,w\} \in E$ we have
$$m(v,w) = m(v)\cap m(w).$$
This is however not the case for hypertrees. Note that $\psi$ in Definition 3.7 is meant to be an arbitrary (but still ordered) superset of mode labels. Later, we use two different mode label sets, in particular $\alpha$ and $\beta$, for specific instances of tree tensor networks,
$$\mathbf{N} = \{N_v\}_{v\in V} \subset \mathcal{N}_{\alpha\cup\beta}.$$
While the single Hilbert spaces $H_{\alpha_\mu}$, $\mu = 1,\dots,d$, remain arbitrary, we usually have $H_{\beta_i} = \mathbb{R}^{d(\beta_i)}$ for a $d(\beta_i) \in \mathbb{N}$ given by context, for indices $i = 1,\dots,|E|$.

Definition 3.8 (Redundancy). We call a tree or tree tensor network redundant if there is a node $u \in V$ which is contained in at most two edges $e \in E$, but not in any leg $\ell \in L$.

Such a tensor node $N_u$ can be merged with any neighboring node $N_w$ without reducing the expressiveness of the network. That is, for fixed $\gamma = m(N_u \odot N_w)$ and $m|_V$, we have
$$\operatorname{range}(\tau_m) = \operatorname{range}(\tau_{\tilde m}),$$
where $\tilde m$ is a mode label function defined on the vertices $\tilde V$ of a modified network,
$$\tilde V := V\cup\{\{u,w\}\}\setminus\{u,w\}, \qquad \tilde m(h) := \begin{cases}\gamma & \text{if } h = \{u,w\},\\ m(h) & \text{otherwise.}\end{cases}$$
Explicitly, this means that
$$\big\{\odot_{v\in V} N_v \mid N_v \in \mathcal{N}_\psi,\ m(N_v) = m(v)\big\} = \big\{\big(\odot_{v\in V\setminus\{u,w\}} N_v\big)\odot N_{\{u,w\}} \mid N_v, N_{\{u,w\}} \in \mathcal{N}_\psi,\ m(N_v) = m(v),\ m(N_{\{u,w\}}) = \gamma\big\}.$$
The node $N_{\{u,w\}} \in \mathcal{N}_\psi$ hence corresponds to the combined neighbors $N_u \odot N_w$.

3.2 Orthogonality in Networks

Orthogonality (cf. Theorem 2.33 and Definition 3.9) is of particular importance for tree tensor networks, since they do not contain loops. Certain procedures depend on this property, and hence are not compatible with, for example, the CP format (Section 2.5.4).


3.2.1 Transitivity Properties

Definition 3.9 (Orthogonalized tree). Let $\mathbf{N} = \{N_v\}_{v\in V}$ be a (hyper)tree tensor network corresponding to the (hyper)tree $G = (V,E)$. We say the network is orthogonal with respect to $N_c$, $c \in V$, or just $N_c$- or $c$-orthogonal, if for all $v \in V$ and $h \in \operatorname{node}^+_c(v)$, the node $N_h$ is $m(v)\cap m(h)$-orthogonal.

The orthogonality of tree tensor networks is a meaningful concept which reappears in various assertions and algorithms. The gauge conditions it introduces are a weaker version of the ones established by the tree SVD in Theorem 3.16.

Lemma 3.10 (Orthogonality of branches). Let $\mathbf{N} = \{N_v\}_{v\in V}$ be a tree tensor network corresponding to the (hyper)tree $G = (V,E)$ that is $c$-orthogonal for a node $c \in V$. Further, let $H_\gamma = \mathbb{R}^{r_\gamma}$, $r_\gamma = d(\gamma) \in \mathbb{N}$, for all inner mode labels $\gamma \in \bigcup_{e\in E} m(e)$. Then for $H \subset V$,
$$\delta := \bigcup_{h\in H} m(h)\cap m(\operatorname{node}^-_c(h)), \qquad W := \operatorname{branch}_c(H) := \bigcup_{h\in H}\operatorname{branch}_c(h) \subset V,$$
and every $\tilde\delta \subset \delta$, the tensor node $T := \bigodot^{\tilde\delta}_{w\in W} N_w$ is $\delta$-orthogonal.

Proof. Since by assumption $\mathbf{N}$ is $c$-orthogonal, every $N_v$, $v \in W$, is $m(v)\cap m(\operatorname{node}^-_c(v))$-orthogonal. We prove the assertion by induction over smaller branches contained in $W$. Let $h \in H$. Then, by the induction hypothesis, for every $v \in \operatorname{node}^+_c(h)$, the product
$$B_{c,v} := \odot_{w\in\operatorname{branch}_c(v)} N_w$$
is $\gamma := m(v)\cap m(h)$-orthogonal. By Theorem 2.33 and Corollary 2.35, the branch product
$$B_{c,h} := \odot_{w\in\operatorname{branch}_c(h)} N_w = N_h \odot_\gamma \Big(\bigodot^{\gamma}_{v\in\operatorname{node}^+_c(h)} B_{c,v}\Big)$$
is $m(h)\cap m(\operatorname{node}^-_c(h))$-orthogonal as well. Let now $\tilde\delta =: \{\gamma_1,\dots,\gamma_k\}$ and $\delta\setminus\tilde\delta =: \{\tilde\gamma_1,\dots,\tilde\gamma_{\tilde k}\}$. For each $i = 1,\dots,k$, the product
$$Y_i := \bigodot^{\gamma_i}_{h\in H:\ \gamma_i\in m(h)} B_{c,h}$$
is $\gamma_i$-orthogonal due to Corollary 2.35. Likewise, for each $i = 1,\dots,\tilde k$, the product
$$\tilde Y_i := \bigodot_{h\in H:\ \tilde\gamma_i\in m(h)} B_{c,h}$$
is $\tilde\gamma_i$-orthogonal by Theorem 2.33. The tensor product of these factors,
$$T = Y_1 \odot_\emptyset \dots \odot_\emptyset Y_k \odot_\emptyset \tilde Y_1 \odot_\emptyset \dots \odot_\emptyset \tilde Y_{\tilde k},$$
is hence $\delta$-orthogonal.

By Lemma 3.10, each product of disjoint branches is orthogonal, relative to the root node $c$ to which the network is orthogonalized (cf. Fig. 3.1).

Figure 3.1: A $c$-orthogonal tree tensor network, with the branch products $\odot_{w\in\operatorname{branch}_c(v_1)} N_w$ and $\odot_{w\in\operatorname{branch}_c(v_2)} N_w$ attached to $N_c$ via $\gamma_1$ and $\gamma_2$. The node product $T$ (Lemma 3.10 for $H = \{v_1,v_2\}$) is $\gamma$-orthogonal, $\gamma = \{\gamma_1,\gamma_2\}$. The current root node is surrounded with a black circle.


Theorem 3.11 (Extension of SVDs). Let $\mathbf{N} = \{N_v\}_{v\in V}$ be a $c$-orthogonal tree tensor network corresponding to the tree $G = (V,E)$ for some root node $c \in V$, and let $T = \odot_{v\in V} N_v$. Further, let $\gamma \subset m(c)$ and let $N_c = \odot(U,\sigma,V_t)$, $\sigma = \sigma(\beta)$, be a node SVD of $N_c$ with respect to $\gamma$. Then, for $H := \{v \in \operatorname{neighbor}(c) \mid m(v,c) \subset \gamma\}$, the nodes
$$\tilde U = U \odot_{v\in\operatorname{branch}_c(H)} N_v, \qquad \tilde V_t = V_t \odot_{v\in\operatorname{branch}_c(\operatorname{neighbor}(c)\setminus H)} N_v$$
form an SVD of $T = \odot(\tilde U,\sigma,\tilde V_t)$ with respect to $m(\tilde U)\setminus\{\beta\}$.

Figure 3.2: Extension of the SVD of $N_c$ into $U$, $\sigma$, $V_t$ to an SVD of $T$ in a $c$-orthogonal network (Theorem 3.11). The resulting network is orthogonal with respect to the singular value node $\sigma$.

Proof. Let $B_{c,H} := \odot_{v\in\operatorname{branch}_c(H)} N_v$ and $B_{c,H^c} := \odot_{v\in\operatorname{branch}_c(\operatorname{neighbor}(c)\setminus H)} N_v$. First, since $\mathbf{N}$ is a tree tensor network, it holds that
$$\odot(\tilde U,\sigma,\tilde V_t) = \odot(U\odot B_{c,H},\ \sigma,\ V_t\odot B_{c,H^c}) = \odot\big(B_{c,H},\ \odot(U,\sigma,V_t),\ B_{c,H^c}\big) = \odot(B_{c,H},N_c,B_{c,H^c}) = N_c \odot_{v\in\operatorname{branch}_c(\operatorname{neighbor}(c))} N_v = T.$$
Further, $B_{c,H}$ is $\gamma$-orthogonal by Lemma 3.10, $\gamma = m(B_{c,H})\cap m(U)$. Since $U$ is $\beta$-orthogonal, the product $U\odot B_{c,H}$ is $\beta$-orthogonal as well by Theorem 2.33. Analogously, it follows that $V_t\odot B_{c,H^c}$ is $\beta$-orthogonal. Hence, $(\tilde U,\sigma,\tilde V_t)$ is an SVD of $T$.

The assumption in Theorem 3.11 is stricter than necessary. The network need not be $c$-orthogonal; it suffices that the single branch products fulfill the relevant orthogonality conditions. We emphasize an important consequence:

Corollary 3.12. In the situation of Theorem 3.11, both the rank and the singular values of $T$ with respect to $m(\tilde U)\setminus\{\beta\}$ (must) equal those of $N_c$ with respect to $\gamma$.

Proof. Follows directly by the uniqueness of singular values.

3.2.2 Orthogonalization of a Tree Tensor Network

In any tree tensor network, the gauge conditions can be adapted in order to obtain $c$-orthogonality, for any one $c \in V$ at a time (cf. Fig. 3.3). This is seldom possible in other kinds of networks, not even in hypertrees.

Proposition 3.13 (Orthogonalization). Let $\mathbf{N} = \{N_v\}_{v\in V}$ be a tree tensor network corresponding to the graph $G = (V,E)$ and $c \in V$ a root node. Further, let $H_\gamma = \mathbb{R}^{r_\gamma}$, $r_\gamma = d(\gamma) \in \mathbb{N}$, for all inner mode labels $\gamma \in \bigcup_{e\in E} m(e)$. Then there exists a network $\tilde{\mathbf{N}} = \{\tilde N_v\}_{v\in V}$ with $m(\tilde N_v) = m(v)$, $v \in V$ (but possibly smaller mode sizes $\tilde r_\gamma < r_\gamma$), such that
$$\odot_{v\in V} N_v = \odot_{v\in V} \tilde N_v,$$
where the network $\tilde{\mathbf{N}}$ is $c$-orthogonal. Furthermore, for any $v \in V$, it holds that
$$\tilde N_v = \odot_{w\in\operatorname{node}^+_c(v)} R_w \odot N_v \odot R_v^\dagger$$
for tensor nodes
$$R_v = R_v(\gamma,\gamma), \qquad \gamma = m(v,\operatorname{node}^-_c(v)), \quad v \in V\setminus\{c\},$$
and pseudoinverses $(R_v^\dagger)^{(\gamma),(\gamma)} := \big(R_v^{(\gamma),(\gamma)}\big)^\dagger$.

Proof. This follows using successive QR-decompositions along the paths towards $c$ as in Algorithm 1. In that case, $R_v$ stems from the QR-decomposition (with respect to $\gamma$)
$$\odot_{w\in\operatorname{node}^+_c(v)} R_w \odot N_v = Q \odot R_v,$$
as illustrated in Fig. 3.3.

Figure 3.3: Orthogonalization through successive QR-decompositions along the chain $N_{v_4}, N_{v_3}, N_{v_2}$ towards $N_c$, inserting the factors $R_{v_i} \odot R_{v_i}^\dagger$ on the edges $\beta_i$. The network $\tilde{\mathbf{N}}$ is indicated through the highlighted areas. The order of the QR-decompositions is irrelevant as long as they proceed from the leaves to the root.

The recursion in Algorithm 1 is an archetype which reappears in several forthcoming algorithms and is a means of running through all nodes in root-to-leaves order or, vice versa, leaves-to-root order, following the notation scheme in Fig. 3.4.

If a given network is already $c_1$-orthogonal, then a corresponding $c_2$-orthogonal network can be calculated by just performing subsequent QR-decompositions along the path $(c_1,\dots,c_2)$ as in Algorithm 2, instead of the more costly Algorithm 1. In both cases, one may replace the QR-decompositions with corresponding SVDs.

Figure 3.4: Hierarchy of nodes as in root-to-leaves and leaves-to-root algorithms: a path $c,\dots,p,b$ with $h \in \operatorname{node}^+_c(b)$, and the sets $\operatorname{branch}_c(b)$ and $\operatorname{node}^-_c(b)$ indicated. Since the letters r and n are already taken, we have to content ourselves with slightly more creative memorization: c is the center of the tree, p is the parent (a letter with a line pointing down), b as in branch is a letter with a line pointing up, and so is h, which is the closest-looking letter to n as in neighbor.


Algorithm 1 Recursive orthogonalization of a network

Input: a tree tensor network $\mathbf{N} = \{N_v\}_{v\in V}$ corresponding to a tree graph $G = (V,E)$ and a root node $c \in V$
Output: an equivalent network $\mathbf{N}$ which is $c$-orthogonal

1: procedure ortho(N, c)
2:   declare N a nested variable regarding orthorec
3:   orthorec(c, ∅)
4:   return N                                      ▷ N is c-orthogonal
5: end procedure

6: function orthorec(b, P)                          ▷ |P| ≤ 1
7:   for h ∈ neighbor(b), h ∉ P do                  ▷ may be performed in parallel
8:     orthorec(h, {b})                             ▷ node⁺_c(b) = neighbor(b) \ P
9:   end for
10:  if ∅ ≠ P =: {p} then
11:    γ ← m(p, b)                                  ▷ m(p, b) = m(p) ∩ m(b)
12:    do a QR-decomposition of N_b with respect to γ:      ▷ cf. Definition 2.37
         N_b = Q ⊙ R,  R = R(γ, γ)
13:    N_b ← Q                                      ▷ Q is γ-orthogonal
14:    N_p ← R ⊙ N_p                                ▷ mode size d(γ) might be reduced now
15:  end if
16: end function

Algorithm 2 Successive QR-decompositions from c₁ to c₂

Input: a $c_1$-orthogonal tree tensor network $\mathbf{N} = \{N_v\}_{v\in V}$ corresponding to a tree graph $G = (V,E)$ and another node $c_2 \in V$
Output: an equivalent network $\mathbf{N}$ which is $c_2$-orthogonal

1: procedure pathqr(N, c₁, c₂)
2:   let p = (c₁, ..., c₂) be the path within G from c₁ to c₂
3:   for i = 1, ..., length(p) − 1 do
4:     γ ← m(p_i, p_{i+1})                          ▷ m(p_i, p_{i+1}) = m(p_i) ∩ m(p_{i+1})
5:     do a QR-decomposition of N_{p_i} with respect to γ:   ▷ cf. Definition 2.37
         N_{p_i} = Q ⊙ R,  R = R(γ, γ)
6:     N_{p_i} ← Q                                  ▷ Q is γ-orthogonal
7:     N_{p_{i+1}} ← R ⊙ N_{p_{i+1}}                ▷ possibly shrinks the mode size d(γ)
8:   end for
9:   return N
10: end procedure
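A minimal NumPy sketch of such a leaves-to-root QR sweep for the special case of a tensor train (a path graph, so Algorithms 1 and 2 coincide); the shapes, ranks and the orientation towards the last core as root are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(11)
n, r = [3, 4, 5, 3], [2, 3, 2]
G = [rng.standard_normal((n[0], r[0])),
     rng.standard_normal((r[0], n[1], r[1])),
     rng.standard_normal((r[1], n[2], r[2])),
     rng.standard_normal((r[2], n[3]))]
T_before = np.einsum("ap,pbq,qcs,sd->abcd", *G)

# Sweep towards the root: QR of each core's unfolding, push R into the parent
for mu in range(len(G) - 1):
    core = G[mu]
    mat = core.reshape(-1, core.shape[-1])                # rows: all but the outgoing edge
    Q, R = np.linalg.qr(mat, mode="reduced")
    G[mu] = Q.reshape(*core.shape[:-1], Q.shape[1])
    G[mu + 1] = np.tensordot(R, G[mu + 1], axes=(1, 0))   # absorb R into the parent core

T_after = np.einsum("ap,pbq,qcs,sd->abcd", *G)
assert np.allclose(T_before, T_after)                     # the represented tensor is unchanged

# Each non-root core is now orthogonal with respect to its edge towards the root
for mu in range(len(G) - 1):
    mat = G[mu].reshape(-1, G[mu].shape[-1])
    assert np.allclose(mat.T @ mat, np.eye(mat.shape[1]))
```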


3.3 Tree Tensor Network Decompositions

In this section, we discuss different procedures to decompose (or truncate) a $d$-dimensional² tensor
$$T = T(\alpha) \in H_\alpha = H_{\alpha_1}\otimes\dots\otimes H_{\alpha_d}, \qquad \alpha = \{\alpha_1,\dots,\alpha_d\},$$
into tree tensor networks as in Definition 3.7. Vice versa, this determines the approximation properties (or expressiveness) of a specific type of network (cf. Eq. (3.10)).

For the following, we define $D := \{1,\dots,d\}$ as well as the set of singleton subsets $P_1(S) := \{\{s\} \subset S \mid s \in S\}$ for sets $S$.

3.3.1 One-to-One Correspondence of Edges and Subsets of Legs in Trees

A tree decomposition is, as the name suggests, a decomposition of $T$ into a tree tensor network $\mathbf{N}$, such as
$$T = \odot_{v\in V} N_v, \qquad N_v \in \mathcal{N}_{\alpha\cup\beta},\ v \in V,$$
for a corresponding tree graph $G = (V,E)$ together with legs $L$. It serves as a generalization of the low-rank decomposition of a matrix. Here, and as usual in the following, $\alpha$ is used for the outer mode labels associated to the legs $L$, whereas $\beta$ is used for the inner mode labels assigned to the edges $E$,
$$m(S) \in P_1(\alpha),\ \forall S \in L, \qquad m(S) \in P_1(\beta),\ \forall S \in E. \qquad (3.4)$$
The structure of the tree tensor network may either be given directly through (the mode labels of) $N_v$, $v \in V$, or through a suitable tree $G$. For $v,w \in V$, we define
$$b_v(w) := \bigcup\,\{m(\{h\}) \mid h \in \operatorname{branch}_v(w)\}.$$
Each set $b_v(w)$ is the collection of mode labels assigned to legs $L$ on one side of the edge $\{v,w\} \in E$. Here, it hence holds that
$$b_v(w) = m\big(\odot_{u\in\operatorname{branch}_v(w)} N_u\big)\cap\alpha.$$
For the binary HT format as in Eq. (3.3), we for example have $b_5(6) = \{\alpha_3,\alpha_4\} = m(B_{34}\odot U_3\odot U_4)\cap\alpha$ and $b_4(5) = \{\alpha_1,\alpha_2\} = m(B_{12}\odot U_1\odot U_2)\cap\alpha$ (recall that we assigned numbers to each node in this format). For networks with outer mode labels $\alpha_1,\dots,\alpha_d$³, we further define the so-called corresponding family
$$K := \{J \subset \{1,\dots,d-1\} \mid \exists\,\{v,w\} \in E:\ \alpha_J = b_v(w)\}, \qquad \alpha_J := \{\alpha_j\}_{j\in J}. \qquad (3.5)$$
There is hence a one-to-one correspondence of edges $e \in E$ to subsets $J \in K$. We will therefore, if the context is clear, reference such corresponding pairs as
$$e = e_J, \qquad J = J_e.$$
Sometimes, as for example in Chapter 5, we are not necessarily interested in the subset $J_e \subset \{1,\dots,d-1\}$ itself, but possibly in its complement. For $v,w \in V$, the more specific $J_{v,w}$ is defined via
$$\alpha_{J_{v,w}} = b_v(w).$$

² This dimension $d$ is also often referred to as the order of the tensor.
³ In other cases, $K$ may as well be defined as a set of subsets of $m(T)$.


This correspondence also extends to ranks and singular values of tensors. Some objects may hence be referred to in two different ways, using the upper index $\cdot^{(J)}$ for sets $J \in K$ or the lower index $\cdot_e$ for edges $e \in E$. Since for $e = \{v,w\} \in E$ and $\beta_{i_e} = m(v,w)$, $i_e \in \{1,\dots,|E|\}$, we have
$$T = \big(\odot_{u\in\operatorname{branch}_w(v)} N_u\big) \odot_{\beta_{i_e}} \big(\odot_{u\in\operatorname{branch}_v(w)} N_u\big),$$
it holds that
$$\operatorname{rank}_{b_v(w)}(T) \le d(\beta_{i_e}).$$
If the above inequality does not yet hold with equality, then the network $\mathbf{N}$ can be truncated without changing $T$, for example by the orthogonalization procedure in Proposition 3.13.

Definition 3.14 (Hierarchical family). A hierarchical family $K$ is a set of subsets $K \subset \{J \mid J \subset D\}$ for which the hierarchy condition
$$J \subset S\ \vee\ S \subset J\ \vee\ J\cap S = \emptyset \qquad (3.6)$$
is fulfilled, as well as $J \ne D\setminus S$, for all $J,S \in K$.

These families are closely related to the so-called mode cluster trees as described in [40]. The family $K$ as defined by Eq. (3.5) is such a hierarchical family. Vice versa, a hierarchical family uniquely (up to equivalence) defines a tree $G$, a network $\mathbf{N}$ (with outer mode labels $\alpha$) and a map $m$ such that
$$\{\{J,D\setminus J\} \mid J \in K\} = \{\{S,D\setminus S\} \mid \exists\,\{v,w\} \in E:\ \alpha_S = b_v(w)\}. \qquad (3.7)$$
We discuss this relation further in Section 3.5.1. For now, we assume that $G$ and $m$ are already given.

Of particular interest are tensors that exhibit low-rank properties with respect to the different mode subsets $\alpha_J$, $J \in K$:

Definition 3.15 (Low-rank tensor spaces). For a (not necessarily hierarchical) family $K \subset \{J \mid J \subset \{1,\dots,d-1\}\}$ and values $k = \{k^{(J)}\}_{J\in K} \in \mathbb{N}^K$, we define
$$\mathcal{T}_{\le k,K}(H_\alpha) := \bigcap_{J\in K}\{T \in H_\alpha \mid \operatorname{rank}_{\alpha_J}(T) \le k^{(J)}\}.$$
We further define
$$\mathcal{T}_{k,K}(H_\alpha) := \bigcap_{J\in K}\{T \in H_\alpha \mid \operatorname{rank}_{\alpha_J}(T) = k^{(J)}\}.$$
If $K$ or $H_\alpha$ are clear by context, we may skip them.

As an intersection of a finite number of closed sets, $\mathcal{T}_{\le k,K}(H_\alpha)$ is closed as well. As discussed above, if $K$ is hierarchical (implied by or given through a graph $G$), we may as well reference the values $k$ by edges, $k^{(J)} = k_{e_J}$. Further, for such families, the set $\mathcal{T}_{k,K}(H_\alpha)$, $H_\alpha = \mathbb{R}^{n_1\times\dots\times n_d}$, is a manifold, as shown in [98].

3.3.2 Tree Decomposition and Tree SVD

In the following theorem, part (1) is the tree tensor network analogue of a low-rank matrix decomposition, while (2) is the analogue of a matrix singular value decomposition, described in Algorithm 3 and shown in Fig. 3.5. A detailed discussion and comparison to the literature can be found below the theorem and further below its proof.


Theorem 3.16 (Tree decomposition and tree SVD). Let $T = T(\alpha) \in H_\alpha$ be a $d$-dimensional tensor. Further, let $G = (V,E)$ be a (non-redundant) tree with $d$ singleton legs $L \subset \{\{v\} \mid v \in V\}$ and an injective edge label function $m$ as in Eq. (3.4), for distinct mode labels $\alpha = \{\alpha_1,\dots,\alpha_d\}$, $\beta = \{\beta_1,\dots,\beta_{|E|}\}$.

Assume that for each edge $e = \{v,w\} \in E$, the rank (cf. Sections 2.6 and 3.3.1)
$$r_e = r_e(T) := \operatorname{rank}_{\alpha_{J_e}}(T) \qquad (3.8)$$
is finite. Further, let $i_e \in \{1,\dots,|E|\}$ such that $\beta_{i_e} = m(e)$ and $H_{\beta_{i_e}} := \mathbb{R}^{r_e}$, as well as
$$\sigma_e = \sigma_e(\beta_{i_e}) = \operatorname{sv}_{\alpha_{J_e}}(T), \quad e \in E,$$
be the singular values of $T$ with respect to $\alpha_{J_e}$, and $\Sigma_e := \operatorname{diag}_{\beta_{i_e}}(\sigma_e)$.

Then the tensor $T$ has a tree decomposition as well as a tree SVD:

(1) Tree decomposition: There exists a tree tensor network $\mathbf{N} = \{N_v\}_{v\in V} \subset \mathcal{N}_{\alpha\cup\beta}$ corresponding to $G$ such that
$$T = \odot_{v\in V} N_v. \qquad (3.9)$$

(2) Tree SVD: There exists an essentially unique network (cf. Fig. 3.5),
$$\mathbf{N}^\sigma := \{N^\sigma_\nu\}_{\nu\in\mathcal V} = \{N_v\}_{v\in V}\cup\{\sigma_e\}_{e\in E},$$
such that $\{N_v\}_{v\in V}$ corresponds to $G$ and $\mathbf{N}^\sigma$ corresponds to the hypergraph $\mathcal G = (\mathcal V,\mathcal E)$,
$$\mathcal V := V\cup E, \qquad \mathcal E := \{e\cup\{e\} \mid e \in E\},$$
with legs $\mathcal L = L$ and the following properties:

i) It holds that
$$T = \odot_{\nu\in\mathcal V} N^\sigma_\nu.$$

ii) For each edge $e = \{\omega_1,\omega_2\} \in E$,
$$N_{\omega_2} \odot_{h\in(\operatorname{neighbor}(\omega_2)\cap V)\setminus\{\omega_1\}} \Sigma_{\{\omega_2,h\}}$$
is orthogonal with respect to $\beta_{i_e}$.

Let $\overline{\mathbf{N}}^\sigma$ be another such network. Essentially unique here means that for each $e \in E$, there exists a $\beta_{i_e}$-(column and row)-orthogonal node $W_e = W_e(\beta_{i_e},\beta_{i_e})$, which commutes with $\Sigma_e$, such that for all $\omega \in V$ it holds that
$$\overline N_\omega = N_\omega \odot_{v\in\operatorname{neighbor}(\omega)\cap V} W_{v,\omega}.$$
Furthermore, i) $\wedge$ ii) is equivalent to the following statement:

iii) For each hyperedge $\{\omega_1,\omega_2,e\} \in \mathcal E$, $e \in E$, the triplet
$$U = \odot_{\nu\in\operatorname{branch}_{\omega_1}(\omega_2)} N^\sigma_\nu, \qquad \sigma_e, \qquad V_t = \odot_{\nu\in\operatorname{branch}_{\omega_2}(\omega_1)} N^\sigma_\nu$$
is an SVD of $T$ with respect to $m(U)\setminus\{\beta_{i_e}\} = \alpha_{J_{\omega_1,\omega_2}} = b_{\omega_1}(\omega_2)$.

(ω2).


Note that the hypertree $\mathcal G$ has essentially the same structure as $G$, but with an additional node $\sigma_e$ added to each edge $e$ between two neighboring cores. In terms of the representation map as in Remark 3.3, the theorem in turn implies that
$$\operatorname{range}(\tau_m) = \mathcal{T}_{\le k,K}(H_\alpha), \qquad (3.10)$$
where $\tau_m$ is the contraction map defined through the labels in $\mathbf{N}$.

Figure 3.5: Tree SVD $\mathbf{N}^\sigma$ of a tensor as specified by the graph $G = (V,E)$, $V = \{1,\dots,5\}$, $E = \{\{1,3\},\{2,3\},\{3,4\},\{4,5\}\}$ and legs $L = \{\{1\},\{2\},\{4\},\{5\}\}$. The graph corresponding to $\mathbf{N}^\sigma$ is given by $\mathcal G = (\mathcal V,\mathcal E)$, $\mathcal V = \{1,\dots,5,\{1,3\},\{2,3\},\{3,4\},\{4,5\}\}$, $\mathcal E = \{\{1,3,\{1,3\}\},\{2,3,\{2,3\}\},\{3,4,\{3,4\}\},\{4,5,\{4,5\}\}\}$ and legs $\mathcal L = L$. Each node combined with all but one neighbor $\Sigma_e$ (for example $N_3\odot\Sigma_{3,1}\odot\Sigma_{3,2}$ or $N_4\odot\Sigma_{4,5}$) is orthogonal with respect to the remaining mode label $m(e)$. The framed node $\sigma_{3,4}$ together with the products surrounded by dotted lines forms an SVD of $T$ with respect to $(m(N_1)\cup m(N_2))\cap\alpha = b_4(3)$. The hypertree network, given by combining the two shaded areas into two single nodes, is orthogonal with respect to $\sigma_{3,4}$.

The essential uniqueness is a generalization of the matrix case: both $U\Sigma V^T$ and $\overline U\Sigma\overline V^T$ are truncated SVDs of a matrix $A$ if and only if there exists an orthogonal matrix $W$ that commutes with $\Sigma$ and for which $\overline U = UW$ and $\overline V = VW$. For any subset of pairwise distinct nonzero singular values, the corresponding submatrix of $W$ needs to be diagonal with entries in $\{-1,1\}$.

In the matrix product state literature, the tree SVD for the tensor train format (cf. Section 3.4.1) is known as the canonical form [100], but also appears as standard representation and normal form. For the Tucker format, part ii) incorporates the all-orthogonality condition of the core tensor (cf. Eq. (3.20) and Section 3.4.2, [21]). As we have discussed in Section 2.5.3, we omit the root node in hierarchical Tucker formats, but this role can be taken by any single one of the nodes $\sigma_e$, $e \in E$. The conditions referred to as orthogonality of nested frames (cf. Eq. (3.21), [40]) are then equivalent to the $e$-orthogonality of the network $\{N_v\}_{v\in V}\cup\{\sigma_e\}$ (cf. property iii)). In Section 3.3.3, we briefly remark on the relation to so-called minimal subspaces.

The tensor T does not necessarily need to have mode labels α, but may be any object that can be interpreted as such. For example, we may be given T = T(δ, λ) ∈ R^{m_1×...×m_d×n_1×...×n_d}, but still apply the tree decomposition interpreting α_µ = {δ_µ, λ_µ}, i.e. H_{α_µ} = H_{δ_µ} ⊗ H_{λ_µ}.

Proof.
• Uniqueness: Assume that property iii) holds. The nodes σ_e, e ∈ E, are, as singular values of T, unique. Due to the essential uniqueness of ordinary singular vectors, for each s ≠ v ∈ V and

    U_{s,v} := ⨀_{µ∈branch_s(v)⊂Ṽ} N^σ_µ    (3.11)

(analogously Ũ_{s,v} for Ñ^σ) we have

    Ũ_{s,v} = U_{s,v} ⊙ W_e,    e = {v, node⁻_s(v)},    (3.12)


where W_e = W_e(β_{i_e}, β_{i_e}) is a β_{i_e}-(column and row)-orthogonal matrix that commutes with Σ_e, that is

    W_e ⊙ Σ_e = Σ_e ⊙ W_e.    (3.13)

For ω ∈ V, it follows

    ⨀_{v∈node⁺_s(ω)∩V} (U_{s,v} ⊙ Σ_{v,ω}) ⊙ N_ω ⊙ W_{node⁻_s(ω),ω}
      =^{(3.11)} U_{s,ω} ⊙ W_{node⁻_s(ω),ω}
      =^{(3.12)} Ũ_{s,ω}
      =^{(3.11)} ⨀_{v∈node⁺_s(ω)∩V} (Ũ_{s,v} ⊙ Σ_{v,ω}) ⊙ Ñ_ω
      =^{(3.12)} ⨀_{v∈node⁺_s(ω)∩V} (U_{s,v} ⊙ W_{v,ω} ⊙ Σ_{v,ω}) ⊙ Ñ_ω
      =^{(3.13)} ⨀_{v∈node⁺_s(ω)∩V} (U_{s,v} ⊙ Σ_{v,ω}) ⊙ ⨀_{v∈node⁺_s(ω)∩V} W_{v,ω} ⊙ Ñ_ω.

Since the mode sizes d(β_{i_e}), β_{i_e} = m(e), e ∈ E, are given by the corresponding ranks r_e = rank_{α_{J_e}}(T) of T, i.e. the number of nonzero singular values, the mapping

    H ↦ ⨀_{v∈node⁺_s(ω)∩V} (U_{s,v} ⊙ Σ_{v,ω}) ⊙ H,    H = H(m(N_ω)),

is injective. Hence,

    ⨀_{v∈node⁺_s(ω)∩V} W_{v,ω} ⊙ Ñ_ω = N_ω ⊙ W_{node⁻_s(ω),ω}
    ⇔ Ñ_ω = N_ω ⊙ ⨀_{v∈neighbor(ω)∩V} W_{v,ω},

proving the essential uniqueness of N^σ.

• Existence of N: Let c be the root node required by Algorithm 3. Furthermore, let V′ ⊂ V be the set of node indices already assigned through line 16 and T^{V′} be the variable T at that state. Due to the construction, it holds

    T^∅ = T^{V′} ⊙ ⨀_{v∈V′} N_v = ⨀_{v∈V} N_v,

where the network

    N_{V′} := {T^{V′}} ∪ {N_v}_{v∈V′}    (3.14)

is, by construction, orthogonal with respect to the node T^{V′}. Further, in line 17, in which σ_e is assigned, e = {b, p}, p = node⁻_c(b), it holds V′′ := branch_c(b) ∩ V ⊂ V′. Due to the orthogonality constraints, we have that

    rank_{m(b)\m(e)}(T^{V′\{b}}) = rank_{α_{J_{p,b}}}(T^∅) =: r_e

is finite. Hence, the SVD in line 14 is finite and

    (⨀_{v∈V′′} N_v,  σ_e,  ⨀_{v∈V′\V′′} N_v ⊙ (Σ_e^{-1} ⊙ T^{V′}))    (3.15)

is an SVD of T^∅ (cf. Theorem 3.11). We can therefore indeed assign H_{β_{i_e}} = R^{r_e}.

• Property i): Let c ∈ V be the root node required by Algorithm 3. By construction,

    ⨀_{ν∈Ṽ} N^σ_ν = ⨀_{v∈V} (N̄_v ⊙ ⨀_{ω∈node⁺_c(v)∩V} Σ_{ω,v})
      =^{line 22} ⨀_{v∈V} ((N_v ⊙ ⨀_{ω∈node⁺_c(v)∩V} Σ^{-1}_{ω,v}) ⊙ ⨀_{ω∈node⁺_c(v)∩V} Σ_{ω,v})    (3.16)
      = ⨀_{v∈V} N_v =^{(3.9)} T.


• Equivalence of i) ∧ ii) and iii): "⇒": Given i) and ii), property iii) follows under use of the transitivity of orthogonality (cf. Theorem 2.33).

"⇐": Assume iii) holds. We can use Theorem 2.33 as well. Let V̄ = (neighbor(ω_2) ∩ V) \ {ω_1}. Since the second left-hand factor is m(N_{ω_2}) \ m({ω_1, ω_2})-orthogonal and the right-hand side in

    (N_{ω_2} ⊙ ⨀_{h∈V̄} Σ_{ω_2,h}) ⊙ (⨀_{h∈V̄} U_{ω_2,h}) = U_{ω_1,ω_2}

is m({ω_1, ω_2})-orthogonal, so is the first left-hand factor, which was to be shown.

• Properties iii) and ii): Let {ω_1, ω_2, e} ∈ Ẽ, e ∈ E. We have already independently proven i), so only the orthogonality of the respective U and V_t remains to be shown. Similarly to Eq. (3.16), the triplet

    (U = ⨀_{ν∈branch_{ω_1}(ω_2)} N^σ_ν,  σ_e,  V_t = ⨀_{ν∈branch_{ω_2}(ω_1)} N^σ_ν)

is exactly the one in Eq. (3.15) (possibly after switching ω_1 and ω_2). The orthogonality conditions hence follow by construction. Property ii) then follows due to the equivalence.

The orthogonality constraints towards Ñ necessarily remain invariant within the set of valid, essentially unique representations. We can also verify this directly, considering that the orthogonal W_e and Σ_e commute, whereby

    (N_{ω_2} ⊙ ⨀_{h∈neighbor(ω_2)∩V} W_{ω_2,h}) ⊙ ⨀_{h∈node⁺_{ω_1}(ω_2)∩V} Σ_{ω_2,h}
    = (N_{ω_2} ⊙ ⨀_{h∈node⁺_{ω_1}(ω_2)∩V} Σ_{ω_2,h}) ⊙ ⨀_{h∈neighbor(ω_2)∩V} W_{ω_2,h},

such that Ñ_{ω_2} ⊙ ⨀_{h∈node⁺_{ω_1}(ω_2)∩V} Σ_{ω_2,h} is m({ω_1, ω_2})-orthogonal as well.

To construct the HT-representation as in Eq. (2.36), for example, the sequence of SVDs given below may be performed. The mode labels of the nodes are indicated below them and the mode label(s) over each arrow indicate an SVD with respect to these. Note that Algorithm 3 may be run in parallel. In other words, all SVDs can be performed in arbitrary order, as long as they proceed in leaves-to-root order:

    T(α)  --α_1-->  U_1(α_1, β_1),  σ_{1,5}(β_1),  V_1(β_1, α_2, ..., α_4)    (3.17)

    Σ_{1,5} ⊙ V_1 (β_1, α_2, ..., α_4)  --α_2-->  U_2(α_2, β_2),  σ_{2,5}(β_2),  V_{12}(β_1, β_2, α_3, α_4)    (3.18)

    Σ_{2,5} ⊙ V_{12} (β_1, β_2, α_3, α_4)  --α_3-->  U_3(α_3, β_3),  σ_{3,6}(β_3),  V_{123}(β_1, ..., β_3, α_4)    (3.19)

    Σ_{3,6} ⊙ V_{123} (β_1, ..., β_3, α_4)  --α_4-->  U_4(α_4, β_4),  σ_{4,6}(β_4),  C(β_1, ..., β_4)    (3.20)

    Σ_{4,6} ⊙ C (β_1, ..., β_4)  --β_1, β_2-->  B_{12}(β_1, β_2, β_5),  σ_{5,6}(β_5),  B_{34}(β_5, β_3, β_4)    (3.21)

The steps Eq. (3.17) to Eq. (3.20) and C̄ := Σ_{4,6} ⊙ C give a Tucker decomposition (C̄, U_1, ..., U_4) as in Eq. (2.35) for d = 4, where C̄ is all-orthogonal. Additionally, C̃ := Σ^{-1}_{1,5} ⊙ ... ⊙ Σ^{-1}_{4,6} ⊙ C̄ gives the Tucker tree SVD with graph

    Ṽ = {C̃, U_1, ..., U_4, σ_{1,5}, ..., σ_{4,6}},
    Ẽ = {{C̃, U_1, σ_{1,5}}, ..., {C̃, U_4, σ_{4,6}}}.


Algorithm 3 Recursive decomposition of a tensor (leaves-to-root)

Input: a tensor node T = T(α) and a graph G (as in Theorem 3.16) as well as a root node c ∈ V
Output: tree tensor network N such that T = ⨀_{v∈V} N_v and its tree SVD N^σ

1: procedure ltrdec(T, G, c)
2:   let N := {N_v}_{v∈V}, N̄ := {N̄_v}_{v∈V}, σ := {σ_e}_{e∈E}
3:   declare N, N̄, σ and T nested variables regarding decomprec
4:   decomprec(c, ∅)
5:   return N, N^σ = N̄ ∪ σ                ▷ N is c-orthogonal
6: end procedure

7: function decomprec(b, P)                 ▷ |P| ≤ 1
8:   for h ∈ neighbor(b), h ∉ P do          ▷ may be performed in parallel
9:     decomprec(h, {b})                    ▷ node⁺_c(b) = neighbor(b) \ P
10:  end for
11:  if ∅ ≠ P =: {p} then
12:    γ ← m(p, b)                          ▷ m(p, b) = m(p) ∩ m(b)
13:    δ ← m(b) \ γ                         ▷ δ = m(N_b) \ γ
14:    do an SVD of T with respect to δ:    ▷ cf. Definition 2.36
         T = (U, s, V_t),  s = s(γ)
15:    T ← U ⊙ T                            ▷ U ⊙ T = diag_γ(s) ⊙ V_t
16:    N̄_b, N_b ← U                         ▷ U is γ-orthogonal
17:    σ_{p,b} ← s
18:  else
19:    N̄_b, N_b ← T
20:  end if
21:  for h ∈ neighbor(b), h ∉ P do
22:    N̄_b ← N̄_b ⊙ Σ^{-1}_{h,b}
23:  end for
24: end function

Likewise⁴, with B̄_{34} := Σ_{5,6} ⊙ B_{34}, the tree tensor network (B_{12}, B̄_{34}, U_1, ..., U_4) is an HT-decomposition as in Eq. (2.36). With

    B̃_{12} := Σ^{-1}_{1,5} ⊙ Σ^{-1}_{2,5} ⊙ B_{12},    B̃_{34} := Σ^{-1}_{3,6} ⊙ Σ^{-1}_{4,6} ⊙ Σ^{-1}_{5,6} ⊙ B̄_{34},

the graph, for which we again replace the numbers in Ṽ with the spelling of the nodes,

    Ṽ = {B̃_{12}, B̃_{34}, U_1, ..., U_4, σ_{1,5}, ..., σ_{4,6}, σ_{5,6}},
    Ẽ = {{B̃_{12}, U_1, σ_{1,5}}, {B̃_{12}, U_2, σ_{2,5}}, {B̃_{34}, U_3, σ_{3,6}}, {B̃_{34}, U_4, σ_{4,6}}, {B̃_{12}, B̃_{34}, σ_{5,6}}},

corresponds to the tree SVD.

⁴ In the conventional HT-decomposition, this step would be skipped. Instead, B_{1234} := Σ_{5,6}((β_5, β_5) ↦ (β_5, β_6)) serves as root transfer tensor with neighboring nodes B_{12} and the modified B_{34}(β_5 ↦ β_6).

Remark 3.17 (Interrelation between ranks and mode sizes). The different ranks r_e and the related singular values σ_e, e ∈ E, are not independent of each other. For example, for any v ∈ V, w ∈ neighbor(v), it holds

    r_{v,w} ≤ s · ∏_{h∈neighbor(v)\{w}} r_{v,h},    (3.22)


with s = d(m(v)) if v is contained in a leg, or s = 1, otherwise.

This phenomenon, in particular the interrelation between different tuples of the singular values, is the topic of Part III.
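As a brief illustration of Eq. (3.22) for the tensor train graph of Section 3.4.1 (an added example of the special case only): a node v = µ has the single leg α_µ and the two neighbors µ − 1 and µ + 1, so with s = n_µ the inequality yields

    r_{µ,µ+1} ≤ n_µ · r_{µ−1,µ}    and    r_{µ−1,µ} ≤ n_µ · r_{µ,µ+1},

which are the familiar bounds r_µ ≤ n_µ r_{µ−1} and r_{µ−1} ≤ n_µ r_µ relating neighboring TT-ranks.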

3.3.3 Minimal, Nested Subspaces

We briefly remark on the relation to minimal subspaces (cf. [46, Chapter 6]). In the situation of Theorem 3.16, as the tree G is assumed to have d singleton legs, the finite ranks r_e, e ∈ E, of T (more specifically the existence of the network N) imply that every rank_{α_J}(T), J ⊂ D = {1, ..., d}, is finite, and thereby

    T ∈ H_{α_1} ⊗_a ... ⊗_a H_{α_d}.

In particular, the so-called minimal subspaces U^min_J ⊂ H_{α_J} associated to subsets J ⊂ D are (well-)defined as those minimal subspaces such that for every partitioning ⋃_{i=1}^k J_i = D, we have that

    T^{(α_{J_1}), ..., (α_{J_k})} ∈ U^min_{J_1} ⊗_a ... ⊗_a U^min_{J_k}.

For each edge e = {v, w} ∈ E, the minimal subspace corresponding to J_{v,w} is of dimension r_e and given by

    U^min_{J_{v,w}} = span( N_w(β_{i_e} = j) ⊙ ⨀_{u∈branch_v(w)\{w}} N_u | j = 1, ..., r_e ).

Thus, if the node w is not contained in a leg, α ∩ m(w) = ∅, then (up to permutation)

    U^min_{J_{v,w}} ⊂ ⨂_{a, h∈node⁺_v(w)} U^min_{J_{v,h}}.

Otherwise, if α ∩ m(w) = {α_j}, then

    U^min_{J_{v,w}} ⊂ (⨂_{a, h∈node⁺_v(w)} U^min_{J_{v,h}}) ⊗_a H_{α_j}.

These subspaces are therefore also called nested, and the above subset relations reflect the rank inequalities given in Remark 3.17.

3.4 Tree SVDs for TT and Tucker

The particular cases of the tree SVD for the tensor train format and the Tucker format exhibit properties which are known in the literature in different contexts, as discussed in the following. Here, sv⁺(A) denotes the positive singular values of a matrix A.

3.4.1 TT-Tree SVD / Canonical MPS

In Section 2.5.1, we have already considered the tensor train decomposition. A tensor T = T(α_1, ..., α_d) ∈ R^{n_1×...×n_d} can thereby be written as

    T = T(α) = G_1 ⊙ ... ⊙ G_d,

where G_1 = G_1(α_1, β_1) ∈ R^{n_1×r_1}, G_µ = G_µ(β_{µ−1}, α_µ, β_µ) ∈ R^{r_{µ−1}×n_µ×r_µ}, µ = 2, ..., d−1, and G_d = G_d(β_{d−1}, α_d) ∈ R^{r_{d−1}×n_d}. The tensor train ranks r_µ = d(β_µ), µ = 1, ..., d−1, are defined as follows. We recall that α_J := {α_j}_{j∈J}.


Definition 3.18 (TT-ranks and singular values). Let T = T(α) ∈ R^{n_1×...×n_d} be a tensor of dimension d ∈ N. Then the TT singular values are defined as

    σ_TT := (σ^{(1)}_TT, ..., σ^{(d−1)}_TT) := sv_TT(T),
    σ^{(µ)}_TT := sv⁺(T^{(α_J)}),    J = {1, ..., µ}.

Furthermore, the TT-ranks are given by r_µ := rank^{(µ)}_TT(T) := rank_{α_{{1,...,µ}}}(T), µ = 1, ..., d−1. We omit the index TT when the context is apparent.

The corresponding family of matricizations (cf. Eq. (3.5)) is given by

    K_TT = {{1}, {1, 2}, ..., {1, ..., d−1}}.    (3.23)

The tree SVD (Theorem 3.16) for the tensor train format yields the decomposition

    T = T(α) = G_1 ⊙ Σ^{(1)} ⊙ G_2 ⊙ ... ⊙ Σ^{(d−1)} ⊙ G_d ∈ R^{n_1×...×n_d}    (3.24)

where

    G_1 = G_1(α_1, β_1) ∈ R^{n_1×r_1},
    G_µ = G_µ(β_{µ−1}, α_µ, β_µ) ∈ R^{r_{µ−1}×n_µ×r_µ},  µ = 2, ..., d−1,
    G_d = G_d(β_{d−1}, α_d) ∈ R^{r_{d−1}×n_d},
    Σ^{(µ)} = Σ^{(µ)}(β_µ, β_µ) = diag(σ^{(µ)}) ∈ R^{r_µ×r_µ},  µ = 1, ..., d−1,

subject to the following gauge conditions: for each µ = 2, ..., d−1, the node G_µ ⊙ Σ^{(µ)} is β_{µ−1}-orthogonal and Σ^{(µ−1)} ⊙ G_µ is β_µ-orthogonal⁵. Furthermore, G_1 is β_1-orthogonal and G_d is β_{d−1}-orthogonal.

⁵ These two conditions are sometimes denoted as left- and right-orthogonality.

[Figure 3.6 shows the chain of nodes G_1, G_2, G_3, G_4 with singular value vectors σ^{(1)}, σ^{(2)}, σ^{(3)} on the edges; the dashed box marks S_2 = Σ^{(1)} ⊙ G_2 ⊙ Σ^{(2)}.]

Figure 3.6: Tensor train tree SVD for dimension d = 4.

This decomposition has first been mentioned for the matrix product states format as canonical MPS [100]. It is very similar to the TT-SVD introduced in [84], except that the singular values here appear explicitly in the representation. By defining S_µ := Σ^{(µ−1)} ⊙ G_µ ⊙ Σ^{(µ)}, the above constraints can be restated as

    S_µ ⊙_{∖β_µ} S_µ = (Σ^{(µ−1)})²,    S_µ ⊙_{∖β_{µ−1}} S_µ = (Σ^{(µ)})²,    µ = 2, ..., d−1.
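For a dense tensor of moderate size, the TT tree SVD can be computed by a single leaves-to-root sweep of matrix SVDs. The following is a minimal NumPy sketch added for illustration only (the function name tt_tree_svd and all variable names are ours, not from the accompanying implementation); it assumes the input fits in memory and returns the cores G_µ together with the singular value vectors σ^{(µ)} as in Eq. (3.24).

```python
import numpy as np

def tt_tree_svd(T):
    """Sketch: canonical TT cores G_1,...,G_d and singular values s_1,...,s_{d-1}
    with T = G_1 diag(s_1) G_2 ... diag(s_{d-1}) G_d (cf. Eq. (3.24))."""
    dims, d = T.shape, T.ndim
    cores, sigmas = [], []
    M = T.reshape(dims[0], -1)              # unfolding with respect to alpha_1
    r_prev = 1
    for mu in range(d - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = max(1, int(np.sum(s > 1e-14 * s[0])))   # drop numerically zero values
        U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
        G = U.reshape(r_prev, dims[mu], r)
        if mu > 0:                          # divide out sigma^(mu-1): G_mu = diag(s_prev)^-1 U_mu
            G = G / sigmas[-1][:, None, None]
        cores.append(G[0] if mu == 0 else G)
        sigmas.append(s)
        # carry diag(s) Vt to the right and regroup the next mode
        M = (s[:, None] * Vt).reshape(r * dims[mu + 1], -1)
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1]) / sigmas[-1][:, None])
    return cores, sigmas

# quick check for a random 4 x 5 x 6 tensor (d = 3)
T = np.random.default_rng(0).standard_normal((4, 5, 6))
G, s = tt_tree_svd(T)
T_rec = np.einsum('ia,a,ajb,b,bk->ijk', G[0], s[0], G[1], s[1], G[2])
print(np.linalg.norm(T - T_rec))            # ~1e-14
```

By construction, Σ^{(µ−1)} ⊙ G_µ equals the left singular vector factor of the corresponding unfolding and is hence β_µ-orthogonal; since the σ^{(µ)} are exact singular values, the remaining gauge conditions follow as in Theorem 3.16.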

3.4.2 Tucker Tree SVD / All-Orthogonality

In Section 2.5.2, we have already considered the Tucker decomposition, also known as the HOSVD. A tensor T = T(α_1, ..., α_d) ∈ R^{n_1×...×n_d} can thereby be written as

    T = C ⊙ U_1 ⊙ ... ⊙ U_d,    C = C(β) ∈ R^{r_1×...×r_d},    U_µ = U_µ(α_µ, β_µ) ∈ R^{n_µ×r_µ},

where the ranks r_µ = d(β_µ) are defined as follows. Let therefore n_{≠µ} := ∏_{s≠µ} n_s.


Definition 3.19 (Tucker ranks and singular values). Let T = T(α) ∈ R^{n_1×...×n_d} be a tensor of dimension d ∈ N. Then the Tucker singular values are defined as

    σ_Tucker := (σ^{(1)}_Tucker, ..., σ^{(d)}_Tucker) := sv_Tucker(T),
    σ^{(µ)}_Tucker := sv⁺(T^{(α_µ)}),    T^{(α_µ)} ∈ R^{n_µ×n_{≠µ}},

while the ranks are given by r_µ := rank^{(µ)}_Tucker(T) = rank_{α_µ}(T), µ = 1, ..., d. We omit the index Tucker when the context is apparent.

The relevant matricizations are those given through the corresponding family

    K_Tucker = {{1}, {2}, ..., {d}}.    (3.25)

The tree SVD (Theorem 3.16) for the Tucker format (cf. [21]) yields

    T = C ⊙ (Σ^{(1)}, ..., Σ^{(d)}) ⊙ (U_1, ..., U_d),
    C = C(β_1, ..., β_d) ∈ R^{r_1×...×r_d},
    Σ^{(µ)} = Σ^{(µ)}(β_µ, β_µ) = diag(σ^{(µ)}) ∈ R^{r_µ×r_µ},
    U_µ = U_µ(α_µ, β_µ) ∈ R^{n_µ×r_µ},

where each U_µ is β_µ-orthogonal and for S := C ⊙ (Σ^{(1)}, ..., Σ^{(d)}) it holds

    S ⊙_{β\β_µ} S = (Σ^{(µ)})²,    µ = 1, ..., d.    (3.26)

This property is also called all-orthogonality, and has first appeared in [21, Theorem 2] for the HOSVD.
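A minimal NumPy sketch of this Tucker tree SVD in the dense setting is given below (an added illustration under the assumption that the tensor fits into memory; the names are ours). It computes the factors U_µ and singular values σ^{(µ)} via SVDs of the mode-µ unfoldings, forms the all-orthogonal core S, and divides out the singular values to obtain C.

```python
import numpy as np

def tucker_tree_svd(T):
    """HOSVD-based sketch: returns (C, Sigmas, Us); multiplying diag(sigma^(mu))
    and U_mu back onto C along every mode mu reproduces T, and the core
    S = C with all diag(sigma^(mu)) reattached is all-orthogonal (Eq. (3.26))."""
    d = T.ndim
    Us, Sigmas = [], []
    for mu in range(d):
        # mode-mu matricization T^(alpha_mu)
        T_mat = np.moveaxis(T, mu, 0).reshape(T.shape[mu], -1)
        U, s, _ = np.linalg.svd(T_mat, full_matrices=False)
        r = max(1, int(np.sum(s > 1e-14 * s[0])))
        Us.append(U[:, :r])
        Sigmas.append(s[:r])
    S = T
    for mu in range(d):                      # all-orthogonal core: S = T x_mu U_mu^T
        S = np.moveaxis(np.tensordot(Us[mu].T, np.moveaxis(S, mu, 0), axes=1), 0, mu)
    C = S
    for mu in range(d):                      # divide out the singular values per mode
        shape = [1] * d
        shape[mu] = -1
        C = C / Sigmas[mu].reshape(shape)
    return C, Sigmas, Us
```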

[Figure 3.7 shows the core C connected to the leaves U_1, ..., U_4 with singular value vectors σ^{(1)}, ..., σ^{(4)} on the edges; the dashed box marks S = C ⊙ Σ^{(1)} ⊙ ... ⊙ Σ^{(4)}.]

Figure 3.7: Tucker tree SVD for dimension d = 4, where S is all-orthogonal.

The identity ‖S‖_F = ‖T‖_F follows through the orthogonality constraints fulfilled by the U_µ, or via

    ‖S‖²_F = trace_{β_µ}(S ⊙_{β\β_µ} S) = trace_{β_µ}((Σ^{(µ)})²) = ‖Σ^{(µ)}‖²_F = ‖T‖²_F.

3.5 Subspace Projections and Truncations

In the general case, a tensor T = T(α) does not necessarily exhibit low or even finite ranks, but an approximation T̃ ≈ T may. In order to compute such an approximation, any occurring SVD in Algorithm 3 may be replaced by a truncated one. However, the computed values σ_e, e ∈ E, do then not correspond to the singular values of the truncated output T̃ = ⨀_{v∈V} N_v, such that the computation of N^σ is to be omitted.


The nodes in Algorithm 3 are assigned leaves-to-root. Algorithm 6 as defined in Section 3.5.2 realizes the converse order, root-to-leaves, performing a sequence of projections in order to decompose and truncate a tensor, as introduced in [40]. The required relation between K and the graph G is first discussed in Section 3.5.1. Algorithms proceeding in this order can however neither simultaneously compute the tree SVD nor keep the network orthogonalized, even if all involved SVDs are exact.

3.5.1 Nested Projections

As mentioned in Section 3.3.1, each hierarchical family K yields a specific tree graph G (cf. Algorithm 5), and thereby a decomposition into a network (cf. Theorem 3.16). In this section, we discuss this connection in our framework. For simplicity, we only consider the case H_{α_µ} = R^{n_µ}, n_µ = d(α_µ) ∈ N, µ = 1, ..., d. In the following, let the tensor

    T = T(α) ∈ R^{n_1×...×n_d}

be fixed. For J ⊂ D = {1, ..., d}, let U_J = U_J(α_J, γ) ∈ R^{d(α_J)×k(J)} for some γ ∉ α stem from a (truncated) SVD of T with respect to α_J, such that U_J^{(α_J)} contains the first k(J) ∈ N left singular vectors of T^{(α_J)} (cf. line 12, Algorithm 6). For H ∈ N_ψ, α_J ⊂ m(H), we define π_{α_J} analogously to [40] via

    π_{α_J}(H) := U_J ⊙ (U_J ⊙ H).

Since for all T = T(α) we have that

    π_{α_J}(T)^{(α_J)} = U_J^{(α_J),(γ)} · (U_J^{(α_J),(γ)})^T · T^{(α_J)},

the map π_{α_J} is a projection onto the span of the first k(J) left singular vectors of T^{(α_J)}. In particular (cf. Definition 3.15),

    π_{α_J}(T) ∈ {N = N(α) ∈ H_α | rank_{α_J}(N) ≤ k(J)} = T_{≤k(J),{J}}(H_α).

Similar to the functions in Lemma 2.23, this map fulfills the important property

    π_{α_J}(X ⊙ Y) = π_{α_J}(X) ⊙ Y,    π_{α_{D\J}}(X ⊙ Y) = X ⊙ π_{α_{D\J}}(Y),    (3.27)

for all X, Y ∈ N_ψ with α_J ∩ m(Y) = ∅, α_{D\J} ∩ m(X) = ∅, as carried out in the proof of the following Lemma 3.20.

Lemma 3.20. Let T = T(α) and J, S ⊂ D. If S ⊂ J or S ∩ J = ∅, then

    rank_{α_J}(π_{α_S}(T)) ≤ rank_{α_J}(T).

Proof. Let ℓ = rank_{α_J}(T). Then by definition of the rank, there are nodes X = X(α_J, ε), Y = Y(α_{D\J}, ε), H_ε = R^ℓ, such that T = X ⊙_ε Y. Assume that S ⊂ J. Under use of the associativity rules, we obtain that

    π_{α_S}(T) = U_S ⊙_γ (U_S ⊙_{α_S} T)
               = U_S ⊙_γ (U_S ⊙_{α_S} (X ⊙_ε Y))
               = (U_S ⊙_γ (U_S ⊙_{α_S} X)) ⊙_ε Y
               = π_{α_S}(X) ⊙_ε Y.

The tensor π_{α_S}(T) thus splits into X̃ = X̃(α_J, ε) = π_{α_S}(X) and Ỹ = Y. Its rank with respect to α_J can hence not have grown. If S ∩ J = ∅ ⇔ S ⊂ D \ J, then under additional use of the commutativity rules, it analogously follows that π_{α_S}(T) = X ⊙_ε π_{α_S}(Y).
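In the simple dense setting H_{α_µ} = R^{n_µ}, the projection π_{α_J} can be spelled out directly; the following NumPy sketch is added for illustration only (the name pi_J and all variables are ours):

```python
import numpy as np

def pi_J(T, H, J, k):
    """Apply pi_{alpha_J}, defined from the fixed tensor T, to a tensor H of the
    same shape: project the alpha_J-matricization of H onto the span of the
    first k left singular vectors of the alpha_J-matricization of T."""
    d = T.ndim
    rest = [ax for ax in range(d) if ax not in J]
    # matricization with respect to alpha_J (rows) versus the remaining modes
    T_mat = np.transpose(T, J + rest).reshape(int(np.prod([T.shape[j] for j in J])), -1)
    U = np.linalg.svd(T_mat, full_matrices=False)[0][:, :k]       # U_J
    H_mat = np.transpose(H, J + rest).reshape(T_mat.shape[0], -1)
    P_mat = U @ (U.T @ H_mat)                                     # U_J (U_J^T H)
    # undo the matricization
    P = P_mat.reshape([H.shape[j] for j in J] + [H.shape[a] for a in rest])
    return np.transpose(P, np.argsort(J + rest))

# example: pi over J = {1, 2} with k = 3 applied to T itself
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6, 7))
PT = pi_J(T, T, J=[0, 1], k=3)          # rank_{alpha_J}(PT) <= 3
```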


If J ∩ S = ∅, the projections π_{α_J} and π_{α_S} even commute, i.e. π_{α_J} ∘ π_{α_S} = π_{α_S} ∘ π_{α_J}. For X and Y as in Eq. (3.27), we then have

    (π_{α_J} ∘ π_{α_S})(X ⊙ Y) = π_{α_J}(X) ⊙ π_{α_S}(Y).

Since the transposition of a matrix does not change its rank, we have rank_{α_J} = rank_{α_{D\J}}. However, the projections π_{α_J} and π_{α_{D\J}} are not the same, except if applied to T itself (hence only for the first projection performed).

The sets in a hierarchical family can be ordered such that K = {J_1, ..., J_ℓ},

    J_i ⊂ J_j ∨ J_i ∩ J_j = ∅,    ℓ ≥ i > j ≥ 1.    (3.28)

Applied inductively, Lemma 3.20 hence provides that

    T̃ := (π_{α_{J_ℓ}} ∘ ... ∘ π_{α_{J_1}})(T) ∈ T_{≤k,K}(H_α).    (3.29)

In other words, rank_{α_J}(T̃) ≤ k(J) for all J ∈ K. Each set J_i may also be replaced by its complement J_i^c = D \ J_i.

Lemma 3.21 (Root-to-leaves truncation error [40]). Let {J_i}_{i=1}^k be an ordered, hierarchical family as in Eq. (3.28) and let T̃ be as in Eq. (3.29). Further, let T_best be a best approximation to T, subject to rank_{α_{J_i}}(T_best) ≤ k(J_i), i = 1, ..., k. Then

    ‖T − T̃‖ ≤ √k ‖T − T_best‖.

Proof. The statement is [40, Theorem 3.11].

Error estimates for leaves-to-root truncations exist as well, but with a different constant. The article [40] also provides further details on such sequences of projections.

3.5.2 Root-to-Leaves Truncation and Decomposition

As indicated above, each projection π_{α_J} also yields a decomposition

    π_{α_J}(T) = X ⊙ Y,    X := U_J,    Y := U_J ⊙ T.

Due to the property Eq. (3.27), this decomposition is retained in any subsequent one subject to the ordering Eq. (3.28). A sequence of projections Eq. (3.29) for an ordered family K can hence be used to construct a network N that corresponds to the graph G specified by K, as in Algorithm 4, N = hrtlt(T, J). The graph can also be calculated independently through Algorithm 5, G = rtlgraph(J), while Algorithm 6, N = rtltrunc(T, G, c) for a root c ∈ V, directly calculates the decomposition based on such a graph G. For these, we have

    N = hrtlt(T, J) = rtltrunc(T, G, 1),    G = rtlgraph(J).    (3.30)

For d = 4, for example, the sequence J = (J_1, J_2, ...) given by either

    J = ({1, 2}, {1}, {2}, {3}, {4})    or    J = ({1}, {2}, {3, 4}, {3}, {4})    (3.31)

yields the HT-decomposition Eq. (2.36) (although each with a different distribution of the edge mode labels β_i, i = 1, ..., |E|). On the other hand,

    J = ({1, 2, 3}, {1, 2}, {1})    as well as    J = ({1, 2}, {1}, {4})    (3.32)


Algorithm 4 Root-to-leaves truncation (based on a hierarchical sequence)

Input: a tensor node T = T(α) and an ordered, hierarchical sequence J_i, i = 1, ..., k (cf. Eq. (3.28))
Output: tree tensor network N corresponding to K = {J_i}_{i=1}^k such that T̃ = ⨀_{v∈V} N_v (as quasi-best approximation, if truncated)

1: procedure hrtlt(T, J)
2:   N̄ := {N̄_v}_{v=1,...,k+1}             ▷ the labels m(N̄_v), v = 1, ..., k+1, change during runtime
3:   N̄_1 ← T
4:   for i = 1, ..., k do
5:     let p ∈ {1, ..., i} be the unique index for which α_{J_i} ⊂ α_{J_p} ⊂ m(N̄_p)
6:     do a (truncated) SVD of T with respect to α_{J_i}:    ▷ T remains unchanged
         (U, s, V_t) ← SVD(T),  s = s(β_i)
7:     N̄_{i+1} ← U                         ▷ U is β_i-orthogonal
8:     N̄_p ← U ⊙ N̄_p                       ▷ cf. Theorem 3.22
9:   end for
10:  N ← N̄
11:  return N
12: end procedure

produce the TT-decomposition. Last but not least,

    J = ({1}, {2}, {3}, {4}),

regardless of any permutation of the entries, gives the Tucker decomposition. When the mode labels associated to edges are assigned as in Algorithm 5, line 10, then m(e_{J_i}) = β_i. Furthermore, for any path p = (1, ...), it holds m(p_i, p_{i+1}) < m(p_{i+1}, p_{i+2}) (assuming β_ℓ < β_j for all ℓ < j).

Theorem 3.22 (Root-to-leaves truncation). Let the network N be the output of Algorithm 4 for an ordered, hierarchical family K = {J_i}_{i=1}^k as in Eq. (3.28), as well as G the graph corresponding to N. Then K is the family specified by G in the sense of Eqs. (3.5) and (3.7). It further holds T̃ = ⨀_{v∈V} N_v for T̃ as in Eq. (3.29) and rank_{α_{J_e}}(T̃) ≤ d(m(e)) for all e ∈ E.

Proof. We prove inductively that, at the beginning of each step i, that is line 5, it holds

    ⨀_{v∈{1,...,i}} N̄_v = (π_{α_{J_{i−1}}} ∘ ... ∘ π_{α_{J_1}})(T)

and m(⨀_{v∈branch_1(j+1)} N̄_v) ∩ α = α_{J_j} for all j = 1, ..., i−1. The induction start is trivial. By the induction hypothesis, in step i, we have

    π_{α_{J_i}}(⨀_{v∈{1,...,i}} N̄_v) = π_{α_{J_i}}(N̄_p) ⊙ ⨀_{v∈{1,...,i}\{p}} N̄_v

since J_i ⊂ J_p and α_{J_p} ⊂ m(N̄_p), where π_{α_{J_i}}(N̄_p) = N̄⁺_p ⊙ N̄_{i+1}, N̄⁺_p := U ⊙ N̄_p. Further, the set of outer mode labels is not changed for any previous j < i. For j = i+1, we have branch_1(i+1) = branch_p(i+1) = {i+1} and thereby

    m(⨀_{v∈branch_1(i+1)} N̄_v) ∩ α = m(N̄_{i+1}) ∩ α = α_{J_i},

where the branches are with respect to the tree graph corresponding to the current subnetwork {N̄_1, ..., N̄⁺_p, ..., N̄_{i+1}}. This finishes the induction.


Algorithm 5 Root-to-leaf graph creation

Input: an ordered sequence of subsets J_i ⊂ D, i = 1, ..., k (cf. Eq. (3.28))
Output: graph G = (V, E, L) corresponding to the family K = {J_i}_{i=1}^k, with edge label map m, and sets b

1: procedure rtlgraph(J)
2:   V ← {1}                              ▷ vertex set
3:   E ← ∅                                ▷ edge set
4:   L ← V                                ▷ leg set
5:   m(1) ← α                             ▷ edge label function
6:   for i = 1, ..., k do
7:     let p ∈ L be the unique index for which α_{J_i} ⊂ α_{J_p} ⊂ m(p)
8:     V ← V ∪ {i+1}
9:     E ← E ∪ {{p, i+1}}
10:    m(p, i+1) ← β_i
11:    L ← L ∪ {i+1}
12:    m(p) ← m(p) \ α_{J_i}
13:    m(i+1) ← m(i+1) ∪ α_{J_i}
14:    if m(p) = ∅ then
15:      L ← L \ {p}                       ▷ remove nodes without outer mode labels from the set of legs
16:    end if
17:    b_{i+1}(p) ← α \ α_{J_i},  b_p(i+1) ← α_{J_i}    ▷ cf. Theorem 3.16
18:  end for
19:  return G = (V, E, L), m, b
20: end procedure

[Figure 3.8 shows two diagrams of the i-th truncation step: the new node N̄_{i+1} = U splits the mode labels α_{J_i} off the node N̄_p and is connected to N̄⁺_p = U ⊙ N̄_p via the new edge label β_i.]

Figure 3.8: The i-th step in the root-to-leaves truncation in Algorithm 4. Top: the dashed line within the teal area of nodes indicates that the rest of the network is in between the connected nodes. The nodes in branch_1(p) (with respect to the final graph) are not yet assigned, by which α_{J_i} = {(α_{J_i})_1, (α_{J_i})_2} ⊂ m(N̄_p). Bottom: the node N̄_{i+1} has been assigned and N̄⁺_p has been updated, forming the projected tensor T⁺ = π_{α_{J_i}}(T).


Algorithm 6 Root-to-leaves truncation

Input: a tensor node T = T(α), a tree graph G = (V, E) (as in Theorem 3.16) as well as a root node c ∈ V
Output: tree tensor network N corresponding to G such that T̃ = ⨀_{v∈V} N_v (as quasi-best approximation, if truncated)

1: procedure rtltrunc(T, G, c)
2:   N̄ := {N̄_v}_{v∈V}                     ▷ m(N̄_v), v ∈ V, changes during runtime
3:   declare N̄ and T nested variables regarding rtltrec    ▷ *as well as T̄, T̄ ← T
4:   rtltrec(c, ∅)
5:   set N ← N̄                             ▷ N now corresponds to G
6:   return N                              ▷ *now T̄ equals ⨀_{v∈V} N_v
7: end procedure

8: function rtltrec(b, P)                   ▷ |P| ≤ 1
9:   if ∅ ≠ P =: {p} then
10:    γ ← m(b, p)                          ▷ m(b, p) = m(b) ∩ m(p)
11:    δ ← b_p(b)                           ▷ b_p(b) = ⋃_{ℓ∈branch_p(b)} m(ℓ), cf. Section 3.3.1
12:    do a (truncated) SVD of T with respect to δ:
         (U, s, V_t) ← SVD(T),  s = s(γ)
13:    N̄_b ← U                              ▷ incorporates branch_p(b), U is γ-orthogonal
14:    N̄_p ← N̄_p ⊙ U                        ▷ incorporated part of graph is reduced by branch_p(b)
                                            ▷ * T̄ ← (T̄ ⊙ U) ⊙ U
15:  else
16:    N̄_b ← T                              ▷ at the start incorporates the whole set of nodes V
17:  end if
18:  for h ∈ neighbor(b), h ∉ P do          ▷ may be performed in parallel
19:    rtltrec(h, {b})                      ▷ node⁺_c(b) = neighbor(b) \ P
20:  end for                                ▷ from here on, N̄_b will not change anymore
21: end function


3.6 Operations on Tree Tensor Networks

Since the tensor T = T(α) is usually too large to work with explicitly, computations rely directly on the decomposition N by which it is represented. Similarly to the orthogonalization towards one node as in Algorithm 1, the tree SVD of T can be calculated within the network N. This specific representative of the network is also referred to as canonical or normal form.

3.6.1 Normalization of a Tree Tensor Network

Let N = {N_v}_{v∈V} be a tree tensor network. Algorithm 7 calculates the tree SVD (Theorem 3.16), or normal form, of T = ⨀_{v∈V} N_v without explicitly computing T.

Algorithm 7 Recursive normalization of a network

Input: a tree tensor network N = {N_v}_{v∈V} corresponding to a graph G = (V, E) and a root node c ∈ V
Output: the c-orthogonalized network N and its tree SVD N^σ

1: procedure normal(N, c)
2:   N̄ := {N̄_v}_{v∈V},  σ := {σ_e}_{e∈E}
3:   N ← ortho(N, c)                       ▷ N is now c-orthogonal
4:   declare N, N̄, σ nested variables regarding normalrec
5:   normalrec(c, ∅)
6:   return N, N^σ = N̄ ∪ σ                 ▷ N is still c-orthogonal
7: end procedure

8: function normalrec(b, P)                 ▷ |P| ≤ 1
9:   N̄_b ← N_b
10:  for h ∈ neighbor(b), h ∉ P do          ▷ may be performed in parallel
11:    γ ← m(h, b)                          ▷ node⁺_c(b) = neighbor(b) \ P
12:    do a QR-decomposition of N̄_b with respect to γ:
         N̄_b = Q ⊙ R,  R = R(γ, γ)
13:    N_h ← R ⊙ N_h                        ▷ see Remark 3.23
14:    V_t ← normalrec(h, {b})
15:    N̄_b ← (Σ_{h,b} ⊙ V_t) ⊙ (N̄_b ⊙ R†)   ▷ possibly shrinks the mode size d(γ)
16:  end for
17:  N_b ← N̄_b
18:  if ∅ ≠ P =: {p} then                   ▷ the first time this is reached is at a leaf
19:    γ ← m(p, b)                          ▷ m(p, b) = m(p) ∩ m(b)
20:    do an SVD of N̄_b with respect to m(b) \ γ:
         N̄_b = (U, s, V_t),  s = s(γ),  V_t = V_t(γ, γ)
21:    N̄_b, N_b ← U                         ▷ U is γ-orthogonal
22:    σ_{p,b} ← s
23:  else
24:    N_b ← N̄_b
25:  end if
26:  for h ∈ neighbor(b), h ∉ P do


27:    N̄_b ← N̄_b ⊙ Σ^{-1}_{h,b}
28:  end for
29:  if ∅ ≠ P then
30:    return V_t
31:  end if
32: end function

Remark 3.23. Algorithm 7 may be modified as follows:

• remove lines 9 and 17,
• add the operation N̄_b ← Q (equivalent to N̄_b ← N̄_b ⊙ R†) after line 13,
• replace line 15 by N̄_b ← (Σ_{h,b} ⊙ V_t) ⊙ N̄_b.

These changes cause the network to actually be h-orthogonal at the beginning of line 14 and, together with the uniqueness of the tree SVD, in particular iii), provide the validity of the algorithm.

However, the above assignment is postponed in the original version in order to be able to run the for loop in parallel, considering that N̄_b is not needed within the branch on which the recursive call normalrec(h, {b}), line 14, operates. The modified version is more stable though, since the pseudoinverse of R can be avoided.

3.6.2 Truncation of Normalized Networks

A tensor may also be truncated through operations within its decomposition.

Proposition 3.24 (Truncation based on tree SVD). Let T = T(α) ∈ H_α and let N^σ be its tree SVD with respect to the graph G = (V, E). Further, let r̃_e ≤ r_e = rank_{α_{J_e}}(T), e ∈ E. We define the truncated representation R^σ = {R^σ_ν}_{ν∈Ṽ} via

    R^σ_ν := N^σ_ν(m(e) ∈ {1, ..., r̃_e}, e ∈ E),    ν ∈ Ṽ = V ∪ E.

Then the truncated network represents the same tensor that the root-to-leaves truncation to ranks r̃_e, e ∈ E, yields,

    ⨀_{ν∈Ṽ} R^σ_ν = ⨀_{v∈V} R_v,    R = {R_v}_{v∈V} = rtltrunc(T, G, c),

where c ∈ V is an arbitrary root node (cf. Algorithm 6).

Note however that R^σ is usually not a tree SVD, unlike in the matrix case |E| = 1.

Proof. Due to Theorem 3.22 and Eq. (3.30), we only need to show that the restriction of the domains of the inner mode labels in N^σ is equivalent to a hierarchical sequence of projections π_{α_{J_i}}, i = 1, ..., |E|, where each J_i equals either J_e or its complement D \ J_e for some edge e ∈ E. The sequence must further follow the root-to-leaves order. This is however directly provided by property iii) of the tree SVD.

A direct use of the tree SVD for Proposition 3.24 is not necessary. If Ñ only differs from N regarding multiplications of Σ_{v,w} into neighboring nodes N_v or N_w, {v, w} ∈ E, as in Algorithms 3 and 7, then R̃,

    R̃_v := Ñ_v(m(e) ∈ {1, ..., r̃_e}, e ∈ E),    v ∈ V,

is likewise a root-to-leaves truncation of T = ⨀_{v∈V} N_v.

Corollary 3.25 (Independence of root). The root-to-leaves truncation to ranks r̃_e is independent of the chosen root node c, and independent of the specific root-to-leaves order of projections.


Proof. Given Proposition 3.24, there is nothing to show, since the tree SVD does not depend on any kind of root.
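For the tensor train case, Proposition 3.24 amounts to simple index slicing of the tree SVD. The following sketch is added for illustration and reuses the illustrative tt_tree_svd helper from Section 3.4.1 (both function names are ours, not the thesis implementation):

```python
import numpy as np

def tt_truncate(cores, sigmas, ranks):
    """Restrict the inner mode labels beta_mu to {1, ..., ranks[mu]} by slicing
    every core and singular-value vector accordingly (cf. Proposition 3.24)."""
    G = [c.copy() for c in cores]
    s = [v[:ranks[mu]] for mu, v in enumerate(sigmas)]
    G[0] = G[0][:, :ranks[0]]                      # first core: (n_1, r_1)
    for mu in range(1, len(G) - 1):                # middle cores: (r_{mu-1}, n_mu, r_mu)
        G[mu] = G[mu][:ranks[mu - 1], :, :ranks[mu]]
    G[-1] = G[-1][:ranks[-1], :]                   # last core: (r_{d-1}, n_d)
    return G, s

# usage: cores, sigmas = tt_tree_svd(T); cores_t, sigmas_t = tt_truncate(cores, sigmas, [2, 3])
```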

By translating the two HT sequences Eq. (3.31) in the case d = 4 into projections, we receive

    ◯_{i∈{1,...,4}} π_{α_i} ∘ π_{α_{{1,2}}} = ◯_{i∈{3,4}} π_{α_i} ∘ π_{α_{{3,4}}} ∘ ◯_{i∈{1,2}} π_{α_i}.

The TT example Eq. (3.32) on the other hand gives

    ◯_{i∈{1,4}} π_{α_i} ∘ π_{α_{{1,2}}} = π_{α_1} ∘ π_{α_{{1,2}}} ∘ π_{α_{{1,2,3}}}.


Part II

High-Dimensional Approximation


Chapter 4

Low-Rank Tensor Minimization Problems and Alternating CG

In Chapters 2 and 3, we have formalized tensor node networks and discussed related decompositions. We now apply these tools to the setting of linear least squares problems as introduced in the subsequent section, which further accompanies us throughout Chapters 5 and 6.

4.1 Preface

A large class of problems, despite their appearance in high-dimensional settings, can be traced back to conventional, possibly over- or underdetermined, linear systems. Given a (continuous) linear operator

    L : R^{n_D} → R^{m_D},    m_D, n_D ∈ N,

one seeks to recover an unknown u ∈ R^{n_D} from observed values

    y = L(u) + η ∈ R^{m_D},

where η may be some form of noise or measurement error. A convenient approach to find the solution u is to (iteratively) minimize the distance,

    ‖L(x) − y‖² → min,    (4.1)

under the restriction x ∈ S for some favored subspace S ⊂ R^{n_D}. Given a symmetric, positive-definite operator A : R^{n_D} → R^{n_D} as well as a vector g ∈ R^{n_D}, the problem to solve

    A(x) = g    (4.2)

can in turn be traced back to this minimization task, Eq. (4.1). Consider therefore that A admits a decomposition

    A = L* ∘ L,    g = L*(y),

where L* is the adjoint of L. The task to minimize the so-called energy function,

    ⟨x, A(x)⟩ − 2⟨x, g⟩ → min,    (4.3)

then yields the same solution as both Eqs. (4.1) and (4.2). For our purposes it is more convenient to assume L and y to be given, but it turns out (cf. Eq. (4.15)) that these do not need to be known explicitly in order to approach Eq. (4.2).
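This equivalence can be verified in one line (a short added check, not part of the original argument): expanding the residual with A = L* ∘ L and g = L*(y) gives

    ‖L(x) − y‖² = ⟨L(x), L(x)⟩ − 2⟨L(x), y⟩ + ‖y‖² = ⟨x, A(x)⟩ − 2⟨x, g⟩ + ‖y‖².

Since ‖y‖² does not depend on x, Eqs. (4.1) and (4.3) have the same minimizers, and for a positive-definite A the unique minimizer is precisely the solution of Eq. (4.2).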


Albeit subject to related concerns, there are two distinct reasons for the involvement of low-rank constraints and the continued or artificial¹ interpretation as a high-dimensional problem. It may be that the size of the problem

(1) n_D is too large (Section 4.3),

by which conventional approaches exceed a feasible computational complexity. Although theoretically straightforward, it may then not be possible to recover the solution exactly. Alternatively, or additionally, the system given in Eq. (4.1) may be (vastly) underdetermined,

(2) m_D ≪ n_D (Section 4.5),

such that some sort of additional regularization or assumption towards x ∈ S is required. For example, in a completion problem, the vector y consists of a subset of entries of u at positions P = {p^{(1)}, ..., p^{(m)}}, i.e. y_i = L(u)_i = u_{p^{(i)}}, i = 1, ..., m.

In this part, we assume that the unknown allows for a low-rank tensor approximation when interpreted as an element of

    R^{n_1×...×n_d} ≅ R^{n_D},    n ∈ N^d,    n_D = n_1 ··· n_d.

In the first case, we assume that L and y exhibit some low-rank tensor structure as well, provided m_D = m_1 ··· m_d, m ∈ N^d. Since the problem in its original form already has a unique solution if only L is injective, these assumptions are made in order to sustain computational feasibility.

In the second case, since the system is underdetermined, the low-rank assumption towards the solution u is essential to allow for an approximate reconstruction. For computational feasibility, we also assume that L has a certain low-rank structure, whereas y is considered a collection of m = m_D ∈ N linear measurements and not interpreted as a tensor.

In this chapter, we assume that we know which choice of fixed rank for the iterate x is reliable, and approach the problem Eq. (4.1) through alternating least squares (ALS) tensor methods. This discussion mainly serves as an introduction to Chapter 5, in which we consider the more complicated rank-adaptive case, in particular tensor completion. We focus mainly on algorithmic and structural aspects, and refer to [22, 28, 82, 88, 99] for low-rank related convergence analysis.

¹ Such as is the idea in QTT methods.

4.2 Overview over Literature

The two cases in Section 4.1 can both be approached through alternating least squares approximation. However, they behave differently in theory and might arise from various applications. Therefore, specialized methods can be found throughout the literature. While we only discuss articles related to our settings, we refer to the survey [43] for a wider overview.

4.2.1 Large-Scale Problems

Instances within class (1) mainly stem from classical problems that have been transferred to large scale, surveys of which are [4, 7]. One can apply in essence ordinary, iterative methods, such as among many others [2, 3, 5, 75], in order to solve Eq. (4.1) in this case. All operations,


for example the evaluation L(x) or the addition of vectors and multiplication with matrices, are then effectively performed exploiting the given low-rank properties. Since the rank of the iterate typically increases during such procedures, the iterate has to be truncated after each step. Most methods hence basically follow a scheme

    x⁺ = Tr_{r⁺} ∘ It_r(x),

where Tr_{r⁺} is a truncation to a suitable, updated low rank r⁺ and It_r is a series of conventional operations implicitly applicable to rank r tensors. The mapping Tr_{r⁺} itself and the most effective choice of r⁺ in each iteration then constitute central concerns.

In contrast to the methods mentioned above, there are approaches which directly rely on the tensor structure. Alternating optimization methods [22, 56, 81], such as ALS, are in their simplest forms rank preserving, and their convergence has been analyzed in particular in [29, 88, 97]. Riemannian optimization is considered a fixed-rank method as well, as in each step the iterate is retracted to the same fixed low-rank tensor manifold [57, 70]. In order to control the rank, the latter methods may be combined with a separate rank adaption, or utilize DMRG [56, 61]. In this method, two neighboring nodes N_v and N_w within the representation N (cf. Section 4.3) are temporarily combined to a so-called supercore

    N_e := N_v ⊙ N_w,    e = {v, w} ∈ E,

which is then optimized as such. Subsequently, the updated rank r⁺_e as well as N⁺_v and N⁺_w are based on N⁺_e, for example given through an SVD of the supercore. An overestimation of r_e usually only slows down the computation, yet due to effects such as overfitting, DMRG is not recommendable for underdetermined systems.

4.2.2 Tensor Recovery and Completion

The situation within class (2) on the other hand is closely related to compressed sensing, where the operator L usually fulfills the so-called tensor restricted isometry property (TRIP), as we will further discuss in Section 4.6. The problem may be approached through iterative hard thresholding [85]. Another possibility is nuclear norm minimization, which has a very strong theoretical background [11, 14, 45, 86] for the matrix case, d = 2. Instead of prescribing a fixed rank, the minimization of the rank is the target function, and the linear system the constraint. The problem is then approached through a convex relaxation of the rank,

    find argmin_x ‖x‖_* subject to ‖L(x) − y‖_2 ≤ δ,    (4.4)

which aims to minimize the rank of x for some tolerance δ, and thereby approximately recovers the sought solution u. The nuclear norm ‖x‖_*, when x is interpreted as a matrix, is given by the sum of its singular values. The tensor nuclear norm however is NP-hard to compute [35].

It appears that this approach is outperformed in practice by alternating least squares approaches [60], and the simplifications required for an adaption to tensors [38, 73, 92] do not seem to allow for an appropriate generalization [77]. These methods may even retreat to alternating optimization [50] themselves. Reweighted least squares on the other hand originates from compressed sensing [16, 19] and has since, within the framework of nuclear norm minimization, been generalized to matrix recovery [32, 76]. We discuss this method and its successful generalization to the tensor setting at length in Section 5.4. In the case of completion, where only a fraction of entries of u is available, the TRIP is not fulfilled. The condition has thus been replaced by an assumption towards u called incoherence [11, 14].


However, the related literature may be slightly misleading (in particular [14, Theorem 1.7]) concerning the general necessity of incoherence in order to provide a unique completion, as opposed to its necessity for nuclear norm minimization to succeed. Let us therefore consider the rank-one matrix

    M = z^T z ∈ R^{10×10},    z = (1, ε^{1/10}, ..., ε^{9/10}),    ε > 0,

where we treat the entry M_{1,1} (or any other) as the single unknown. The lower ε, the worse the incoherence property of M becomes.² A nuclear norm completion (Eq. (4.4) for δ = 0) will fail to reconstruct M_{1,1} = 1 for about ε < 0.05, while there is clearly a unique rank-one matrix that is consistent with the sampling. Moreover, fixed rank-one methods as well as reweighted least squares seemingly recover the missing entry for all ε > 0. Although oversimplified, the example reflects that there are certainly matrices that do not obey the incoherence property, which are however still easy to complete under low-rank assumptions.

² There are different definitions of incoherence, but it basically means that some entries of z vanish, while others are close to one.

Riemannian optimization [94] and other least squares based approaches in hierarchical formats [69, 93] are alternatives here as well. As mentioned, rank adaption for the latter methods is not easy. While often greedy heuristics are applied, there are inherent problems with such, as we discuss in Chapter 5. Most notable here are the rank increasing strategies in [41, 102]. It is important to note that we are not allowed to choose the sampled entries. Otherwise, methods such as cross-approximation for tensors are recommendable [6, 83]. For example, a rank r matrix M ∈ R^{n×n} may be recovered through r rows M_{I,:} and r columns M_{:,J} of M, since

    M = M_{:,J} · M_{I,J}^{-1} · M_{I,:},    for all I, J ⊂ {1, ..., n}, |I| = |J| = r,

as long as M_{I,J} ∈ R^{r×r} is invertible. Furthermore, if M is only nearly low-rank, then such rank r reconstructions can be proven to be an excellent approximation, provided M_{I,J} is suitable. Similar bounds also hold true for tensor cross-approximation.
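As a quick numerical illustration of this cross-approximation formula (a sketch added here; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 4
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # random rank-r matrix

I = rng.choice(n, size=r, replace=False)     # r row indices
J = rng.choice(n, size=r, replace=False)     # r column indices
# skeleton (cross) reconstruction  M = M[:, J]  M[I, J]^{-1}  M[I, :]
M_rec = M[:, J] @ np.linalg.solve(M[np.ix_(I, J)], M[I, :])

print(np.linalg.norm(M - M_rec) / np.linalg.norm(M))   # ~1e-14 if M[I, J] is well conditioned
```

In practice the index sets I and J are of course not drawn at random but chosen adaptively so that M_{I,J} stays well conditioned, which is exactly what makes the nearly low-rank case robust.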

4.3 ALS for Linear Equations under Low-Rank Constraints

In this section, we discuss the approach to the minimization problem Eq. (4.1) for the first, not underdetermined case (1) through alternating least squares, assuming that the operator L is injective (that is, A is positive-definite). In order to emphasize the tensor structure, we replace the vector of unknowns x with the tensor node

    x = N^{(α)},    N = N(α) ∈ R^{n_1×...×n_d} ≅ R^{n_D},

and the vector of observed measurements y with

    y = T^{(δ)},    T = T(δ) ∈ R^{m_1×...×m_d} ≅ R^{m_D},

for a dimension d ∈ N. Let K ⊂ {J | ∅ ≠ J ⊊ {1, ..., d}} be a family of subsets and r ∈ N^K a collection of ranks. The task Eq. (4.1) for N ∈ S := T_{≤r,K} then has the equivalent formulation

    find argmin_{N∈H_α} ‖L(N) − T‖ subject to rank_{α_J}(N) ≤ r(J), J ∈ K.    (4.5)


Since the operator L is linear, we can express its action through a tensor node L as well,

    L(N) = L ⊙ N,    (4.6)
    L = L(δ, α) ∈ R^{(m_1×...×m_d)×(n_1×...×n_d)} ≅ R^{m_D×n_D}.

As we have seen in Chapter 3, not every family K is practicable. We will therefore assume that the set is hierarchical in the sense of Definition 3.14. The tensor N is represented in a corresponding tree network N (cf. Definition 3.7). Likewise, both L and T are thought to be provided in form of low-rank decompositions L and T. To simplify allocation, we use the same letters for these tensors and the sets of nodes which represent them, such that

    L = ⨀_{v∈V_L} L_v,    N = τ(N) = ⨀_{v∈V_N} N_v,    T = ⨀_{v∈V_T} T_v,

for (not necessarily identically shaped) graphs G_L = (V_L, E_L), G_T = (V_T, E_T) and the tree G_N = (V_N, E_N) that corresponds to the hierarchical family K. Without loss of generality, we assume that K is such that the legs of G_N are singletons, i.e. L_N = {{α_µ} | µ = 1, ..., d}. Since the contraction operation τ on N is a multilinear map (cf. Remark 3.3), each single node in V_N yields a linear subproblem and update

    N⁺_v := argmin_{N_v} ‖L ⊙ N − T‖ = argmin_{N_v} ‖(L ⊙ N_{≠v}) ⊙ N_v − T‖,    (4.7)

for N_{≠v} := ⨀_{w∈V_N\{v}} N_w. In the (in practice negligible) case that N⁺_v is thereby not uniquely determined, we choose the element which provides the lowest norm ‖N_{≠v} ⊙ N⁺_v‖. The solution to each of these subproblems is given through the normal equation³

    (L ⊙ N_{≠v})^T ⊙ (L ⊙ N_{≠v} ⊙ N⁺_v) = (L ⊙ N_{≠v})^T ⊙ T.    (4.8)

³ The transpose here is not necessary since δ ≠ α is assumed, but we use it once to compare to conventional notation.

At this point we make use of the renaming operations introduced in Section 2.7. We define copies of the nodes in L and N with modified mode labels in order to simplify the equation. Let therefore β and ε be the inner mode labels (cf. Section 3.1) within the networks N and L, respectively. We denote

    L′ := L(α ↦ α′),    L′_v := L_v(α ↦ α′, ε ↦ ε′),  v ∈ V_L,    (4.9)
    N′ := N(α ↦ α′),    N′_v := N_v(α ↦ α′, β ↦ β′),  v ∈ V_N.

All nodes in N′ and L′ are meant to be (at all times) identical to those in N and L, with the exception of their altered mode labels. The components in Eq. (4.8) thereby have uniquely assigned mode labels and we can reformulate

    N′_{≠v} ⊙ L′ ⊙ L ⊙ N_{≠v} ⊙ N⁺_v = N′_{≠v} ⊙ L′ ⊙ T.    (4.10)

[Figure 4.1 depicts Eq. (4.10) as a network diagram: the left-hand side N′_{≠v} ⊙ L′ ⊙ L ⊙ N_{≠v} ⊙ N⁺_v and the right-hand side N′_{≠v} ⊙ L′ ⊙ T, annotated with the mode label sets β′\γ′, α′\γ′, δ, α\γ, β∩γ, α∩γ and α′∩γ′.]

Figure 4.1: Eq. (4.10) as a network, for γ = m(v) and γ′ = m(v)′ = m(N′_v).

Each of the contractions involving more than one node can be performed the more efficiently, the lower the ranks of (and the more similar) the occurring networks are (cf. Section 4.3.2). Often there is a slight formal and computational difference between nodes N_v which are directly connected with the linear operator L, and those which are not:

Definition 4.1 (Nodes with outer mode names). With V^{(outer)} = {v^{(outer)}_1, ..., v^{(outer)}_d} ⊂ V_N, we denote the nodes of the network N with outer mode labels, such that here α_j = m(N_{v^{(outer)}_j}) ∩ α = m(v^{(outer)}_j).


4.3.1 Arbitrary Tree Tensor Networks

The basic alternating least squares (ALS) method successively replaces the single nodes N_v with the updates N⁺_v as above (Eq. (4.10)) in so-called micro-steps M^{(v)}:

    M^{(v)}(N) := {N_w}_{w≠v} ∪ {N⁺_v},    v ∈ V_N.    (4.11)

Due to the optimality of each update, the residual declines monotonically,

    ‖L ⊙ τ(M^{(v)}(N)) − T‖ ≤ ‖L ⊙ N − T‖.

For stability reasons, before each micro-step, the decomposition {N_w}_{w∈V_N} is orthogonalized with respect to v (cf. Algorithm 1), without changing the represented tensor N. One sweep applies all micro-steps and orthogonalizations once,

    sweep(N) = ◯_{v∈V_N} (M^{(v)} ∘ ortho(·, v))(N),    (4.12)

possibly in a specific order. This process is repeated until the residual stagnates or some other termination criteria are met. Assuming the network is already orthogonalized with respect to w, it is more efficient to instead use Algorithm 2 as

    ortho(·, v) = pathqr(ortho(·, w), w, v).

Further, the computational complexity of this orthogonalization is lower if the two nodes v and w are close to each other in the graph G_N. The micro-steps are therefore executed in the same leaves-to-root order as in previous network algorithms. The vanilla version of the ALS linear solver is stated in Algorithm 8.
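The alternation over micro-steps is most easily seen in the matrix case d = 2. The following toy sketch is added purely for illustration (it is not the tree-network algorithm above; the names als_d2, U, W are ours): the iterate x = U W is optimized by alternately solving the two linear least squares subproblems in vec(U) and vec(W).

```python
import numpy as np

def als_d2(L, y, n1, n2, r, sweeps=30, seed=0):
    """Toy ALS for d = 2: minimize ||L vec(x) - y|| over x = U @ W with
    U in R^{n1 x r}, W in R^{r x n2}. Column-major vectorization is used, so
    vec(U W) = kron(W.T, I_n1) vec(U) = kron(I_n2, U) vec(W)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n1, r))
    W = rng.standard_normal((r, n2))
    I1, I2 = np.eye(n1), np.eye(n2)
    for _ in range(sweeps):
        # micro-step for U: linear least squares in vec(U)
        A_U = L @ np.kron(W.T, I1)
        U = np.linalg.lstsq(A_U, y, rcond=None)[0].reshape(n1, r, order='F')
        # micro-step for W: linear least squares in vec(W)
        A_W = L @ np.kron(I2, U)
        W = np.linalg.lstsq(A_W, y, rcond=None)[0].reshape(r, n2, order='F')
    return U, W

# synthetic test: injective L, low-rank ground truth
rng = np.random.default_rng(1)
n1, n2, r = 8, 9, 2
L = rng.standard_normal((2 * n1 * n2, n1 * n2))
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
y = L @ M.flatten(order='F')
U, W = als_d2(L, y, n1, n2, r)
print(np.linalg.norm(U @ W - M) / np.linalg.norm(M))   # typically a small residual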

4.3.2 Branch-Wise Evaluations for Equally Structured Networks

The updates in the micro-steps, Eq. (4.10), can be calculated in a computational complexity independent of the number of nodes if the networks L, N and T have identical, corresponding tree graphs G := G_L = G_N = G_T (modulo multiplicity of their legs); for example, if all networks constitute tensor trains, Fig. 4.2, or the same binary hierarchical format, Fig. 4.3. For the given hierarchical family K, one then desires that the ranks

    r(J) = rank(N^{(α_J)}),    rank(T^{(α_J)})    and    rank(L^{(δ_J ∪ α_J)})

are low for all J ∈ K. We recall that with respect to the tree G, each set J ∈ K corresponds to an edge e ∈ E as discussed in Section 3.3.1. The terms appearing in the normal equation Eq. (4.10) can be handled efficiently. For the left-hand side, we have

    N′_{≠v} ⊙ L′_{≠v} ⊙ L_{≠v} ⊙ N_{≠v} = ⨀_{h∈neighbor(v)} B^{(N′L′LN)}_{v,h},
    B^{(N′L′LN)}_{v,h} := ⨀_{w∈branch_v(h)} S^{(N′L′LN)}_w,
    S^{(N′L′LN)}_w := N′_w ⊙ L′_w ⊙ L_w ⊙ N_w.

These branch products are recursively structured,

    B^{(N′L′LN)}_{v,h} = S^{(N′L′LN)}_h ⊙ ⨀_{w∈neighbor(h)\{v}} B^{(N′L′LN)}_{h,w}.

All analogously defined terms follow this scheme, such as the right-hand side, for which we define

    B^{(N′L′T)}_{v,h} := ⨀_{w∈branch_v(h)} S^{(N′L′T)}_w,
    S^{(N′L′T)}_w := N′_w ⊙ L′_w ⊙ T_w.


Algorithm 8 Alternating least squares, linear solver

Input: tensor node networks L, N, T as in Eq. (4.6) and a root node c ∈ V (as well as iter_max ∈ N)
Output: approximate solution N to ‖L ⊙ N − T‖ → min

1: procedure linmin(L, N, T, c)
2:   set N ← ortho(N, c)                   ▷ N is now c-orthogonal
3:   set v ← c
4:   declare N, L, T, v nested variables regarding sweeprec
5:   define L′ and N′ as in Eq. (4.9)
6:   for iter = 1, ..., iter_max do         ▷ or until some tolerance is reached
7:     sweeprec(c, ∅)
8:   end for
9:   return N                              ▷ approximately L ⊙ N ≈ T
10: end procedure

11: function sweeprec(b, P)                 ▷ |P| ≤ 1
12:  for h ∈ neighbor(b), h ∉ P do          ▷ with respect to the graph G_N
13:    sweeprec(h, {b})                     ▷ node⁺_c(b) = neighbor(b) \ P
14:  end for
15:  set N ← pathqr(N, v, b)                ▷ changing N′ as well
16:  set v ← b
17:  solve and replace N_v ← N⁺_v,

       (N′_{≠v} ⊙ L′ ⊙ L ⊙ N_{≠v}) ⊙ N⁺_v = N′_{≠v} ⊙ L′ ⊙ T.

18: end function

Within each sweep, B^{(N′L′LN)}_{v,h} and B^{(N′L′T)}_{v,h} are evaluated explicitly and used as such in the micro-steps. For each single node v, Eq. (4.10) then reads as follows (cf. Figs. 4.2 and 4.3):

    W′_{≠v} ⊙ N⁺_v = H′_{≠v}    (4.13)

for the nodes

    W′_{≠v} := N′_{≠v} ⊙ L′ ⊙ L ⊙ N_{≠v} = (L′_v ⊙ L_v) ⊙ ⨀_{h∈neighbor(v)} B^{(N′L′LN)}_{v,h},    (4.14)
    H′_{≠v} := N′_{≠v} ⊙ L′ ⊙ T = (L′_v ⊙ T_v) ⊙ ⨀_{h∈neighbor(v)} B^{(N′L′T)}_{v,h}.

The mode labels of all B^{(N′L′LN)}_{v,h}, h ∈ neighbor(v), are distinct and share at most one mode label with each of N′_v, N_v, L′_v and L_v. Since not every node necessarily contains an outer mode label in α or δ, the element L_v may for example not require a contraction with N_v (cf. Fig. 4.3). The terms

    A_v := L′_v ⊙ L_v,    G_v := L′_v ⊙ T_v,    (4.15)

if not given as such (as for Eq. (4.2)), can be evaluated once beforehand,

    A = ⨀_{v∈V} A_v,  A = A(α′, α),    G = ⨀_{v∈V} G_v,  G = G(α′).

In that case, there is no need for the mode labels ε′ and δ, and we assume that the inner mode labels of the network {A_v}_{v∈V} are just given through ε. Analogous to the energy function as given in Eq. (4.3), the original problem can then be restated as

    find argmin_N  N′ ⊙ A ⊙ N − 2 · N′ ⊙ G.    (4.16)

It depends on the specific setting which approach is more useful.


4.4 Alternating CG

In order to solve the linear system Eq. (4.13), it may directly be interpreted as an ordinary matrix-vector equation

    W^{(γ),(γ)}_{≠v} (N⁺_v)^{(γ)} = H^{(γ)}_{≠v},    γ = m(N_v).    (4.17)

However, since the node N_v is to be updated again in the next sweep, it is unnecessary to find the exact solution. Furthermore, the matrix W^{(γ),(γ)}_{≠v} ∈ R^{d(γ)×d(γ)} can become quite large. It is better here to solve Eq. (4.17) (for example) via an iterative solver, as presented in [81] for the tensor train format. We consider the conjugate gradient method with a coarse tolerance. The multiplication with (the matrix) W_{≠v} can be performed efficiently, as the single multiplications with the comparatively small branch products in Eq. (4.14) are less costly. The former node N_v furthermore serves as an ideal starting value. The number of required CG steps, given a tolerance, is related to the iTRIP constant of N, as we will discuss in Section 4.6. The network N should therefore be v-orthogonal in order to allow for a better condition number of W^{(γ),(γ)}_{≠v}.
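A minimal, matrix-free sketch of such a micro-step solver is given below (an added illustration, not taken from the accompanying implementation; all names are ours): a preconditioned conjugate gradient iteration that only requires a function applying W_{≠v}, optionally the diagonal preconditioner of Section 4.4.2, and that is warm-started with the current node.

```python
import numpy as np

def micro_step_cg(apply_W, H, x0, diag_prec=None, tol=1e-2, max_steps=20):
    """Approximately solve W x = H for the vectorized node update (Eq. (4.17)).
    apply_W   : function x -> W x, realized via the branch products,
    H         : right-hand side H_{!=v} as a vector,
    x0        : current node N_v as warm start,
    diag_prec : optional diagonal of W for Jacobi preconditioning."""
    x = x0.copy()
    r = H - apply_W(x)                                  # residual
    z = r / diag_prec if diag_prec is not None else r
    p = z.copy()
    rz = r @ z
    for _ in range(max_steps):
        if np.linalg.norm(r) <= tol * np.linalg.norm(H):
            break
        Wp = apply_W(p)
        alpha = rz / (p @ Wp)
        x += alpha * p
        r -= alpha * Wp
        z = r / diag_prec if diag_prec is not None else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# toy usage with an SPD matrix standing in for W_{!=v}
rng = np.random.default_rng(1)
B = rng.standard_normal((50, 50))
W = B @ B.T + 50 * np.eye(50)
H = rng.standard_normal(50)
x = micro_step_cg(lambda v: W @ v, H, x0=np.zeros(50), diag_prec=np.diag(W))
```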

[Figure 4.2 shows the stacked tensor train networks N′, L′, L, N with the branch products B^{(N′L′LN)}_{3,2} and B^{(N′L′LN)}_{3,4} marked.]

Figure 4.2: Example for Eq. (4.13) and the micro-step M^{(v)} for v = 3. The tensor train network N is 3-orthogonal. It further holds neighbor(v) = {2, 4}. The dashed lines around the pairs indicate A_v = L′_v ⊙ L_v, v ∈ V. All nodes but N_3 (and N′_3) form W_{≠3}.

[Figure 4.3 shows the stacked hierarchical Tucker networks N′, A, N with transfer tensors N_5, N_6 and the branch products B^{(N′AN)}_{3,2}, B^{(N′AN)}_{3,4}, B^{(N′AN)}_{3,5} marked.]

Figure 4.3: Example for Eq. (4.13) and the micro-step M^{(v)} for v = 3. The hierarchical Tucker network N is 3-orthogonal. It further holds neighbor(v) = {2, 4, 5}. Here, we have replaced A_v = L′_v ⊙ L_v, v ∈ V. The nodes L_3, L′_3 are not connected with N_3, N′_3, respectively, since they have no outer mode labels, i.e. m(N_3) ∩ α = ∅. All nodes but N_3 (and N′_3) form W_{≠3}.


4.4.1 Comparison of Computational Costs

We compare the computational costs of solving the full system Eq. (4.17) and of one conjugate gradient step, if we are given A and G as in Eq. (4.15) (where A only has inner mode labels ε, and not additionally ε′). Further, for simplicity, we assume that the ranks and mode sizes are uniform, i.e. r ≡ d(β_µ), k ≡ d(ε_µ) and n ≡ d(α_µ) for each of the possible µ, for fixed k, r, n ∈ N. Thereby, each single element in {B^{(N′AN)}_{v,h}}_{h∈neighbor(v)} has the same size, and can be treated equally in the following complexity analysis.

Proposition 4.2. Let m = |neighbor(v)| be the number of neighbors of node v. Then the optimal order of complexity for one multiplication of N_v with W_{≠v} subject to all possible orders of contractions is

    O(k^{⌈m/2⌉} r^{m+1} n + k^m r^m n²).    (4.18)

The optimal order of complexity for the explicit calculation of (the matrix) W_{≠v} is

    O(k^m r^{2m} n²).    (4.19)

While we here derive an optimal order of contractions manually, the general task can be computationally exhaustive. We refer to [91] for a further discussion of approaches to this problem.

Proof. Single multiplication: The node N_v needs to be multiplied with m equally sized nodes B^{(N′AN)}_{v,h}, h ∈ neighbor(v), as well as the node A_v. The s-th multiplication with a node B^{(N′AN)}_{v,h} has cost k^{s+1} r^{m+1} n, if A_v has not already been multiplied. Otherwise, the cost is k^{m−s} r^{m+1} n. The cost of multiplying A_v after s multiplications of nodes B^{(N′AN)}_{v,h} is in fact independent of s and given as k^m r^m n². For such s, the total cost therefore is

    O( r^{m+1} n (k + ... + k^s) + k^m r^m n² + r^{m+1} n (k^{m−s} + ... + k) )
    = O( r^{m+1} k^{max(s, m−s)} n + k^m r^m n² ).

Hence, it is most efficient to choose s = ⌊m/2⌋. The optimal order of contractions is thereby independent of k, r and n.

Full matrix: In order to compute the full matrix W_{≠v}, all its components have to be contracted. Analogously, the (s−1)-th contraction with a node B^{(N′AN)}_{v,h} has cost r^{2s} k^s, if A_v has not already been multiplied. Otherwise, the cost is r^{2s} k^s n². The cost of contracting A_v after s ≥ 1 nodes B^{(N′AN)}_{v,h} have been included is r^{2s} k^m n². The total cost for such s therefore is

    O( r⁴ k² + ... + r^{2s} k^s + r^{2s} k^m n² + r^{2(s+1)} k^{s+1} n² + ... + r^{2m} k^m n² )
    = O( r^{2m} k^m n² )

and, although the order stays the same, lowest for s = m. Again, the optimal order of contractions is independent of k, r and n.

We further compare the order of computational complexity for the solution of Eq. (4.13) in the tensor train format (cf. Fig. 4.2) and the binary HT format (cf. Fig. 4.3). In the following, we assume k ≤ r ≤ n ∈ N for simplicity, as it mostly holds true. Then, in the first case, the orders of complexity for one CG step and the calculation of the full matrix in the regular case v ∈ {2, ..., d−1} with m = 2 are

    CG_TT : O(k²r²n²),    Full_TT : O(k²r⁴n²),


respectively. In the second case, for a transfer tensor v ∈ V \ V^{(outer)}, m = 3 (setting n = 1, since it does not have an outer mode), we have

    CG^{transfer}_HT : O(k²r⁴),    Full^{transfer}_HT : O(k³r⁶),

whereas for a leaf v ∈ V^{(outer)}, m = 1, the cost is

    CG^{leaf}_HT : O(krn²),    Full^{leaf}_HT : O(kr²n²).

Solving the linear system Eq. (4.13) using CG requires a certain number t of steps (related to the iTRIP, Section 4.6.2), whereas the solution via a full system of size N knowingly has a cost bounded by (1/6)N³ (using a Cholesky decomposition), which yields

    Chol_TT : O(r⁶n³).

In the second case, for a transfer tensor v ∈ V \ V^{(outer)}, m = 3 (again setting n = 1), we have

    Chol^{transfer}_HT : O(r⁹),

whereas for a leaf v ∈ V^{(outer)}, m = 1, the cost is

    Chol^{leaf}_HT : O(r³n³).

In all cases, for k ≤ r ≤ n, the complexity of applying a Cholesky decomposition exceeds the one to calculate W_{≠v}. We have certainly counted conventionally and ignored the overhead of smaller computations or parallelization. Nonetheless, in all three cases, even if the maximal number of required CG steps were performed, the total cost would still be significantly lower. In practice however, we observe that usually O(1) steps are sufficient in the sense that the outputs of the corresponding algorithms are then nearly identical to those obtained through exact solutions in each micro-step. Apart from that, the size of the node W_{≠v} may simply exceed memory capacities. The cost to calculate H_{≠v} is equal in both cases and negligible.
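For a rough numerical impression (an added illustration with arbitrary example values, ignoring all constants): take n = 100, r = 20 and k = 5 in the TT case. Then one CG multiplication costs on the order of k²r²n² = 25 · 400 · 10⁴ = 10⁸ operations, assembling the full matrix W_{≠v} costs on the order of k²r⁴n² = 4 · 10¹⁰, and a Cholesky factorization of the resulting system of size N = r²n = 4 · 10⁴ costs on the order of r⁶n³ ≈ 6.4 · 10¹³, which illustrates why a few matrix-free CG steps are preferable.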

4.4.2 Preconditioning

A good preconditioner for the CG method in order to solve Eq. (4.13) is the diagonal of

W_{≠v} = W_{≠v}(γ, γ),   γ = m(N_v).

Let neighbor(v) = {h_1, …, h_m}. If N_v has an outer mode label, v = v^{(outer)}_j, then m(N_v) = {α_j, β_{i_1}, …, β_{i_m}} for certain i, j and ℓ such that m(B^{(N′L′LN)}_{v,h_k}) = {β′_{i_k}, ε_{ℓ_k}, β_{i_k}}, k = 1, …, m. When and after applying the (partial) diagonal operator, we here ignore⁴ any renaming ·′. Thereby

diag_γ(W_{≠v}) = diag_{α_j}(A_v) ⊙ ⊙_{k=1}^{m} diag_{β_{i_k}}(B^{(N′L′LN)}_{v,h_k}).

Further, recalling that A_v does not contain mode labels in ε′, we have (cf. Eq. (2.33))

(diag_{α_j}(A_v))^{(ε)} = (L′_v(α′_j ↦ α_j) ⊙_{α_j}^{δ_j} L_v)^{(ε′, ε)}.

The preconditioner can both be calculated and applied at negligible order of complexity, even though the full node diag_γ(W_{≠v}) needs to be evaluated. In the other case, v ∉ V^{(outer)}, i.e. α ∩ m(N_v) = ∅, the diagonal of W_{≠v} is calculated analogously as

diag_γ(W_{≠v}) = A_v ⊙ ⊙_{k=1}^{m} diag_{β_{i_k}}(B^{(N′L′LN)}_{v,h_k}).

⁴ This means diag_γ(H) = diag_γ(H(γ′ ↦ γ)) for any node H = H(γ′, γ).
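As an illustration of how such a diagonal is used, the following Python sketch implements a generic diagonally (Jacobi) preconditioned CG iteration. It is not taken from the thesis and works on plain arrays: W stands for an already contracted, symmetric positive (semi)definite normal-equation matrix such as W_{≠v}, and diag_precond for its diagonal; both are assumptions of this illustration.

import numpy as np

def pcg(W, b, diag_precond, x0=None, tol=1e-10, max_iter=200):
    """Conjugate gradient for W x = b with a diagonal preconditioner."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - W @ x
    z = r / diag_precond          # apply the (inverse) diagonal preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Wp = W @ p
        alpha = rz / (p @ Wp)
        x += alpha * p
        r -= alpha * Wp
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = r / diag_precond
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # update search direction
        rz = rz_new
    return x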


4.5 ALS for Tensor Recovery and Completion

In this section, we consider the approach to the problem Eq. (4.1) for the second, underdetermined case (2) through alternating least squares. We continue to denote

L(N) = L ⊙ N,   L = L(δ, α) ∈ ℝ^{m×(n_1×…×n_d)} ≅ ℝ^{m×n_D}.

The vector (node) y = y(ζ) ∈ ℝ^m is assigned a singleton mode label ζ and interpreted as a collection of m linear measurements. Analogous to Eq. (4.5), this leads to the tensor recovery problem

find  argmin_{N ∈ H_α} ‖L ⊙ N − y‖   subject to   rank_{α_J}(N) ≤ r(J),  J ∈ K.   (4.20)

We denote the sought solution u as node M,

u = M^{(α)},   M = M(α) ∈ ℝ^{n_1×…×n_d} ≅ ℝ^{n_D}.

In the noise-free case, we hence have y = L ⊙ M. The unique solvability of this problem can be related to the tensor restricted isometry property, as we discuss in Section 4.6. Note that

‖L ⊙ N − y‖^2 = Σ_{i=1}^{m} (L(ζ = i) ⊙ N − y(ζ = i))^2.

We consider the same decomposition of N corresponding to a hierarchical family K as in Section 4.3. The network L is as well assumed to be a tree tensor network, with the exception of an additional leg C_ζ := {v ∈ V_L | ζ ∈ m(L_v)},

L = ⊙^{ζ}_{v∈V_L} L_v,   L_{≠v} := ⊙^{ζ}_{w∈V_L\{v}} L_w,   (4.21)

whereas the graph (V_L, E_L) (without legs) still forms a tree. In Fig. 4.4, a typical case C_ζ = {v ∈ V | m(N_v) ∩ α ≠ ∅} given V = V_N = V_L is shown. The node y is not further decomposed.

4.5.1 Branch-Wise Evaluations for Equally Structured Networks

As in Eq. (4.9), we define copies of N and L with modified mode labels,

L′ := L(α ↦ α′),   L′_v := L_v(α ↦ α′, ε ↦ ε′),   v ∈ V_L.   (4.22)

Each single node v ∈ V_N again yields a linear subproblem and update

N⁺_v := argmin_{N_v} ‖L ⊙ N − y‖ = argmin_{N_v} ‖(L ⊙ N_{≠v}) ⊙ N_v − y‖.

Likewise, its solution is given through the following normal equation,

N′_{≠v} ⊙ L′ ⊙ L ⊙ N_{≠v} ⊙ N⁺_v = N′_{≠v} ⊙ L′ ⊙ y.   (4.23)

Here however, the contraction associated to the singleton mode label ζ cannot be partitioned into smaller ones. The number of measurements d(ζ) = m may hence only be moderately large.


Figure 4.4: Example for the left-hand side of Eq. (4.23) for V = V_N = V_L and the micro-step M(v) for v = 3. The hierarchical Tucker network N is 3-orthogonal. It further holds neighbor(v) = {2, 4, 5}. Each node of L shares one leg corresponding to ζ. The nodes L_3, L′_3 are not connected with N_3, N′_3, respectively, since they have no outer mode labels, i.e. m(N_3) ∩ α = ∅. All top nodes but N_3 form Z_{≠3}, all bottom nodes (but N′_3) form Z′_{≠3} as in Eq. (4.25).

As opposed to the situation for linear equations as in Section 4.1, it is not efficient to combine each pair L′_v ⊙^{ζ} L_v, because the mode ζ remains. We again assume that the graphs of L and N are the same, now with the exception of the leg C_ζ, i.e. V_L = V_N, E_L = E_N and L_L \ C_ζ = L_N. Analogous recursions as in Section 4.3.2 hold true,

L_{≠v} ⊙ N_{≠v} = ⊙^{ζ}_{h∈neighbor(v)} B^{(LN)}_{v,h},
B^{(LN)}_{v,h} := ⊙^{ζ}_{w∈branch_v(h)} S^{(LN)}_w,   (4.24)
S^{(LN)}_w := L_w ⊙ N_w.

The renamed counterparts are given through B^{(LN)}_{v,h}′ = B^{(N′L′)}_{v,h} := ⊙^{ζ}_{w∈branch_v(h)} N′_w ⊙ L′_w and follow the same scheme. For each single node v ∈ V, Eq. (4.23) is then restated as follows:

Z′_{≠v} ⊙ Z_{≠v} ⊙ N⁺_v = Z′_{≠v} ⊙ y,   (4.25)
Z_{≠v} = Z_{≠v}(ζ, γ) := L ⊙ N_{≠v} = L_v ⊙ ⊙^{ζ}_{h∈neighbor(v)} B^{(LN)}_{v,h},   (4.26)

for γ = m(N_v). Similarly to Section 4.4.1, there is an optimal order of contractions, which shows that the CG method is preferable. Here however, it is more efficient to first apply all factors in Z_{≠v}, and afterwards Z′_{≠v}. As before, the number of required CG steps, given a tolerance, depends on the iTRIP of the network N, as we will discuss in Section 4.6. Furthermore, as in Section 4.4.2, the diagonal preconditioner can easily be calculated as

diag_γ(Z′_{≠v} ⊙ Z_{≠v}) = Z_{≠v} ⊙_{ζ}^{γ} Z_{≠v}   (4.27)
 = (L′_v(α′_j ↦ α_j) ⊙_{ζ,α_j} L_v) ⊙ ( ⊙^{ζ}_{h∈neighbor(v)} B^{(LN)}_{v,h}′(β′_{i_h} ↦ β_{i_h}) ⊙_{ζ,β_{i_h}} B^{(LN)}_{v,h} ),   (4.28)

where we again ignored the renaming ·′ when and after applying the diagonal operator.


4.5.2 Tensor Completion

Tensor completion is a special case of the recovery problem as in Eq. (4.20). Here, based on a subset of observed entries y = y(ζ) ∈ ℝ^m (ignoring noise), one seeks to reconstruct the remaining, missing entries of u = M^{(α)}. This means that we have

y(ζ = i) = M(α = p^{(i)}),   i = 1, …, m,   (4.29)

for an index set

P = {p^{(1)}, …, p^{(m)}} ⊂ Ω_α = {1, …, n_1} × … × {1, …, n_d}.

In machine learning terminology, the pair ({M_p}_{p∈P}, P) is commonly called training or sampling set. Using such, we perform a least squares fit

find  argmin_{N∈H_α} Σ_{p∈P} (N(α = p) − M(α = p))^2   subject to   rank_{α_J}(N) ≤ r(J),  J ∈ K.   (4.30)

We can express Eq. (4.29) with an operator L = L(ζ, α) ∈ {0, 1}^{m×n_D} through y = L ⊙ M. Its entries are given by

L(ζ = i, α = x) = 1 if x = p^{(i)}, and 0 otherwise,   (4.31)

for i = 1, …, m and x ∈ Ω_α. The operator L can be decomposed into a rank 1 representation L = {L_v}_{v∈V^{(outer)}}, for nodes v^{(outer)}_j with outer mode labels (cf. Definition 4.1),

L_{v^{(outer)}_j} = L_{v^{(outer)}_j}(ζ, α_j) ∈ {0, 1}^{m×n_j},   j = 1, …, d,

where

L_{v^{(outer)}_j}(ζ = i, α_j = k) = 1 if k = p^{(i)}_j, and 0 otherwise,

for i = 1, …, m and k = 1, …, n_j. All remaining nodes can be omitted or, for simplicity, be set as L_v = L_v(ζ) ≡ 1, v ∉ V^{(outer)}, such that L = ⊙^{ζ}_{v∈V^{(outer)}} L_v = ⊙^{ζ}_{v∈V} L_v. For each fixed j ∈ D, the sets

[m]^{(j,ℓ)}_P := {i ∈ {1, …, m} | p^{(i)}_j = ℓ}   (4.32)
 = {i ∈ {1, …, m} | L_{v^{(outer)}_j}(ζ = i, α_j = ℓ) = 1}

are pairwise disjoint, for ℓ = 1, …, n_j. Thereby, each slice number ℓ of N_{v^{(outer)}_j} can be optimized independently, i.e.

N⁺_{v^{(outer)}_j}(α_j = ℓ) = argmin_{N_{v^{(outer)}_j}(α_j=ℓ)} ‖(L(ζ ∈ [m]^{(j,ℓ)}_P, α_j = ℓ) ⊙ N_{≠v^{(outer)}_j}) ⊙ N_{v^{(outer)}_j}(α_j = ℓ) − y(ζ ∈ [m]^{(j,ℓ)}_P)‖.

For every i ∈ [m]^{(j,ℓ)}_P, with Z as in Eq. (4.26), we obtain

Z_{≠v^{(outer)}_j}(ζ = i, α_j = ℓ) = (L_{≠v^{(outer)}_j} ⊙ N_{≠v^{(outer)}_j})(ζ = i) · L_{v^{(outer)}_j}(ζ = i, α_j = ℓ)   (the last factor equals 1)
 = ⊙_{w ≠ v^{(outer)}_j} (L_w(ζ = i) ⊙ N_w)
 = ⊙_{k∈D\{j}} N_{v^{(outer)}_k}(α_k = p^{(i)}_k) ⊙ ⊙_{v∉V^{(outer)}} N_v,   (4.33)


for D = {1, …, d}. Analogously, for u ∉ V^{(outer)}, we have

Z_{≠u}(ζ = i) = ⊙_{k∈D} N_{v^{(outer)}_k}(α_k = p^{(i)}_k) ⊙ ⊙_{v∉V^{(outer)}∪{u}} N_v.

The sparsity of the nodes within L can also be applied to effectively compute the branch products B^{(LN)}_{v,h} defined in Eq. (4.24) as

B^{(LN)}_{v,h}(ζ = i) = ⊙_{w∈branch_v(h)} S^{(LN)}_w(ζ = i).

For the single terms, we in turn have

S^{(LN)}_{v^{(outer)}_j}(ζ = i) = L_{v^{(outer)}_j}(ζ = i) ⊙ N_{v^{(outer)}_j}
 = L_{v^{(outer)}_j}(ζ = i, α_j = ℓ) ⊙ N_{v^{(outer)}_j}(α_j = ℓ)
 = N_{v^{(outer)}_j}(α_j = ℓ),   ∀ i ∈ [m]^{(j,ℓ)}_P.   (4.34)

Any multiplications with the operator L or parts of its decomposition can hence be performedimplicitly.
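To make this implicit evaluation concrete, the following Python sketch shows the idea for the special case of a plain tensor train (a chain of cores with shapes (r_{j−1}, n_j, r_j)); names, shapes and the core ordering are illustrative assumptions and not the thesis' tensor node arithmetic. Row i of the assembled matrix only uses the slices evaluated at the sampled multi-index p^{(i)}, so L is never formed explicitly, mirroring Eqs. (4.33)–(4.34).

import numpy as np

def tt_completion_lhs(cores, samples, j):
    """Row-wise subproblem matrix for TT core j in ALS completion.

    cores   : list of d arrays, cores[k] of shape (r_k, n_k, r_{k+1}), r_0 = r_d = 1.
    samples : integer array of shape (m, d) holding the sampled multi-indices p^(i).
    Returns Z of shape (m, r_j * n_j * r_{j+1}) such that Z @ cores[j].ravel()
    evaluates the represented tensor at the samples.
    """
    m, d = samples.shape
    rj, nj, rj1 = cores[j].shape
    Z = np.zeros((m, rj * nj * rj1))
    for i, p in enumerate(samples):
        left = np.ones((1, 1))
        for k in range(j):                    # sampled slices left of core j
            left = left @ cores[k][:, p[k], :]
        right = np.ones((1, 1))
        for k in range(d - 1, j, -1):         # sampled slices right of core j
            right = cores[k][:, p[k], :] @ right
        e = np.zeros(nj)
        e[p[j]] = 1.0                         # selects slice alpha_j = p_j^(i)
        Z[i] = np.kron(np.kron(left.ravel(), e), right.ravel())
    return Z

# The micro-step then reads vec(core_j)^+ = argmin ||Z x - y||, e.g. via
# np.linalg.lstsq(Z, y, rcond=None); only rows with p_j^(i) = l contribute
# to slice l, which is the slice-wise decoupling described above.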

Figure 4.5: Example for the left-hand side of Eq. (4.23) for the special case of tensor completion Eq. (4.29), and the micro-step M(v) for v = 3. The hierarchical Tucker network N is 3-orthogonal. It further holds neighbor(v) = {2, 4, 5}. Each node of L shares one leg corresponding to ζ. There are for example no nodes L_3 and L′_3, since N_3 has no outer mode labels, i.e. m(N_3) ∩ α = ∅. All top nodes but N_3 form Z_{≠3}, all bottom nodes (but N′_3) form Z′_{≠3} as in Eq. (4.25).

4.6 The Tensor Restricted Isometry Property

The restricted isometry property (RIP), as it is originally introduced for matrix vector products in [13], describes how close a matrix is to an isometry when restricted to operate only on sparse vectors. It also appears within the framework of matrix recovery when approached through compressed sensing [87]. In that setting, singular values are assumed to be sparse, which implies a low rank. The following definition of the tensor restricted isometry property (TRIP) can in its original formulation be found in [85].

Definition 4.3 (TRIP). We say the operator L (cf. Section 4.5) fulfills the TRIP with constant 0 ≤ δ_r < 1 for a family K and rank r ∈ ℕ^K if there exists ρ > 0 such that

(1 − δ_r)‖N‖^2 ≤ ρ‖L(N)‖^2 ≤ (1 + δ_r)‖N‖^2   (4.35)

for all N ∈ H_α with rank_{α_J}(N) = r(J) for all J ∈ K.

In common literature, the weight parameter ρ does not appear explicitly. Instead, the operator L is (assumed to be) rescaled to yield an optimal constant δ_r. Naturally, the definition can be applied to any other linear operator as well. Concerning the representation of L by L = L(ζ, α): if L^{(ζ),(α)} has full rank, it fulfills the TRIP with constant δ_r < 1 for every rank r. The interesting case is given when m < n_D = n_1 ⋯ n_d. The problem setting Eq. (4.20) has a unique minimizer, which hence equals M (cf. [85]), if the TRIP for L is fulfilled for rank 2r and it holds r_J = rank_{α_J}(M) for all J ∈ K. This does however not imply that one can easily find that minimizer. Furthermore, in most cases M will only approximately be low-rank, or the measurements are disturbed by some noise η, such that stronger assumptions are required.

4.6.1 Relation to Alternating CG

For a matrix B ∈ ℝ^{m×k}, m > k, we define the condition number with respect to the Euclidean norm as the quotient of the largest and the k-th singular value of B, i.e.

κ_2(B) = σ_1(B)/σ_k(B) = max_{‖x‖_2=1} ‖Bx‖_2 / min_{‖x‖_2=1} ‖Bx‖_2.   (4.36)

In the linear minimization problem Eq. (4.25) to be solved, it plays a significant role regarding the convergence speed of CG. Further, it is related to the restricted isometry property:

Proposition 4.4. Let the family K correspond to the network N (cf. Section 3.3.1). Let N further be v-orthogonal. Then the TRIP of L (cf. Section 4.5) with constant δ_r implies

κ_2(Z^{(ζ),(γ)}_{≠v})^2 ≤ (1 + δ_r)/(1 − δ_r)

for Z_{≠v} := L ⊙ N_{≠v} and γ = m(N_v) (cf. Eq. (4.26)).

Note that this bounds the condition number of W_{≠v} = Z′_{≠v} ⊙ Z_{≠v} as well, i.e.

κ_2(W^{(γ),(γ)}_{≠v}) ≤ (1 + δ_r)/(1 − δ_r).

Accordingly, the restricted isometry property can also be expressed using the symmetric A = L′ ⊙ L.

Proof. We abbreviate the TRIP Eq. (4.35) as (1 − δ_r)c_1 ≤ ρ c_2 ≤ (1 + δ_r)c_1. Firstly, due to the orthogonality constraints, we have c_1 = ‖N‖^2 = ‖N_v‖^2. For

K(N_v) := ‖Z_{≠v} ⊙ N_v‖^2 / ‖N_v‖^2 = ‖L ⊙ N‖^2 / ‖N‖^2 = c_2/c_1,   (4.37)


let a := max_{N_v ≠ 0} K(N_v) and b := min_{N_v ≠ 0} K(N_v). Given the TRIP with constant δ_r, it follows by definition that

κ_2(Z^{(ζ),(γ)}_{≠v})^2 = a/b ≤ ρ^{-1}(1 + δ_r) / (ρ^{-1}(1 − δ_r)) = (1 + δ_r)/(1 − δ_r),

which was to be shown.

The TRIP constant δr hence bounds the number of (not preconditioned) CG iterationsrequired to reach a certain tolerance in micro-step M(v) if N is v-orthogonal.

4.6.2 The Internal Tensor Restricted Isometry Property

For tensor completion (cf. Section 4.5.2), the TRIP is generally not fulfilled, since one can easily find a low-rank N with L ⊙ N = N|_P ≡ 0. The condition number of Z_{≠v} can however equivalently be formulated through a so-called internal restricted isometry property (cf. [42]), which depends on the network N itself.

Definition 4.5 (Internal TRIP). We say N ∈ H_α fulfills the internal TRIP in mode v with respect to the operator L and constant δ_{v,N} if there exists ρ > 0 such that

(1 − δ_{v,N})‖N_{≠v} ⊙ N_v‖^2 ≤ ρ‖L(N_{≠v} ⊙ N_v)‖^2 ≤ (1 + δ_{v,N})‖N_{≠v} ⊙ N_v‖^2

for all N_v = N_v(γ) with γ = m(N_v).

Note that the constant δ_{v,N} still depends indirectly on the rank of N, and it holds δ_{v,N} ≤ δ_r for δ_r as in Definition 4.3. The internal restricted isometry property (iTRIP) is a much weaker assumption with likewise weaker implications, but it is closely related to the condition number of Z_{≠v}.

Corollary 4.6. The iTRIP in mode v with constant δ_{v,N} as in Definition 4.5 is equivalent to

κ_2(Z^{(ζ),(γ)}_{≠v})^2 ≤ (1 + δ_{v,N})/(1 − δ_{v,N}),

provided N is v-orthogonal.

Proof. The direction “⇒” is implied by Proposition 4.4 and δ_{v,N} ≤ δ_r. For the opposite implication “⇐”, we define ρ = (1 − δ_{v,N})/b (cf. Eq. (4.37)). Then, using the same notation as in the proof of Proposition 4.4, we obtain

ρ c_2 ≤ ((1 − δ_{v,N})/b) · a · c_1 ≤ (1 + δ_{v,N}) c_1   and   ρ c_2 ≥ ((1 − δ_{v,N})/b) · b · c_1 = (1 − δ_{v,N}) c_1,

which was to be shown.

Hence, the iTRIP is fulfilled for N if and only if the matrix Z^{(ζ),(γ)}_{≠v} has full column rank, or equivalently, the map N_v ↦ L ⊙ N is injective.
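Corollary 4.6 directly suggests a numerical check. The following Python sketch (an illustration, not part of the thesis; Z is assumed to be available as a plain matricization of Z_{≠v}) returns the smallest admissible iTRIP constant, obtained from the extreme singular values via δ = (κ_2^2 − 1)/(κ_2^2 + 1), and reports δ = 1 if the column rank is deficient.

import numpy as np

def itrip_constant(Z):
    """Smallest delta such that (1-delta)||x||^2 <= rho ||Z x||^2 <= (1+delta)||x||^2
    can hold for some rho > 0 (cf. Corollary 4.6)."""
    s = np.linalg.svd(Z, compute_uv=False)
    smin, smax = s[-1], s[0]
    if smin == 0.0:
        return 1.0                      # Z has no full column rank: iTRIP violated
    kappa2 = (smax / smin) ** 2
    return (kappa2 - 1.0) / (kappa2 + 1.0)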

Let j ∈ D be fixed. In Definition 4.1, we introduced the set V^{(outer)} of nodes that are contained in legs, such that N_{v^{(outer)}_j} contains the (outer) mode label α_j. Under mild conditions on the sampling set P, the iTRIP for the sampling operator (·) ↦ (·)|_P is fulfilled for generic low-rank tensors:

Proposition 4.7 (Likelihood of the iTRIP). Let v ∈ V and γ := m(N_v). Further, if v = v^{(outer)}_j, assume |[m]^{(j,ℓ)}_P| ≥ d(γ ∩ β) for all ℓ = 1, …, n_j, and if otherwise v ∉ V^{(outer)}, then assume m = |P| ≥ d(γ). Then the iTRIP in mode v is fulfilled for a constant δ_{v,N} < 1 for almost all rank r tensors N = ⊙_{w∈V} N_w (with respect to the Lebesgue measure on the representation space {N_w}_{w∈V} ∈ D).


Proof. For j = 1, …, d, each Z_{≠v}(ζ ∈ [m]^{(j,ℓ)}_P, α_j = ℓ)^{(ζ),(γ∩β)} ∈ ℝ^{|[m]^{(j,ℓ)}_P| × d(γ∩β)} generically has full column rank (cf. Eq. (4.33)). Hence Z^{(ζ),(γ)}_{≠v} likewise has full column rank, as the matrix is block-diagonal up to a permutation of the indices ζ ∈ [m]^{(j,ℓ)}_P, ℓ = 1, …, n_j. The proof for v ∉ V^{(outer)} is analogous.

So the iTRIP, even for tensor completion, can usually be assumed to be fulfilled, yet possibly with δ_{v,N} ≈ 1. We will further show for the matrix completion case in Proposition 5.16 that it does not behave well under perturbation. Generalized, this means that low-rank tensors N without the iTRIP for the sampling operator can be densely scattered.


Chapter 5

Stable ALS for Rank-Adaptive Tensor Approximation and Recovery

In Chapter 4, in particular Section 4.5.2, we have discussed fixed-rank methods for tensor recovery and completion. Adapting the multivariate, discrete rank of the unknown iterate is a vital part of such tasks, yet heuristics tend to be insufficient. We first introduce a concept of stability for model complexity calibration on the basis of simple matrix completion, which aims to resolve problems concerning rank adaption during alternating least squares tensor completion. This initial part does not make use of tensor node arithmetic, while Sections 5.5 and 5.6 generalize these ideas to tree tensor networks.

5.1 Stability for Complexity Calibration in Iterative Optimization Methods

Although rank adaption is simpler for matrices, as there is only one rank to find, it still exhibits one major issue with ALS. Each single step of the optimization depends on the current rank, but ignores the singular values. Yet the latter provide continuous and considerably more distinguished levels of complexity.

5.1.1 Introduction through Matrix Completion

A simple instance of the situation in Section 4.5.2 is matrix completion. In this setting, one wants to recover a matrix M ∈ ℝ^{n_1×n_2}, which is only observable at

{M_p}_{p∈P}   for   P = {p^{(1)}, …, p^{(m)}} ⊂ Ω := {1, …, n_1} × {1, …, n_2}.

The sampling set P is assumed to be given, and may not be enlarged.

We first assume once again that we know the approximate rank r ∈ ℕ of M, and that this value is sufficiently small compared to the cardinality m of P. In other words, after the first r entries, the singular values of M become sufficiently small. The minimization problem Eq. (4.30) then becomes

find  argmin_{A∈ℝ^{n_1×n_2}} ‖A − M‖_P   subject to   rank(A) ≤ r,


where ‖B‖_P^2 := Σ_{p∈P} B_p^2 for matrices B. We emphasize here that the alternating least squares method relies on a specific, rank dependent data model defined through

D_r := ℝ^{n_1×r} × ℝ^{r×n_2},   τ_r : D_r → ℝ^{n_1×n_2},   τ_r(X, Y) = XY.

Similar to Remark 3.3, for each rank r, we call τ_r the representation map and D_r the data space. For tensor node networks, this pair is implicitly given through the structure (cf. Remark 3.3) and rank constraints towards the tensor network N, whereas here it plays a particular role as a function itself. Since every matrix has a unique rank, the target space ℝ^{n_1×n_2} can be partitioned into

ℝ^{n_1×n_2} = ⋃_{r=0}^{min(n_1,n_2)} T_r,   T_r := {A ∈ ℝ^{n_1×n_2} | rank(A) = r},

and for T_{≤r} = T_{≤r,K}(ℝ^{n_1×n_2}), K = {{1}} (cf. Definition 3.15), we have that (cf. Eq. (3.10))

range(τ_r) = ⋃_{r̃ ≤ r} T_{r̃} = T_{≤r}.

The minimization problem can, by means of A = τ_r(X, Y), hence be restated as

find  argmin_{(X,Y)∈D_r} ‖τ_r(X, Y) − M‖_P.

One sweep of ALS (cf. Eq. (4.12)) applies the two optimization methods M^{(1)}, M^{(2)} (cf. Eq. (4.11)), called micro-steps, given through

M^{(1)}_r(X, Y) := (argmin_X ‖XY − M‖_P, Y),   (5.1)
M^{(2)}_r(X, Y) := (X, argmin_Y ‖XY − M‖_P).   (5.2)

Formally, for each value of the matrix rank r, each single M^{(1)}_r, M^{(2)}_r is a different function.
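For concreteness, the following Python sketch implements the two micro-steps Eqs. (5.1)–(5.2) in the most direct way; representing the sampling set P as a boolean mask is an assumption of this illustration, and the sketch is not the thesis' implementation. Each row of X, respectively column of Y, is updated by an ordinary least squares fit on its observed entries.

import numpy as np

def micro_step_X(X, Y, M, mask):
    """M^(1): update X row-wise, argmin_X ||XY - M||_P (Eq. (5.1))."""
    Xn = X.copy()
    for i in range(M.shape[0]):
        cols = np.flatnonzero(mask[i])            # observed entries in row i
        if cols.size:
            Xn[i], *_ = np.linalg.lstsq(Y[:, cols].T, M[i, cols], rcond=None)
    return Xn

def micro_step_Y(X, Y, M, mask):
    """M^(2): update Y column-wise, argmin_Y ||XY - M||_P (Eq. (5.2))."""
    Yn = Y.copy()
    for j in range(M.shape[1]):
        rows = np.flatnonzero(mask[:, j])         # observed entries in column j
        if rows.size:
            Yn[:, j], *_ = np.linalg.lstsq(X[rows], M[rows, j], rcond=None)
    return Yn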

For most realistic applications, one does not assume to know the approximate rank of M, as opposed to Chapter 4. The appropriate model complexity is a matter of the quality and magnitude of P with respect to M. Yet in the general case, the missing structure of the given data hardly allows one to obtain knowledge about this relation. Therefore, since overestimating the model complexity (i.e. the rank) ultimately leads to flawed results, a cautious learning process is required to adapt it. For example, [41, 102] suggest rank increasing strategies, slowly increasing the complexity. Due to the difficult, nonconvex nature of the problem, one often does not expect to be able to find the global minimizer for any one, fixed rank, but a satisfying local one.

Each adaption of the rank during the optimization causes a change between data spaces D_r. Intuitively, considering that the generated spaces T_r have pairwise distance 0 within ℝ^{n_1×n_2}, one would want a change of rank not to have a large impact, given the problematic nature of overfitting. This, however, holds for neither M^{(1)} nor M^{(2)}, even though arbitrarily small perturbations of the iterate A = τ_r(X, Y) may change its rank. In this context, the following concept of stability under calibration of model complexity for ill-posed, or underdetermined, systems is proposed in [42]:


Definition 5.1 (Stability [42]). Let M be a method that for each rank r provides a function M_r : D_r → D_r (the optimization method for fixed rank). We define the following properties:

• M is called representation independent, if τ_r(M_r(D)) = τ_r(M_r(D̃)) for all r and D, D̃ ∈ D_r with τ_r(D) = τ_r(D̃). We then define τ_r^{-1} to map to one possible representation (circumventing the use of equivalence classes).

• M is called fixed-rank stable, if it is representation independent and, for any fixed rank r, the map τ_r ∘ M_r ∘ τ_r^{-1} : T_r → ℝ^Ω is continuous.

• M is called stable, if it is representation independent and the function

f_M : ℝ^Ω → ℝ^Ω,   f_M(A) := (τ_{r(A)} ∘ M_{r(A)} ∘ τ_{r(A)}^{-1})(A),   (5.3)

where r(A) is the rank of A, is continuous.

Figure 5.1: ([42]) The diagram depicting Definition 5.1. Magenta part: depending on the rank of A, the method M provides a specific mapping M_r to be applied to equivalent representations D, D̃ ∈ D_r. Teal part: representation independence states that f_M is well-defined, since both lower paths from A along the data space result in the same output within the whole space. Stability requires that this function, the upper path, is continuous.

For better readability, we will mostly skip the index r for the fixed-rank methods M_r, except if we want to emphasize it. Although we introduce this definition of stability through matrix completion, it can be applied to different situations, subject to nested spaces for which T_{≤r} ⊂ T_{≤r̃} whenever r ≤ r̃ (possibly an entrywise inequality). In particular, the representation map τ_r and data space D_r may be given as in Remark 3.3, for a rank r = {r_e}_{e∈E}, and the method may be any micro-step as in the more general situation in Section 4.5.

The adaption of a discrete model complexity for unstable methods necessarily remains a discrete problem, since a continuous transition in the sense of Definition 5.1 into a more complex model is not possible. This is not inevitably a problem, for instance for overdetermined minimization problems. Yet properly calibrating the rank for unstable methods in the context of ill-posed inverse problems is likely to lead to complications, such as for completion. While even in that case many aspects of the optimization are stable, this does not hold true for the micro-steps M^{(1)} and M^{(2)}:

Example 5.2 (Instability of alternating least squares matrix completion steps [42]). Let a ∈ ℝ \ {0, 1} be a possibly very small parameter. We consider the target matrix M and


an ε-dependent initial approximation A = A(ε),

M := [ ?  1.1  0.9 ;  1  1  1.1 ;  1.1  1  1 ],

A(ε) := [ 1 1 1 ; 1 1 1 ; 1 1 1 ] + ε [ 0.5+a  0.5+a  −0.5−a ;  1+a  1+a  −1−a ;  1−a  1−a  −1+a ],

where the entry M1,1 (the question mark above) is not known or given. The matrix M isof rank 3 and A(ε) is of rank r = 1 for ε = 0 and of rank r = 2 otherwise. We seek a bestapproximation of (at most) rank 2 in the least squares sense for the known entries of M . Ina single ALS step, as defined by Eq. (5.2), we replace Y (ε) of the low-rank representationA(ε) = X(ε)Y (ε) by the local minimizer, where in this case

A(0) = [ 1 ; 1 ; 1 ] [ 1 1 1 ],   A(ε) = [ 1  0.5+a ;  1  1+a ;  1  1−a ] [ 1 1 1 ; ε ε −ε ]   if ε > 0.

This optimization yields a new matrix, B(ε) = f_{M^{(2)}}(A(ε)) = (τ_r ∘ M^{(2)}_r ∘ τ_r^{-1})(A(ε)) (independently of the chosen representation), given by

B(0) = [ 1.05 ∗ ∗ ;  1.05 ∗ ∗ ;  1.05 ∗ ∗ ],   B(ε) = [ 1 + 1/(40a) ∗ ∗ ;  1.0 ∗ ∗ ;  1.1 ∗ ∗ ]   if ε > 0   (∗ denotes some value).

Now let a be fixed and let ε tend to zero, so that the initial guess A(ε) → A(0). However, B(ε) ↛ B(0), thus violating stability. Furthermore, the rank two approximation B(ε), for an arbitrary, fixed ε > 0, diverges as a → 0; in particular it is not convergent, although the initial guess A(ε) converges to a rank two matrix as a → 0. Thus, the micro-step is not even stable for fixed rank. We want to stress that the initial guess is bounded for all ε, a ∈ (0, 1), but the difference between B(0) and B(ε) is unbounded for a → 0 (cf. Definition 4.5). The unboundedness can be remedied by adding a regularization term to the least squares functional, e.g. +‖XY‖, but the ALS step remains unstable.

This example likewise demonstrates that ALS for tensor completion is not stable andthus, as discussed before, problematic when adapting the rank. The following Example 5.3further shows that this is not a marginal phenomenon, but occurs systematically during anyrank change.
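Before turning to Example 5.3, the jump described in Example 5.2 can be reproduced with a few lines of Python. The sketch below is only an illustration (the column-wise least squares routine is a local helper, and the exact printed values are secondary); it builds A(ε) from the factorization given in the example and applies the micro-step M^{(2)} for decreasing ε.

import numpy as np

def micro_step_Y(X, M, mask):
    """Column-wise least squares fit of Y in ||X Y - M||_P (cf. Eq. (5.2))."""
    Y = np.zeros((X.shape[1], M.shape[1]))
    for j in range(M.shape[1]):
        rows = np.flatnonzero(mask[:, j])
        Y[:, j], *_ = np.linalg.lstsq(X[rows], M[rows, j], rcond=None)
    return Y

a = 0.01
M = np.array([[np.nan, 1.1, 0.9],
              [1.0,    1.0, 1.1],
              [1.1,    1.0, 1.0]])
mask = ~np.isnan(M)                              # the entry M_{1,1} is unknown

for eps in [1e-1, 1e-4, 1e-8, 0.0]:
    X = np.array([[1.0, 0.5 + a], [1.0, 1.0 + a], [1.0, 1.0 - a]])
    Y = np.array([[1.0, 1.0, 1.0], [eps, eps, -eps]])
    A = X @ Y                                    # rank 2 for eps > 0, rank 1 for eps = 0
    U, S, _ = np.linalg.svd(A)
    r = int(np.sum(S > 1e-12 * S[0]))
    B = U[:, :r] @ micro_step_Y(U[:, :r], M, mask)
    print(f"eps = {eps:7.1e}:  first column of f(A) = {np.round(B[:, 0], 4)}")
# The first entry stays near 1 + 1/(40a) for eps > 0 and jumps to 1.05 at eps = 0.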

Example 5.3 (ALS for ill-posed, inverse problems is unstable [42]). Consider the micro-step M^{(2)} as in Eq. (5.2). Let U ∈ ℝ^{n×r}, V ∈ ℝ^{m×r} be orthogonal, such that UΣV^T is a truncated SVD of a rank r matrix A = τ_r(U, ΣV^T) ∈ ℝ^{n×m}. We now let σ_r → 0, σ_r > 0, such that in the limit A_* := A|_{σ_r=0} has rank r − 1. The update is independent of this last singular value though:

f_{M^{(2)}}(A) = τ_r(M^{(2)}_r(U, ΣV^T)) = U argmin_Y ‖UY − M‖_P = lim_{ε↘0} f_{M^{(2)}}(A|_{σ_r=ε}).   (5.4)

However, if σ_r = 0, then A|_{σ_r=0} has rank r − 1 and a truncated SVD U_{r−1}Σ_{r−1}V^T_{r−1}. Hence, the update

f_{M^{(2)}}(A|_{σ_r=0}) = τ_{r−1}(M^{(2)}_{r−1}(U_{r−1}, Σ_{r−1}V^T_{r−1})) = U_{r−1} argmin_Y ‖U_{r−1}Y − M‖_P

is in general different from the limit Eq. (5.4), given that the range of U differs from that of U_{r−1}. The same holds for the analogous update M^{(1)} of X = UΣ. Note that these updates are indeed representation independent.

The micro-steps of ALS for tensor completion behave analogously, but with multipletuples of singular values adjacent to one node, or core (cf. Section 3.4).


5.1.2 The Importance of Stability for Iterative Fixed-Rank Methods

Iterative (multilinear) methods such as alternating least squares as well as Riemannian methods do not change the rank of the iterate themselves. The problem with instability can be demonstrated through the following tensor, showing that heuristic rank adaption methods are likely to fail. More generally, Section 8.4.2 shows that the relation between the decay rates of different TT-singular value tuples is very weak. In other words, almost any distribution of singular values is theoretically possible.

Example 5.4 (Rank adaption test tensor [42]). For k ∈ ℕ, a, b > 0, let q ∈ ℝ^{n_1×…×n_4} be an orthogonally decomposable 4-dimensional tensor with TT-rank (k, k, k) and uniform singular values σ^{(1)} = σ^{(2)} = σ^{(3)} = (a, a, …) (cf. Definition 3.18). Further, let b ∈ ℝ^{n_5×n_6} be a rank 2k matrix with exponentially decaying singular values σ^{(5)} ∝ (b^{-1}, b^{-2}, …). Then the separable tensor t ∈ ℝ^{n_1×…×n_6} defined by

t_i = q_{i_1,…,i_4} · b_{i_5,i_6},   i = (i_1, …, i_6) ∈ ⨉_{μ=1}^{6} {1, …, n_μ},

has TT-singular values σ and rank r(t) = (k, k, k, 1, 2k).

By its definition, t can be decomposed into a 4- and a 2-dimensional tensor (q, b). Moreover, for ℓ := min(k_1, k_2, k_3) < k, any low-rank approximation to q with nonuniform rank is as good as the next best uniform one. Explicitly,

min_{q̃ ∈ ℝ^{n_1×…×n_4}, rank_TT(q̃)=(k_1,k_2,k_3)} ‖q − q̃‖ = min_{q̃ ∈ ℝ^{n_1×…×n_4}, rank_TT(q̃)=(ℓ,ℓ,ℓ)} ‖q − q̃‖.

Knowing this would of course drastically simplify the problem. We now consider the performance of two very basic rank adaption schemes.

1. Greedy, single rank increases: We test for maximal improvement by increasing one of the ranks r_μ (μ = 1, …, d−1) of the iterate, starting from r ≡ 1.

Solely increasing either of r_2, r_3 or r_4 will give close to no improvement, as demonstrated above. As further shown in [29], the approximation of orthogonally decomposable tensors with lower rank can be problematic. In numerical tests (cf. Section 5.7), we can observe that r_5 is often increased to a maximum first. Thereby, extremely small singular values are involved that lie far beneath the current approximation error, although the rank is not actually overestimated.

2. Uniform rank increases and coarsening: We increase every rank r_μ (μ = 1, …, d−1), starting from r ≡ 1, and decrease ranks when the corresponding singular values are below a threshold.

The problem with this strategy is quite plain, namely that for the target tensor t, it holds r(t)_5 = 1. If this TT-rank is overestimated, the observed sampling points will be misinterpreted (overfitting), and it does not matter how small the corresponding singular values become (see Example 5.3).

An explicit construction of such a tensor can be found in [42]. One might as well try to increase several ranks at once. Yet the potential improvement then needs to be determined for each of these combinations. This is again a multivariate combinatorial problem, which cannot be reduced to single rank increases, as this would merely correspond to the first strategy above. It hence appears unfavorable to separate the optimization method from the rank adaption. Contrarily, both aspects need to be combined into a single approach.
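One possible construction in the spirit of Example 5.4 (the explicit construction in [42] may differ) is sketched below in Python: q is taken as a superdiagonal, hence orthogonally decomposable, 4-dimensional tensor with k equal terms, B as a matrix with exponentially decaying spectrum, and the TT-singular values of t are read off from SVDs of its separation matricizations. All sizes are illustrative choices.

import numpy as np

def tt_singular_values(t):
    """Singular values of all separation matricizations (modes 1..mu vs. mu+1..d)."""
    out = []
    for mu in range(1, t.ndim):
        M = t.reshape(int(np.prod(t.shape[:mu])), -1)
        out.append(np.linalg.svd(M, compute_uv=False))
    return out

k, n, a, b = 3, 4, 2.0, 10.0
q = np.zeros((n, n, n, n))
for i in range(k):
    q[i, i, i, i] = a                      # uniform TT-singular values (a, ..., a)
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((2 * k + 2, 2 * k)))
V, _ = np.linalg.qr(rng.standard_normal((2 * k + 2, 2 * k)))
B = U @ np.diag(b ** -np.arange(1, 2 * k + 1)) @ V.T   # rank 2k, decay ~ b^{-i}
t = np.multiply.outer(q, B)                # separable tensor t_i = q_{i1..i4} * B_{i5,i6}

for mu, s in enumerate(tt_singular_values(t), start=1):
    rank = int(np.sum(s > 1e-12 * s[0]))
    print(f"mu = {mu}: TT-rank {rank}, leading singular values {np.round(s[:4], 4)}")
# Expected ranks: (k, k, k, 1, 2k), with a uniform spectrum in the first three
# separations and an exponentially decaying one in the last.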


5.1.3 Overview over Rank Adaption Strategies in Literature

Not many rank adaption strategies for fixed-rank tensor methods in the context of underdetermined systems appear in the literature. One of the earliest works for rank-adaptive matrix completion is [102], which suggests to start with a low rank (complexity) and to increase it until overfitting occurs. While [41] follows this idea, uniform ranks are assumed, which simplifies the adaption problem substantially. [94] presents a more elaborate scheme for their algorithm called Rttc. However, the optimization is likewise not stable in the sense of Definition 5.1, and the algorithm suffers from this in cases as for example considered in Example 5.4 (see the comparison in [42]). The rank adaptation presented in [44] is a sophisticated, largely heuristic greedy strategy, but the resulting algorithm is also not stable. The numerical tests provided so far are unfortunately only for a 6-dimensional function which does not depend on two of its variables (so in effect it is four-dimensional). While in the shown example the adapted rank successfully recovers the corresponding ranks equal to one, all other ranks are still uniform. So it is not clear whether just the function requires a uniform rank adaption, or if this is a weakness of the presented strategy. Further, the sampling set used in the test is comparatively large.

Rank adaption certainly does not need to be separate from the actual optimization, and may in that sense not appear as a supplemental problem. In nuclear norm minimization, the rank constraint becomes the target function itself, in the form of the ℓ_1-norm of the singular values. Also in reweighted least squares matrix completion [16, 19], the approximation itself is different (cf. Section 5.4.1). We follow a related approach in order to rid ourselves of the necessity of an additional rank adaption heuristic.

5.2 Stable Alternating Least Squares Micro-Steps for Matrix Completion

In this section, we consider how to stabilize the ALS micro-steps in a minimally invasive way by means of averaging. Alternatively, as we discuss in Section 5.4, a very similar result can be derived based on so-called reweighted least squares, which stems from nuclear norm minimization. The latter approach is more direct and simple, but on the other hand, it does not naturally provide problem dependent scaling factors.

5.2.1 Stability through Convolution

In the setting of matrix recovery, where the restricted isometry property (RIP) may hold for low ranks (cf. Section 4.6), [99] (as one example of many) gives criteria on the RIP constant for the existence of a unique minimizer. Even though the RIP does not hold true for tensor completion, one expects at least a similar behavior. The further the rank is increased, the more erroneous minima appear, an effect essentially known as overfitting. Rank increasing strategies aim to avoid this dilemma and are combined with cross-validation methods or similar in order to determine appropriate termination criteria. Yet still, the transition from one rank to a higher one is problematic, in particular if the rank is multivariate, as we have discussed in the previous sections. In the worst case, even an arbitrarily small perturbation of the iterate may lead to an arbitrarily large error of the completion, as demonstrated in Example 5.2.

Instead of the discrete rank, the magnitudes of the singular values are more meaningful, yet ALS ignores them completely (cf. Example 5.3). The approach we consider here is an averaging of the micro-step results, which in a certain sense assumes an uncertainty on the


current iterate. We will see that this way, the level of regularization can be adapted continuously. Thereby, the complexity of the model is no longer determined by a discrete, multivariate rank, but by one real-valued parameter that is put into relation with the singular values of the iterate. While optimization methods usually keep distance from the border of a manifold of fixed rank, possibly through an additional regularization, we contrarily aim to optimize continuously between manifolds of different ranks.

We consider one possibility to stabilize M^{(2)} (cf. Eq. (5.2)) in the matrix case. For the remainder of the section, we abbreviate this micro-step as M. For a parameter ω > 0 and an upper bound r̄, let the function g_ω : T_{r̄} → ℝ be defined as

g_ω : H ↦ 1 if ‖H‖_F ≤ ω, and 0 otherwise.

The function f_M : ℝ^{n_1×n_2} → ℝ^{n_1×n_2} as in Definition 5.1 is not continuous. Assuming its local integrability, the convolution

f*_M := f_M *_{T_{r̄}} g_ω   (5.5)

given by

f*_M(A) = E_{H∈V_{A,ω}} f_M(H) = (1/|V_{A,ω}|) ∫_{V_{A,ω}} f_M(H) dH,   (5.6)
V_{A,ω} := {H ∈ T_{r̄} | ‖H − A‖_F ≤ ω},

in turn is continuous within T_{r̄} (cf. Fig. 5.2). This function however does not preserve the low-rank structure. Therefore, there does not exist a method M* for which f*_M = f_{M*}. Apart from that, as it appears, it is too complicated to evaluate. Still, the simplifications in the following remain subject to this motivation.

B2

B3

C

fM(Bi)

fM(C)

A

fM(A)

Tr

Tr

VA,ω

I CB1

B2

B3

f ∗M(B1)

f ∗M(B2)

f ∗M(B3)

f ∗M(C)

A

f ∗M(A)

Tr

Tr

Figure 5.2: ([42]GrKr19) The schematic display of the unstable function fM (left) and the variational, stablef∗M (right). In both pictures, the range of τr is depicted as black curve contained in the range of τr shownas blue area (with magenta boundary). A is a rank r element, while C and each Bi has rank r. Left:Regardless of their distance to A, the tensors B1, B2 and B3 (and any other point of the dotted line exceptthe lower rank element A) are mapped to the same point fM(Bi). Likewise, C is, although as close toA as B1, mapped to a completely different point. The teal circle exemplarily shows one possible range ofaveraging at the point A. Right: If an element (such as B1 and C) is close to A, then this also holds fortheir function values. However, f∗M(A) is not rank r anymore (in fact, the range of f∗M is generally noteven rank r).

5.2.2 Exemplary Considerations regarding Stability

We demonstrate that certain simplifications still yield the same effect as in Section 5.2.1.Contrarily to the previous, rough motivation, we here consider examples that are narrowed


down to a point that allows for simpler, explicit calculations. Let therefore A = UΣV^T be an SVD of the current iterate within an ALS iteration. We limit the perturbation of A to only the second column of U, as is approximately the case if σ_1 ≫ σ_2 ≈ ω and V is fixed:

Lemma 5.5 (Variational low-rank matrix approximation [42]GrKr19). Let M be defined byEq. (5.2) and P = Ω (full sampling). Further, let A = UΣV T ∈ Rn1×n2 be of rank two,given by its SVD components U = (u1 | u2), Σ = diag(σ1, σ2) and V as well as M ∈ Rn1×n2

arbitrary and 0 < ω <√

2σ2. Then

fM(A) :=1

|Vω|

fM((u1 | u2 + ∆u2)ΣV T

)d∆u2

= u1uT1 M︸ ︷︷ ︸

optimization

+ (1− αω)2u2uT2 M︸ ︷︷ ︸

regularization

+2αω − α2

ω

n1 − 2(In1− u1u

T1 − u2u

T2 )M

︸ ︷︷ ︸replenishment

, (5.7)

Vω :=

∆u2 | ‖(u1 | u2 + ∆u2)ΣV T −A‖F = ω, (u1 | u2 + ∆u2) has orthonormal columns

for αω = ω2

2σ22

, such that αω → 1 if ω →√

2σ2. Alternatively, considering complete uncer-

tainty concerning the second singular vector, we obtain

1

|Vω|

fM((u1 | ∆u2)ΣV T

)d∆u2 = u1u

T1 M +

1

n1 − 1(In1 − u1u

T1 )M,

where here Vω := ∆u2 | (u1 | ∆u2) has orthonormal columns.

Proof. ([42]GrKr19) We parameterize Vω. First, ω = ‖(u1 | u2+∆u2)ΣV T−A‖F = ‖∆u2‖F σ2

and hence ‖∆u2‖F = ωσ2

. By orthogonality conditions, we obtain ∆u2 = −αωu2 + ∆u⊥2with ∆u⊥2 ⊥ range(U), ‖∆u⊥2 ‖ = βω. The two constant αω and βω are determined by theequations

α2ω + β2

ω =ω2

σ22

,

(1− αω)2 + β2ω = 1,

by which αω = ω2

2σ22, βω =

√2αω − α2

ω. Hence, Vω is an (m − 3)-sphere of radius βω =√ω2

σ22− α2

ω, that is

Vω = −αωu2 +HβωSm−2,

for a column-orthogonal matrix H : Rn1×n1−2 with range(H) ⊥ range(U). Thus, HHT =In1 − u1u

T1 − u2u

T2 . The update for each instance of ∆u⊥2 is given by

fM((u1 | u2 + ∆u2)ΣV T ) = (u1 | u2 + ∆u2)(u1 | u2 + ∆u2)TM.

We integrate this over Vω and obtain∫

fM =

u1uT1 M +

(1− αω)2u2uT2 M +

∆u⊥2 ∆u⊥2TM,

since all integrals of summands which contain ∆u⊥2 exactly once vanish due to symmetry.We can simplify the last summand with Lemma 5.10 to

∆u⊥2 ∆u⊥2TM =

βωSm−2

(Hx)(Hx)TM dx = HHT 2αω − α2ω

m− 2|Vω|M.

Page 115: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

5. Stable ALS for Rank-Adaptive Tensor Approximation and Recovery 99

One can then conclude that HHT = Im − u1uT1 − u2u

T2 , since the rank of H is m − 2 and

range(H) ⊥ range(U). The division by |Vω| then finishes the first part. The second part isanalogous.

The influence of the second singular vector u2 is filtered out, the larger ω becomes, untilthe update resembles the same which the rank 1-approximation to A would yield1. Althoughthe result fM(A) in Eq. (5.7) is not low-rank, it is close to the rank 2 approximationU(uT1 M | (1 − αω)2uT2 M). Furthermore, the first component U has remained the same.Still, the model as in Lemma 5.5 is too complicated, and we modify it one more time:

Lemma 5.6 (Low-rank matrix approximation using a variational residual map [42]GrKr19).In the situation of Lemma 5.5, we have

argminV

1

|Vω|

‖(u1 | u2 + ∆u2)V −M‖2F d∆u2 = (uT1 M | (1− αω)uT2 M) (5.8)

Proof. ([42]GrKr19) Let V be the minimizer of the left-hand side of Eq. (5.8). Using thenormal equation, we obtain

|Vω|∫

(u1 | u2 + ∆u2)T (u1 | u2 + ∆u2) d∆u2V =

(u1 | u2 + ∆u2)TM d∆u2.

Since (u1 | u2 + ∆u2) is column-orthogonal for each ∆u2, this is equivalent to

V =

(uT1 M |

1

|Vω|

(u2 + ∆u2)TM d∆u2

)

=

(uT1 M | uT2 M +

1

|Vω|

−αωuT2 M + ∆u⊥2TMd∆u2

)

=(uT1 M | (1− αω)uT2 M

),

where the same argumentation as in the proof of Lemma 5.5 has been applied. Further, weused that

∫Vω

∆u⊥2 d∆u2 = 0 due to symmetry.

The updated matrix UV given through Eq. (5.8) is close to a rank 2 approximation of

fM(A) as in Eq. (5.7), where the factor (1−αω)2 has been replaced by (1−αω). Despite thedrastic simplifications applied in this section, the important observation is that the model inLemma 5.6 achieves a very similar effect as the convolution Eq. (5.5). Since it is furthermoregiven through a quite natural process, it serves as draft for the variational residual functionin Definition 5.7.

5.2.3 Variational Residual Function for Matrix Completion

We adapt the target function of the micro-steps M(1) and M(2) described in Section 5.1.1to obtain stable methods (M(1))∗ and (M(2))∗. As the have discussed in Section 5.2.1,instead of the one update given through the current iterate A, the idea to average over aneighborhood yields a good candidate for a stable method. Further, Section 5.2.2 showshow this might be achieved through more simple approaches. Similar as in Section 5.2.2,for r = rank(A), let

Vω(A) := ∆A | A+ ∆A ∈ Tr, ‖∆A‖F ≤ ω.

1Note that we fixed ‖∆u2‖F = ω for simplicity as well as that for ω >√

2σ, Lemma 5.5 does not makesense. Allowing perturbations up to a magnitude ω will prohibit that the influence of u2 vanishes completely,hence u2 is never actually truncated.

Page 116: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

100 5.2. Stable Alternating Least Squares Micro-Steps for Matrix Completion

Further, let A = τr(X,Y ) and (∆X,∆Y ) be such that A+ ∆A = τr(X +ω∆X,Y +ω∆Y ).Then

‖∆A‖2F = ‖(X + ω∆X)(Y + ω∆Y )−XY ‖2F= ‖ω(∆XY +X∆Y )‖2F +

(O(ω2)

)2. (5.9)

The term ‖∆XY + X∆Y ‖2F can be approximated, assuming the angles between the threesummands are small, by ‖∆XY ‖2F + ‖X∆Y ‖2F . This and Lemma 5.6 then motivate thefollowing definition.

Definition 5.7 (Variational residual function [42]). Let ω ≥ 0, M be the target matrix, P be the sampling set and let A = XY be the current iterate. We define the variational residual function C := C_{M,P,X,Y} : D_r → ℝ by

C(X̃, Ỹ) := ∫_{V_ω(X,Y)} ‖(X̃ + ∆X)(Ỹ + ∆Y) − M‖_P^2 d∆X d∆Y,   (5.10)
V_ω(X, Y) := {(∆X, ∆Y) | ‖∆X Y‖_F^2 + ‖X ∆Y‖_F^2 ≤ ω^2}.

It is important to note that Vω does not depend on the unknown X, Y , respectively, buton the current iterate.

5.2.4 Minimizer of the Variational Residual Function for Matrices

We define our modified methods as

(M(1))∗(X, Y ) := (argminX

C(X, Y ), Y ), (5.11)

(M(2))∗(X, Y ) := (X, argminY

C(X, Y )), (5.12)

with C = CM,P,X,Y as in Eq. (5.10). We will later see that each minimizer is unique, but forformality we again use the minimization of the Frobenius norm of the iterate as secondarycriterion. The low-rank decomposition of a rank r matrix A is not unique, since

τr(X,Y ) = τr(X, Y ) ⇔ X = XT, Y = T−1Y, (5.13)

for an invertible matrix T ∈ Rr×r. The first step towards stability is the following result.

Proposition 5.8 ([42]GrKr19). The two methods (M(1))∗ Eq. (5.11) and (M(2))∗ Eq. (5.12)are representation independent.

Proof. Let Y + := argminY C(X, Y ) for C = CM,P,X,Y as in Definition 5.7 as well as Y + :=

argminY C(X, Y ) for C = CM,P,X,Y for two equivalent representations τr(X,Y ) = A =

τr(X, Y ). There is hence an invertible matrix T ∈ Rr×r such that Eq. (5.13) holds true.Thus,

C(X, Y ) =

Vω(X,Y )

‖(X + ∆XT−1)(T Y + T∆Y )−M‖2P d∆X d∆Y ,

Vω(X, Y ) := (∆X,∆Y ) | ‖∆XT−1Y ‖2F + ‖XT∆Y ‖2F ≤ ω2.

The substitution (∆X,∆Y )ι7→ (∆XT, T−1∆Y ) introduces a constant Jacobi Determinant

|det(Jι)| = 1. We obtain

C(X, Y ) =

Vω(X,Y )

‖(X + ∆X)(T Y + ∆Y )−M‖2P d∆X d∆Y.

Hence T Y + = Y +, by which τr(X, Y+) = τr(XT, T

−1Y +) = τr(X,Y+). This was to be

shown.

Page 117: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

5. Stable ALS for Rank-Adaptive Tensor Approximation and Recovery 101

We require some notations for matrices:

Definition 5.9 (Restrictions). For a matrix A ∈ RΩ and index set S ⊂ Ω, we use A|S ∈RS ∼= R|S| as restriction to that set. For a matrix M , let M:,i be its i-th column and Mi,: beits i-th row.

In order to calculate the minimizer of the variational residual function, we require thefollowing assertion, which we already applied in the motivational Section 5.2.2.

Lemma 5.10 (Integral over all variations [42]GrKr19). Let p, q ∈ N, ω ≥ 0 and H ∈ Rp×p bea matrix as well as

V := X ∈ Rp×q | ‖X‖F = ω.

Then ∫

V

XTHX dX =ω2|V |pq

trace(H)Iq, |V | :=∫

V

1.

Proof. ([42]GrKr19) Let S be the result of the above integral. Then

Sij = trace(Sij) =

V

trace(XT:,iHX:,j) dX = trace(H

V

X:,jXT:,i) dX.

Due to symmetry, for some a ∈ R, we have

V

vec(X)vec(X)T dX = aIpq,

V

vec(X)Tvec(X) dX = ω2|V |.

Since the second term is the trace of the first one, it follows that a = ω2|V |/(pq). We canhence simplify

Sij =

ω2|V |/(pq) trace(H) if i = j,

0 otherwise,

which is the to be proven statement.
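Lemma 5.10 is easy to check empirically. The following Python sketch (a Monte Carlo sanity check, not part of the thesis) samples X uniformly on the Frobenius sphere of radius ω, which matches the uniform surface measure used above, and compares the empirical average of X^T H X with ω^2 trace(H)/(pq) I_q.

import numpy as np

rng = np.random.default_rng(1)
p, q, omega, n_samples = 5, 3, 0.7, 100_000
H = rng.standard_normal((p, p))
H = H + H.T                                   # any fixed matrix works

acc = np.zeros((q, q))
for _ in range(n_samples):
    G = rng.standard_normal((p, q))
    X = omega * G / np.linalg.norm(G)         # uniform on the Frobenius sphere
    acc += X.T @ H @ X
empirical = acc / n_samples
predicted = omega**2 * np.trace(H) / (p * q) * np.eye(q)
print("max deviation:", np.abs(empirical - predicted).max())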

Corollary 5.11. In the situation of Lemma 5.10, let instead

V := X ∈ Rp×q | X is column-orthogonal .

Then

EX∈V XTHX :=1

|V |

V

XTHX dX =trace(H)

pIq.

Proof. The proof is analogous to the one of Lemma 5.10 for ω = ‖X‖F =√q.

We now derive the minimizer of the variational residual function for matrices definedthrough Eq. (5.10). Through the orthogonality constraints of the essentially unique SVD(cf. Theorem 3.16), Vω as in Definition 5.7 takes the easier forms

Vω(U Σ, V T ) = (∆X,∆Y ) | ‖∆XV T ‖2F + ‖UΣ∆Y ‖2F ≤ ω2= (∆X,∆Y ) | ‖∆X‖2F + ‖Σ∆Y ‖2F ≤ ω2,

Vω(U, Σ V T ) = (∆X,∆Y ) | ‖∆XΣ‖2F + ‖∆Y ‖2F ≤ ω2.

Page 118: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

102 5.3. Stability and Rank Adaption

Theorem 5.12 (Minimizer of the ALS variational residual function for matrices [42]GrKr19).Let A ∈ Rn1×n2 be the current iterate with r = rank(A) and let UΣV T be a truncated SVD,U ∈ Rn1×r, Σ ∈ Rr×r, V T ∈ Rr×n2 , of A. Further, let C be the variational residual functionas in Eq. (5.10). The minimizer X+ of X 7→ CM,P,UΣ,V T (X, V T ) is given by

X+i,: = argmin

Xi,:

‖Xi,:VT −Mi,:‖2Pi,:︸ ︷︷ ︸

standard ALS

+|Pi,:|n2

ω2 ζ2 ‖Xi,:Σ−1‖2F

︸ ︷︷ ︸regularization

,

where Pi,: := p(k)2 | p(k)

1 = i, k = 1, . . . ,m is the corresponding part2 of the index set P .

The minimizer Y + of Y 7→ CM,P,U,ΣV T (U, Y ) is given by

Y +:,j = argmin

Y:,j

‖U Y:,j −M:,j‖2P:,j︸ ︷︷ ︸standard ALS

+|P:,j |n1

ω2 ζ1 ‖Σ−1 Y:,j‖2F︸ ︷︷ ︸

regularization

,

where P:,j := p(k)1 | p(k)

2 = j, k = 1, . . . ,m. The constants ζ1 and ζ2 only depend on theproportions of the representation and sampling set (cf. Remark 5.13).

The factors|Pi,:|n2

and|P:,j |n1

normalize the penalty terms to the particular magnitudesof the corresponding shares of the sampling set P and hence the standard ALS part. Thefactors ζ1, ζ2 ∈ (0, 1) are in turn independent of i, j, respectively. The ratio of both termsequals the ratio of the mode sizes n1, n2.

Proof. Despite the symmetry arguments in Lemma 5.10, the proof is still technical. It canbe found in [42]GrKr19.

Remark 5.13 (Specification of constants). Let #X := n1r, #Y := rn2 be the sizes of thecomponents in the matrix decomposition. The constants in Theorem 5.12 are given by

ζ2 =#Y

r(#X + #Y + 2), ζ1 =

#X

r(#X + #Y + 2).

However, when changing the rank, this would impose a slight offset in continuity of bothfM(1)∗ and fM(2)∗ . This problem is resolved by substituting ω by ω properly for each valuer. In [42]GrKr19, the substitution

ω2ζ2 = ω2 n2

n1 + n2, ω2ζ1 = ω2 n1

n1 + n2

is chosen. Despite this replacement, one still just writes ω.

We will see in Section 5.4 that there are also certain arguments to simply enforce ζ1 = ζ2 = 1.

The manipulation of ω through Remark 5.13 does not change the representation inde-pendency of (M(1))∗ and (M(2))∗, since for fixed rank, the parameter ω > 0 is merely aconstant.

5.3 Stability and Rank Adaption

In this section, we discuss the relation between stability and implicit rank adaption, andprovide algorithmic details to the so-called stable alternating least squares approximation(Salsa).

2Related to the more general situation in Section 4.5.2, i.e. Eq. (4.32), we have Pi,: = p(k)2 ∈ R | k ∈

[m](1,i)P and P:,j = p(k)

1 ∈ R | k ∈ [m](2,j)P .

Page 119: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

5. Stable ALS for Rank-Adaptive Tensor Approximation and Recovery 103

5.3.1 Stability of Regularized Micro-Steps

In order to prove the stability of the two regularized methods (M(1))∗ and (M(2))∗, werequire one last assertion.

Lemma 5.14 (Partial matrix inverse by divergent parts [42]GrKr19). We partition the index

set 1, . . . , n = ωj∪ωcj (ωcj = 1, . . . , n\ωj), j = 1, 2 and define Ω := ω1×ω2, Ω := ωc1×ωc2.

Let A(k)k, J (k)k ⊂ Rn×n be series of symmetric matrices, supp(J (k)) ⊂ Ω.If limk→∞A(k)|Ω = A|Ω, A|Ω s.p.d, and σmin(J (k)|Ω) → ∞, then V := limk→∞(A(k) +

J (k))−1 exists and we have V |Ω = (A|Ω)−1 and V |Ωc = 0 (Ωc = 1, . . . , n2 \ Ω).

Proof. ([42]GrKr19) First, w.l.o.g., let Ω = m + 1, . . . , n2. Otherwise we can apply permu-tations. Further, let V (k) := A(k) + J (k). We partition our (symmetric) matrices M forM1,1 ∈ Rm×m block-wise as

M =

(M1,1 M1,2

MT1,2 M2,2

).

Note that J(k)1,1 , J

(k)1,2 ≡ 0. Since A

(k)1,1 = V

(k)1,1 and A1,1 = A|Ω is s.p.d, A

(k)1,1 is invertible

for all k > K for some K and hence limk→∞(V(k)1,1 )−1 = A−1

1,1. Further, σmin(B(k)2,2 ) >

σmin(J(k)2,2 )−σmax(A

(k)2,2)→∞ and hence ‖(V (k)

2,2 )−1‖ → 0. Therefore, for k > K and H(k) :=

V(k)1,1 − V

(k)1,2 (V

(k)2,2 )−1(V

(k)1,2 )T , it is σmin(H(k)) > σmin(A1,1)/2. By block-wise inversion of

V (k), it then follows ((V (k))−1)1,1 = (H(k))−1 → (A(k)1,1)−1. Similarly, ((V (k))−1)|Ω → 0.

The restricted isometry property (RIP) (cf. [13]) is not fulfilled for the sampling operator(·) 7→ (·)|P in the case of matrix completion. In the tensor case, we have instead consideredthe internal TRIP (cf. Definitions 4.3 and 4.5), which we can also translate to this simplersetting:

We say A = τr(X,Y ) fulfills the iRIP in the first component with respect to the samplingoperator (·) 7→ (·)|P and constant δA if there exists ρ > 0 such that

(1− δA)‖τr(X, Y )‖2F ≤ ρ‖τr(X, Y )‖2P ≤ (1 + δA)‖τr(X, Y )‖2F , ∀ X ∈ Rn1×r. (5.14)

For the second component, this condition is replaced by

(1− δA)‖τr(X, Y )‖2F ≤ ρ‖τr(X, Y )‖2P ≤ (1 + δA)‖τr(X, Y )‖2F , ∀ Y ∈ Rr×n2 . (5.15)

Also Corollary 4.6 has a simple analog given an SVD A = UΣV T . The iRIP in the firstcomponent with constant δA is equivalent to

κ2(diag((V T ):,P:,1, . . . , (V T ):,P:,n2

)) ≤ 1 + δA1− δA

.

Each (V T ):,P:,1 is hence required to have full row-rank in order for the iRIP to be fulfilled.For the second component, this condition is replaced by

κ2(diag(UP:,1,:, . . . , UP:,n1 ,:)) ≤ 1 + δA

1− δA.

Each UP:,j ,: is hence required to have full column-rank.

Theorem 5.15 (Stability of the variational matrix methods [42]GrKr19). The regularizedmethods (M(1))∗ Eq. (5.11) and (M(2))∗ Eq. (5.12) are stable (for ω > 0 and ζ1, ζ2 > 0which do not depend on the rank).

The unregularized method (ω = 0) provides stability only for fixed rank (cf. Exam-

Page 120: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

104 5.3. Stability and Rank Adaption

ple 5.3), and only at points A∗ that fulfill the iRIP as above (in the first (Eq. (5.14))or second (Eq. (5.15)) component, respectively).

There are some technicalities involved in the proof of Theorem 5.15. We first considerthe case P = Ω (analogous to Section 5.2.2), for which the simplicity of the idea becomesapparent. For a certain constant c ∈ R, in the setting of Theorem 5.12, we have

fM(2)∗(UΣV T ) = U · (I + cΣ−2)−1

︸ ︷︷ ︸regularization

· UT M︸ ︷︷ ︸standard ALS

(5.16)

= ( (1 + cσ−21 )−1 · U:,1U

T:,1M, . . . , (1 + cσ−2

r )−1 · U:,rUT:,rM ).

If now σr → 0, then also (1+cσ−2r )−1 → 0 and we obtain the same result as if we would have

truncated the representation (X,Y ) = (U,ΣV T ) to rank r−1 beforehand. These additionalfactors hence filter out influence corresponding to low singular values. Note that obtainingsmall singular values is not penalized, but using components corresponding to small ones is.
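Eq. (5.16) can thus be read as a spectral filter. The following two-line Python sketch (c and the singular values are arbitrary illustrative choices) shows how the factors (1 + c σ_i^{-2})^{-1} switch components off once σ_i falls below √c.

import numpy as np

c = 1e-2                                        # plays the role of the regularization weight
sigma = np.array([1.0, 0.5, 0.1, 0.03, 0.001])  # hypothetical singular values of the iterate
filters = 1.0 / (1.0 + c / sigma**2)            # factors appearing in Eq. (5.16)
for s, f in zip(sigma, filters):
    print(f"sigma = {s:7.3f}  ->  filter factor {f:.4f}")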

Proof. (of Theorem 5.15). We will only show the proof for (M(2))∗, since the other case isanalogous.Fixed-rank stability: We first show that (M(2))∗ is fixed-rank stable. Let Aii ⊂ Rn1×n2

be a sequence of matrices with rank(Ai) = r∗ and Ai → A∗ ∈ Rn1×n2 . Let (U∗,Σ∗, V ∗) bean SVD of A∗ and, for each i, let (Ui,Σi, Vi) be an SVD of Ai. We partition the indicesfor σ∗ by the K-tuple k according to equality of singular values, such that σ∗1 = . . . =σ∗k1 > σ∗k1+1 = . . . = σ∗k2 > . . . > σ∗kK−1+1 = . . . = σ∗kK > 0. Since Ai → A∗, their

singular values also converge, σi → σ∗ (e.g. [103]). We can hence conclude from [24, 101]that there exists a sequence of block diagonal, orthogonal matrices Wi ∈ Rr

∗×r∗ with blocksizes k1, k2 − k1, . . . , kK − kK−1 such that

UiWi → U∗. (5.17)

Due to their block structure, these matrices Wi commute with Σ∗.We have to show that the tensors A+

i := τr(Ui, Y+i ) = τr((M(2))∗(Ui,ΣiV Ti )) converge to

(A∗)+ := τr(U∗, (Y ∗)+) = τr((M(2))∗(U∗,Σ∗(V ∗)T )). For fixed j = 1, . . . , n2, we have

(Y +i ):,j = argmin

Y:,j

∥∥∥∥(

(Ui)P:,j ,:

cΣ−1i

)Y:,j −

(MP:,j ,j

0

)∥∥∥∥ , (5.18)

(A+i ):,j = Ui (Y +

i ):,j ,

for a constant c that is equal for each i and the analogously defined case for (Y ∗)+:,j . It

follows that

(A+i ):,j = Ui

((Ui)

TP:,j ,:(Ui)P:,j ,: + c2Σ−2

i

)−1

(Ui)TP:,j ,:MP:,j ,j (5.19)

= [UiWi

−1([(Ui)P:,j ,:Wi]

T [(Ui)P:,j ,:Wi] + c2[WTi Σ−2

i Wi])

︸ ︷︷ ︸=:Bi∈Rr∗×r∗

[(Ui)P:,j ,:Wi]TMP:,j ,j .

Due to Eq. (5.17), it holds (Ui)P:,j ,:Wi → U∗P:,j ,:and further WT

i Σ−2i Wi − (Σ∗)−2 =

WTi Σ−2

i Wi − WTi (Σ∗)−2Wi = WT

i (Σ−2i − (Σ∗)−2)Wi → 0. We treat the cases ω = 0

and ω > 0 separately:

(i) ω = 0: We have c = 0. If and only if the iRIP in the second component is ful-filled for A∗ (with respect to the sampling operator), then, for the smallest singular value,

Page 121: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

5. Stable ALS for Rank-Adaptive Tensor Approximation and Recovery 105

σmin(U∗P:,j ,:) > 0 and therefore (A+

i ):,j → (A∗)+:,j .

(ii) ω > 0: Here, the same result follows since σmin(c2(Σ∗)−2) > 0 and σmin(U∗P:,j ,:) ≥ 0.

This proves fixed-rank stability.

Stability: Let now Ai have arbitrary ranks. Without loss of generality, by considerationof a finite amount of subsequences, we can assume that rank(Ai) ≡ r for all i. Since T≤r =⋃r≤r Tr is closed, r ≥ r∗ must hold true, by which we may have (σi)kK+1, . . . , (σi)kK+1

→ 0.We expand the matrices Wi by identities of appropriate sizes to account for these vanishingsingular values, Wi ← diag(Wi, IkK+1−kK ), such that now Bi ∈ Rr×r in Eq. (5.19). LetΩ = 1, . . . , r∗2, Ω = r∗ + 1, . . . , r2. Then

σmin([WTi Σ−2

i Wi]|Ω)→∞.

We can conclude with Lemma 5.14 that Bi|Ω → B∗ as well as Bi|Ωc → 0. Due to thisrestriction, we again obtain convergence (A+

i ):,j → ((A∗)+)+:,j , since all parts that correspond

to vanishing singular values, also vanish in the limit of Bi. This finishes the proof.

Under mild conditions towards the sampling set, almost every matrix A fulfills the iRIP.However, those which do not, can be densely scattered (as discussed for the tensor trainformat in [42]GrKr19):

Proposition 5.16 (iRIP under perturbation). Let $B \in \mathbb{R}^{n_1 \times n_2}$ be a matrix with singular values $\sigma^{(B)}$. Assume that for one $j \in \{1, \ldots, n_2\}$ it holds $|P_{:,j}| < n_1$ (otherwise, every point is sampled). Then for every $\sigma^* > 0$ and $r \leq \min(n_1, n_2)$, there exists a matrix $A$ with rank $r$ and $\|A - B\|_F^2 \leq \sum_{i=r}^{\infty} (\sigma^{(B)}_i)^2 + (\sigma^*)^2$, such that $A$ does not fulfill the iRIP in the second component (see Eq. (5.15)).

Hence, if $B$ already has rank $r$, then $\|A - B\|_F^2 \leq (\sigma^{(B)}_r)^2 + (\sigma^*)^2$ suffices.

Proof. We assume without loss of generality that $P_{:,j} = \{1, \ldots, k\}$, $k = |P_{:,j}| < n_1$. Let $B = U \Sigma^{(B)} V^T$ be an SVD of $B$, and
$$U_{:,\{1,\ldots,r\}} =: \begin{pmatrix} X & x \\ Y & y \end{pmatrix}, \qquad \overline{U} := \begin{pmatrix} X & \overline{x} \\ Y & \overline{y} \end{pmatrix}, \qquad X \in \mathbb{R}^{k \times (r-1)}.$$
If $X$ is already singular, then we may choose $\overline{x} = x$ and $\overline{y} = y$ (that is, $A = B$). Otherwise, we may choose $\overline{x} = a X v$ and $\overline{y} = a\,\tilde{y}$ for an arbitrary vector $\tilde{y} \neq 0$, with $v = -(X^T X)^{-1} Y^T \tilde{y}$ and $a = \|(Xv; \tilde{y})\|_2^{-1}$. In all three cases, $\overline{U}$ is orthogonal and we obtain an SVD defining
$$A := \overline{U} \cdot \mathrm{diag}\big(\sigma^{(B)}_1, \ldots, \sigma^{(B)}_{r-1}, \sigma^*\big) \cdot \big(V_{:,\{1,\ldots,r\}}\big)^T.$$
For this matrix, $\|A - B\|_F^2 \leq \sum_{i=r}^{\infty} (\sigma^{(B)}_i)^2 + (\sigma^*)^2$ holds true. Yet $\overline{U}_{P_{:,j},:}$ is a singular matrix, and $A$ does not fulfill the iRIP in the second component.

5.3.2 Algorithmic Aspects and Rank Adaption

In one sweep of the stable alternating least squares approximation, the two stabilized methods $(\mathcal{M}^{(1)})^*$ and $(\mathcal{M}^{(2)})^*$ are applied. Before each update, any current singular value $\sigma_i < \sigma_{\min}$ is replaced by an artificial value $\sigma_{\min} > 0$, which in each iteration is set as a fraction $f_{\sigma_{\min}} \ll 1$ of the current residual (cf. Algorithm 10). The influence of this manipulation on the subsequent step is thereby marginal, but it is necessary since otherwise $\sigma_i$ may irreversibly converge to zero (cf. Section 5.4.3). In the following algorithm, we as usual denote $\Sigma = \mathrm{diag}(\sigma)$ and, for the modified singular values, $\overline{\Sigma} = \mathrm{diag}(\overline{\sigma})$.


Algorithm 9 Stable Matrix Completion [42]

Input: limit $\sigma_{\min}$, parameter $\omega$, initial guess $A = \tau_r(X, Y) \in \mathbb{R}^{n_1 \times n_2}$ such that $X$ contains the left singular vectors of $A$, and data points $M|_P \in \mathbb{R}^m$
Output: updated representation $(X, Y)$ after application of the left- and right-sided stable micro-steps $(\mathcal{M}^{(1)})^*, (\mathcal{M}^{(2)})^*$

1: function msalsasweep($X, Y, \sigma_{\min}, \omega$)
2:   compute the SVD $U \Sigma V^T \leftarrow Y$ and update $\sigma_i \leftarrow \max(\sigma_i, \sigma_{\min})$, $i = 1, \ldots, r$
3:   set $X \leftarrow X U \Sigma$ and $Y \leftarrow V^T$   ▷ $Y$ is now row-orthogonal
4:   for $i = 1, \ldots, n_1$ update   ▷ cf. Theorem 5.12
$$X_{i,:} \leftarrow \operatorname*{argmin}_{X_{i,:}} \|X_{i,:} Y - M_{i,:}\|^2_{P_{i,:}} + \frac{|P_{i,:}|}{n_2}\, \omega^2 \zeta_2 \|X_{i,:} \Sigma^{-1}\|_F^2 \quad (5.20)$$
5:   compute the SVD $U \Sigma V^T \leftarrow X$ and update $\sigma_i \leftarrow \max(\sigma_i, \sigma_{\min})$, $i = 1, \ldots, r$
6:   set $X \leftarrow U$ and $Y \leftarrow \Sigma V^T Y$   ▷ $X$ is now column-orthogonal
7:   for $j = 1, \ldots, n_2$ update   ▷ cf. Theorem 5.12
$$Y_{:,j} \leftarrow \operatorname*{argmin}_{Y_{:,j}} \|X Y_{:,j} - M_{:,j}\|^2_{P_{:,j}} + \frac{|P_{:,j}|}{n_1}\, \omega^2 \zeta_1 \|\Sigma^{-1} Y_{:,j}\|_F^2 \quad (5.21)$$
8:   return updated representation $(X, Y)$
9: end function
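To make the two regularized micro-steps concrete, the following numpy sketch performs one such sweep for a boolean sampling mask, with $\zeta_1 = \zeta_2 = 1$; the helper name msalsa_sweep and the dense mask representation are illustrative assumptions, not the reference implementation of Algorithm 9.

```python
import numpy as np

def msalsa_sweep(X, Y, M, mask, sigma_min, omega):
    """One sweep of stabilized ALS matrix completion (sketch of Algorithm 9).

    X: (n1, r), Y: (r, n2); M holds the data where mask is True; zeta_1 = zeta_2 = 1.
    """
    n1, n2 = M.shape

    # left-sided micro-step: orthogonalize Y, regularize the rows of X
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.maximum(s, sigma_min)              # lower limit on the singular values
    X, Y = X @ U @ np.diag(s), Vt             # Y now has orthonormal rows
    for i in range(n1):
        p = mask[i, :]                        # sampled columns in row i
        A = Y[:, p].T                         # |P_{i,:}| x r system matrix
        reg = (p.sum() / n2) * omega**2 * np.diag(1.0 / s**2)
        X[i, :] = np.linalg.solve(A.T @ A + reg, A.T @ M[i, p])

    # right-sided micro-step: orthogonalize X, regularize the columns of Y
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s, sigma_min)
    X, Y = U, np.diag(s) @ Vt @ Y             # X now has orthonormal columns
    for j in range(n2):
        p = mask[:, j]
        A = X[p, :]
        reg = (p.sum() / n1) * omega**2 * np.diag(1.0 / s**2)
        Y[:, j] = np.linalg.solve(A.T @ A + reg, A.T @ M[p, j])

    return X, Y
```

Each inner solve is just the normal equation of the respective regularized least squares problem in Eqs. (5.20) and (5.21).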

Since the optimization relies on the magnitude of singular values, explicit rank adaption is nearly unnecessary. Instead of an explicit rank increasing strategy, the parameter $\omega$, starting from a large value proportional to the norm of the initial iterate $A$, is slowly decreased. To limit the computational complexity, the rank still needs to be adapted, but this can be done without significantly influencing the micro-steps. The singular values are therefore divided into two types:

Definition 5.17 (Stabilized rank and minor singular values [42]). A singular value $\sigma_i$ is called stabilized if it is larger than a certain fixed fraction of $\omega$ (meaning any corresponding terms have an increased influence, cf. Eq. (5.16)). Otherwise, it is called minor (as a removal of such does not notably change the next few steps). The stabilized rank counts only the number of stabilized singular values.

The behavior of the singular values with respect to the parameter is displayed in Fig. 5.10 for the tensor train format. The rank is only modified so as to make sure that there is always a fixed, small number of minor singular values, i.e.
$$\big|\{i \mid 0 < \sigma_i < f_{\mathrm{minor}} \cdot \omega\}\big| \overset{!}{=} k_{\mathrm{minor}}, \quad (5.22)$$
for constants $f_{\mathrm{minor}} < 1$ and $k_{\mathrm{minor}} \in \mathbb{N}$. The representation is correspondingly either truncated to a lower rank, or augmented with random singular vectors and minor singular values.
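The bookkeeping behind Eq. (5.22) is simple; the following small numpy sketch (the helper name is hypothetical) determines the new rank so that exactly $k_{\mathrm{minor}}$ minor singular values remain.

```python
import numpy as np

def adapt_rank(sigma, omega, f_minor=1.0, k_minor=2):
    """New rank so that exactly k_minor singular values are minor, cf. Eq. (5.22).

    sigma: current singular values; values in (0, f_minor * omega) count as minor.
    A positive difference means the representation should be enriched (with random
    singular vectors and minor singular values), a negative one truncated.
    """
    sigma = np.asarray(sigma)
    n_stabilized = int(np.sum(sigma >= f_minor * omega))
    new_rank = n_stabilized + k_minor
    return new_rank, new_rank - len(sigma)

print(adapt_rank([0.9, 0.5, 0.04, 0.01], omega=0.1))   # -> (4, 0): already two minor values
```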

This states the basic concept of implicit rank adaption, and we will provide a more detailed discussion in the later Section 5.6.2 for the more elaborate tensor case. A validation set is used in order to determine when the algorithm has reached a good value of $\omega$.

Definition 5.18 (Validation set [42]). For a given $P$, the sampling or training set, we define $P_{\mathrm{val}} \subset P$ as validation set. This set may be chosen randomly or specifically distributed. The actual set used for the optimization is replaced by $P \leftarrow P \setminus P_{\mathrm{val}}$ (keeping the same symbol).


Algorithm 10 Salsa Algorithm [42]

Input: sampling set $P \subset \Omega$ and measurements $M|_P$
Output: representation $(X, Y)$ that is approximately the solution to $\|XY - M\|_P \to \min$

1: procedure matrixsalsa($P, M|_P$)
2:   initialize $X, Y$ s.t. $\tau_r(X, Y) \equiv \|M|_P\|_1 / m$ for $r \equiv 1$, and $\omega = \frac{1}{2}\|\tau_r(X, Y)\|_F$
3:   split off a small validation set $P_{\mathrm{val}} \subset P$   ▷ Definition 5.18
4:   for iter $= 1, 2, \ldots$ do
5:     renew lower limit $\sigma_{\min} \leftarrow f_{\sigma_{\min}} \cdot \frac{\sqrt{|\Omega|}}{\sqrt{m}}\, \|\tau_r(X, Y) - M\|_P$   ▷ $|\Omega| = n_1 n_2$
6:     $(X, Y) \leftarrow$ msalsasweep($X, Y, \sigma_{\min}, \omega$)   ▷ Algorithm 9
7:     decrease $\omega$
8:     adapt rank according to Eq. (5.22)
9:     if a stopping criterion applies then   ▷ terminates algorithm
10:      return iterate $(X, Y)$ for which $\|\tau_r(X, Y) - M\|_{P_{\mathrm{val}}}$ was lowest
11:    end if
12:  end for
13: end procedure

The stopping criteria in Algorithm 10 may depend on the behavior of Pval, or may simplybe based on a rank bound, e.g. r ≤ m/(n1 +n2). The latter criterion, however, only sufficesin the matrix case.
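A correspondingly simple driver loop could look as follows; it is only a sketch under the same assumptions as the sweep above, reusing msalsa_sweep, with a fixed decrease factor f_omega and the training residual as a proxy for the validation error (both illustrative simplifications of Algorithm 10).

```python
import numpy as np

def matrix_salsa(M, mask, n_iter=200, f_sigma_min=0.01, f_omega=0.97):
    """Outer SALSA loop (sketch of Algorithm 10); rank adaption and the
    validation split of Definition 5.18 are only indicated."""
    n1, n2 = M.shape
    m = mask.sum()
    # rank-one initialization with the mean magnitude of the data
    X = np.full((n1, 1), np.abs(M[mask]).sum() / m)
    Y = np.ones((1, n2))
    omega = 0.5 * np.linalg.norm(X @ Y)

    best, best_err = (X, Y), np.inf
    for _ in range(n_iter):
        residual = np.linalg.norm((X @ Y - M)[mask])
        sigma_min = f_sigma_min * np.sqrt(n1 * n2 / m) * residual
        X, Y = msalsa_sweep(X, Y, M, mask, sigma_min, omega)
        omega *= f_omega                       # slowly decrease omega
        # (rank adaption according to Eq. (5.22) would go here)
        if residual < best_err:                # proxy for the validation error
            best, best_err = (X.copy(), Y.copy()), residual
    return best
```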

5.4 The Close Connection between Stabilization and Reweighted $\ell_1$-Minimization

For the sake of simplicity, we have so far only considered matrix completion, whereas the overall objective of stable methods is an adaptivity of the multivariate rank in the more complex tensor setting. The derivation of the stable matrix method has hence also been subject to the practical restrictions regarding tensors, and has served as an introduction to the later sections.

In the matrix recovery setting, there is a close connection to reweighted $\ell_1$-minimization. Here one asks, as described in Section 4.3, for the recovery of a matrix $M \in \mathbb{R}^{n_1 \times n_2}$ given $m \in \mathbb{N}$ linear measurements $y = L(M) \in \mathbb{R}^m$, where $L : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ is a linear map. Previously, this map has been the sampling operator $L = (\cdot) \mapsto (\cdot)|_P$, for $m = |P|$. The recovery can be approached similarly through the minimization
$$\text{find} \quad \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \|L(A) - y\|_2, \quad \text{subject to } \mathrm{rank}(A) \leq r.$$

It is straightforward to apply the stabilization procedures in Section 5.1.1 to this more general case, which yields scaling matrices $S^{(1)}_L$ and $S^{(2)}_L$ defined through (Cholesky decompositions of the) partial traces³
$$(S^{(2)}_L)^T S^{(2)}_L := \mathrm{trace}_2(L^T L) \in \mathbb{R}^{n_1 \times n_1}, \qquad S^{(1)}_L (S^{(1)}_L)^T := \mathrm{trace}_1(L^T L) \in \mathbb{R}^{n_2 \times n_2}, \quad (5.23)$$
where $L \in \mathbb{R}^{m \times n_1 n_2}$ is the matrix that represents the linear operator $L$ (cf. Section 4.3).

³ $\mathrm{trace}_1(A_1 \otimes A_2) = \mathrm{trace}(A_1) A_2$ and $\mathrm{trace}_2(A_1 \otimes A_2) = \mathrm{trace}(A_2) A_1$ (cf. Remark 2.30 and Definition 7.3).

As a generalization of Theorem 5.12 (or a special case of the later Theorem 5.24), we obtain stabilized micro-steps also for this setting, given by
$$(\mathcal{M}^{(1)}_L)^*(X, Y) := (X^+, Y), \qquad X^+ = \operatorname*{argmin}_X \|L(XY) - y\|_2^2 + \omega^2 \zeta_2 \|S^{(2)}_L X \Sigma^{-1}\|_F^2,$$
$$(\mathcal{M}^{(2)}_L)^*(X, Y) := (X, Y^+), \qquad Y^+ = \operatorname*{argmin}_Y \|L(XY) - y\|_2^2 + \omega^2 \zeta_1 \|\Sigma^{-1} Y S^{(1)}_L\|_F^2. \quad (5.24)$$
For matrix completion, the scaling matrices $S^{(1)}_L$ and $S^{(2)}_L$ are diagonal. The minimizers of these micro-steps can be calculated efficiently, in particular when the operator $L$ exhibits some low-rank structure itself (cf. Section 4.3). We will discuss this in detail for arbitrary tree tensor networks in Section 5.5.

5.4.1 Reweighted l1-Minimization

Another approach to recover $M$ is nuclear norm minimization,
$$\text{find} \quad \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \|A\|_*, \quad \text{subject to } L(A) = y,$$
assuming (for now) that $y = L(M)$ is not subject to noise. Here, the (convex relaxation of the) rank is the target function itself, and the linear equation is the constraint. This idea has its origin in compressed sensing (an introduction can be found in [15]), in which one wants to recover a sparse vector $u \in \mathbb{R}^n$ given linear measurements $y = \Phi(u)$. Since the combinatorial problem
$$\text{find} \quad \operatorname*{argmin}_{x \in \mathbb{R}^n} \|x\|_0, \quad \text{subject to } \Phi(x) = y,$$
is in general NP-hard [78], it is relaxed to the $\ell_1$-minimization problem
$$\text{find} \quad \operatorname*{argmin}_{x \in \mathbb{R}^n} \|x\|_1, \quad \text{subject to } \Phi(x) = y.$$
If $\Phi$ fulfills a sparsity related RIP, the relaxed problem solves the original one (also under noisy data, [12]). A subsequent amendment to this approach is iterative, reweighted $\ell_1$-minimization. In particular, [16, 19] have suggested and analyzed the following Irls (iteratively reweighted least squares) algorithm, which, for suitable starting values $x$, repeats

1. update the weight matrix $W = (\mathrm{diag}(|x_1|, \ldots, |x_n|) + \varepsilon I_n)^{-1}$
2. solve
$$x = \operatorname*{argmin}_{x \in \mathbb{R}^n} \|Wx\|_1, \quad \text{subject to } \Phi(x) = y, \quad (5.25)$$

given a small parameter $\varepsilon > 0$ that may be decreased step by step during the optimization. Based on empirical results [16], this modified version appears to be strongly preferable to regular $\ell_1$-minimization. Convergence results related to the restricted isometry property (RIP) of $\Phi$ have further been established in [19].
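The constrained step (5.25) can, for instance, be realized as a linear program; the following numpy/scipy sketch is one such assumed realization, not the implementation analyzed in [16, 19].

```python
import numpy as np
from scipy.optimize import linprog

def reweighted_l1(Phi, y, n_iter=10, eps=1.0, f_eps=0.5):
    """Iteratively reweighted l1-minimization, cf. Eq. (5.25).

    Each step solves  min ||W x||_1  s.t.  Phi x = y  as an LP in (x, t):
    minimize sum(t) subject to -t <= W x <= t and Phi x = y.
    """
    m, n = Phi.shape
    x = np.linalg.lstsq(Phi, y, rcond=None)[0]          # feasible starting value
    for _ in range(n_iter):
        w = 1.0 / (np.abs(x) + eps)                     # diagonal of W
        W = np.diag(w)
        c = np.concatenate([np.zeros(n), np.ones(n)])   # objective: sum of t
        A_ub = np.block([[ W, -np.eye(n)],
                         [-W, -np.eye(n)]])
        b_ub = np.zeros(2 * n)
        A_eq = np.hstack([Phi, np.zeros((m, n))])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                      bounds=[(None, None)] * n + [(0, None)] * n)
        x = res.x[:n]
        eps *= f_eps                                    # decrease eps step by step
    return x
```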

The two articles [32, 76] on matrix recovery make use of this principal idea, and describe an iterative procedure called Irls$_p$ which in essence⁴ is given by

1. set the SVD $U \Sigma V^T = A$
2. lower $\varepsilon$ and update $W = U(\Sigma_\varepsilon)^{-1} U^T$
3. solve
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \|W^{1-p/2} A\|_F, \quad \text{subject to } L(A) = y, \quad (5.26)$$

for $p = 1$ in [32] and $0 \leq p \leq 1$ in [76]. The diagonal matrix $\Sigma_\varepsilon$ is a modified version of the singular values and may follow the same idea as in Algorithm 9, where $(\sigma_\varepsilon)_i = \max(\sigma_i, \varepsilon)$ for all possible $i$ (except that $\varepsilon$ is therein called $\sigma_{\min}$). Both works prove recovery properties, although only for the case that $p = 1$ and that $L$ fulfills a certain RIP (or, equivalently, a null-space property)⁵. A key idea to this is the fact that the above procedure can be written as alternating minimization of certain functionals. Each step hence yields a strict improvement with respect to that functional. Both articles also describe how the single steps can be performed efficiently if $L$ is a sampling operator, i.e. in the matrix completion case. A central motivation of the approach is its connection to nuclear norm minimization (for $p = 1$),
$$\|W^{1/2} A\|_F^2 = \|(U(\Sigma_\varepsilon)^{-1/2} U^T)(U \Sigma V^T)\|_F^2 \xrightarrow{\varepsilon \to 0} \|\Sigma^{1/2}\|_F^2 = \|A\|_*,$$
and rank minimization (for $p = 0$),
$$\|W A\|_F^2 = \|(U(\Sigma_\varepsilon)^{-1} U^T)(U \Sigma V^T)\|_F^2 \xrightarrow{\varepsilon \to 0} \mathrm{rank}(A).$$
In contrast to recovery guarantees, which so far only exist for $p = 1$, numerical tests (cf. [76]) suggest that the latter case, $p = 0$, yields the best working algorithm Irls$_0$.

⁴ Details differ depending on the article, as the methods are subject to computational and theoretical considerations.
⁵ The theorem can hence not be applied to matrix completion, but the algorithm seems to work well nonetheless.

The aforementioned article [32] makes use of the particular structure of matrix completion in order to lower the computational complexity of the main step Eq. (5.26) for $p = 1$. The idea is easily generalized to $p = 0$ and allows for a simple comparison. We generate a rank $r = 4$ matrix $M \in \mathbb{R}^{20 \times 20}$ as well as a sampling set $P$ with $m = 170$ points randomly. To this fixed problem setting, we apply Irls$_p$ once for $p = 1$ and once for $p = 0$, following the development of the first 6 singular values of the iterate $A$ with respect to the iteration number. We initialize $A$ randomly, and in each iteration we lower $\varepsilon$ by a fixed fraction, starting with a large enough value. The two tests are shown in Fig. 5.3. Although we show only one comparison here, other trials exhibit the same qualitative behavior, and the algorithm Irls$_0$ is consistently superior to Irls$_1$. It appears that, for $p = 1$, the amount of regularization is somehow too strong for large singular values, while for low singular values it is too weak. This in certain ways matches the analysis in Section 5.4.3, in particular regarding the fixed points visualized in Figs. 5.4 and 5.5. The functional can also be symmetrized using two weight matrices $W_U$ and $W_V$, each single one corresponding to a value $p = 1$. This idea then yields the step
$$W_U := U(\Sigma_\varepsilon)^{-1} U^T \in \mathbb{R}^{n_1 \times n_1}, \qquad W_V := V(\Sigma_\varepsilon)^{-1} V^T \in \mathbb{R}^{n_2 \times n_2},$$
$$A := \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \|W_U^{1/2}\, A\, W_V^{1/2}\|_F, \quad \text{subject to } L(A) = y. \quad (5.27)$$
Also in this case, the term converges to the rank of $A$, i.e.
$$\|W_U^{1/2}\, A\, W_V^{1/2}\|_F^2 = \|(U(\Sigma_\varepsilon)^{-1/2} U^T)(U \Sigma V^T)(V(\Sigma_\varepsilon)^{-1/2} V^T)\|_F^2 \xrightarrow{\varepsilon \to 0} \mathrm{rank}(A).$$


Figure 5.3: The magnitudes of the first 6 singular values of $A$ with respect to the iteration number (in blue), for $p = 1$ (left, Irls$_1$) and $p = 0$ (right, Irls$_0$). On each right side, the true singular values of the sought rank 4 recovery $M$ are shown (in black, dashed).

There are several possibilities to handle noise, one of which is to relax Eq. (5.26) to
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \|L(A) - y\|_2^2 + \omega^2 \|W^{1-p/2} A\|_F^2. \quad (5.28)$$
Another approach is to relax the linear equation $L(A) = y$ to an approximate one, $L(A) \approx y$, subject to a certain tolerance.
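To make the relaxed step concrete for matrix completion, the following numpy sketch performs one Irls$_0$-type iteration of Eq. (5.28) with $p = 0$, solving the regularized normal equations column by column; the column-wise splitting and the helper name are illustrative assumptions, not the formulation used in [32, 76].

```python
import numpy as np

def irls0_step(A, M, mask, omega, eps):
    """One relaxed Irls0 iteration, Eq. (5.28) with p = 0, for matrix completion.

    A is the current iterate; M holds the data on the entries where mask is True.
    The weight matrix W = U (Sigma_eps)^{-1} U^T is built from the SVD of A.
    """
    n1, n2 = A.shape
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    W = U @ np.diag(1.0 / np.maximum(s, eps)) @ U.T     # (sigma_eps)_i = max(sigma_i, eps)
    G = omega**2 * (W.T @ W)                            # regularization Gram matrix
    A_new = np.zeros_like(A)
    for j in range(n2):
        p = mask[:, j]
        # normal equations of  min ||a_p - M_{p,j}||^2 + omega^2 ||W a||^2  over a in R^{n1}
        H = np.diag(p.astype(float)) + G
        rhs = np.zeros(n1)
        rhs[p] = M[p, j]
        A_new[:, j] = np.linalg.solve(H, rhs)
    return A_new
```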

Since the weight matrices $W$ depend on the current singular vectors, the algorithm slowly exposes a suitable basis. The sparse use of this basis is then enforced through the penalty given by the corresponding inverse singular values. Some works suggest to instead penalize just the magnitude of singular values. This yields the steps
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \sum_i w_i^{1-p/2}\, \sigma_i(A), \quad \text{subject to } L(A) = y,$$
where each single weight $w_i = \sigma_i(A)^{-1}$ is the inverse of the $i$-th singular value of the current iterate $A$. Here, however, the algorithm neither learns a basis, nor does it put the singular values into relation to it, which appears to be a strong property of the above procedures.

5.4.2 SALSA as Scaled Alternating Reweighted Least Squares

Interesting from the perspective of Section 5.2 is what happens when we insert the low-rank representation $A = XY$, where the first $r$ columns of $U$ equal $X \in \mathbb{R}^{n_1 \times r}$, into Eq. (5.28) for $p = 0$. We obtain
$$Y = \operatorname*{argmin}_Y \|L(XY) - y\|_2^2 + \omega^2 \|W X Y\|_F^2 = \operatorname*{argmin}_Y \|L(XY) - y\|_2^2 + \omega^2 \|U(\Sigma_\varepsilon)^{-1} U^T X Y\|_F^2$$
$$= \operatorname*{argmin}_Y \|L(XY) - y\|_2^2 + \omega^2 \|(\Sigma_{r,\varepsilon})^{-1} Y\|_F^2,$$
where $\Sigma_{r,\varepsilon}$ is the upper-left $r \times r$ submatrix of $\Sigma_\varepsilon$. Up to scaling, this is exactly the previous, stabilized micro-step Eq. (5.24). Even the lower limit on the singular values follows the same scheme as in [32], and in both cases a slower decline of $\varepsilon$ and $\omega$ usually yields better results.


In fact, for randomized measurements, $S^{(2)}_L$ (cf. Eq. (5.23)) is close to a multiple of the identity. An analogous relation for the first micro-step is obtained with respect to the minimization task
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{n_1 \times n_2}} \|L(A) - y\|_2^2 + \omega^2 \|A W\|_F^2. \quad (5.29)$$
When optimizing on the low-rank data space $D_r$ (cf. Section 5.1.1), we cannot enforce $L(A) = y$, as otherwise one would need to have already solved the problem.

Remark 5.19 (Relation of Salsa and Irls). The prior considerations suggest that we may interpret stabilized ALS, Algorithm 9, as a scaled, relaxed, alternating Irls$_0$ method with correspondingly switching, adaptive weights as in Eqs. (5.28) and (5.29).

The alternating version may also be symmetrized (or use the same one-sided weight matrix for both micro-steps). If $A = U_r \Sigma_r V_r^T$ is a truncated SVD of $A = XY$, and we fix $X = U_r \in \mathbb{R}^{n_1 \times r}$ in Eq. (5.27), then this yields the update
$$Y = \operatorname*{argmin}_Y \|L(XY) - y\|_2^2 + \omega^2 \big\|(\Sigma_{r,\varepsilon})^{-1/2}\, Y\, \big[V_r(\Sigma_{r,\varepsilon})^{-1/2} V_r^T + \varepsilon^{-1/2}(I_{n_2} - V_r V_r^T)\big]\big\|_F^2.$$
The update of the other factor $X$, when $Y$ is fixed, follows an analogous formula. Short numerical experiments however suggest that this does not visibly change the reconstruction quality, whereas switching weights are much easier to handle and lead to a lower computational complexity, in particular if $L$ has low-rank structure. Identical weights are likely easier to analyze, though.

The algorithms Salsa and Irls$_0$ yield nearly the same quality of matrix recoveries, which is not surprising due to their similarity, and are both superior (cf. [32, 76]) to nuclear norm minimization, or Irls$_1$. The algorithm Salsa originates from a fixed-rank ALS method and is subject to a certain averaging to be stable, with the aim of being able to better adapt the rank. On the other hand, Irls$_0$ initially minimizes a relaxation of the rank and then uses adaptive weights to improve the approximation quality. So it is quite interesting that through the respective improvements, both approaches essentially coincide. Theoretical recovery guarantees are still missing for both algorithms, whereas the results for Irls$_1$, as expected, resemble those for nuclear norm minimization and do not hold for completion (point-wise sampled entries). Thus, a recovery and completion guarantee for the case $p = 0$, or at least $p < 1$, remains a highly desirable objective. Possibly, approaches to this need not be related to nuclear norm minimization, given that the (arguably rougher) derivation of Salsa, for example, is different, but the situation is yet unclear.

5.4.3 Fixed Points of Idealized Stable Alternating Least Squares for Matrices

Relaxing the condition $L(A) = y$ in a straightforward way as in Eq. (5.28) introduces a slight offset in alternating optimization. Consider, for example, matrix completion for the sampling operator $L(A) = A|_P$, without scaling matrices $S^{(1)}_L$ and $S^{(2)}_L$ (cf. Eq. (5.23)). In the situation as in Section 5.4.2, let the iterate have an SVD $A = U_r \Sigma_r V_r^T = XY$ for an orthogonal $X = U_r \in \mathbb{R}^{n_1 \times r}$. The update of each column $Y_{:,j}$ of $Y$ is then given as (cf. Algorithm 10)
$$Y_{:,j} = \operatorname*{argmin}_{Y_{:,j}} \|X Y_{:,j} - M_{:,j}\|^2_{P_{:,j}} + \omega^2 \|\Sigma_{r,\varepsilon}^{-1} Y_{:,j}\|_F^2.$$


Therefore,
$$Y_{:,j} = \big(X_{P_{:,j},:}^T X_{P_{:,j},:} + \omega^2 \Sigma_{r,\varepsilon}^{-2}\big)^{-1} X_{P_{:,j},:}^T M_{P_{:,j},j}.$$
We assume now that the sampling subsets $P_{:,j}$ are large enough such that
$$X_{P_{:,j},:}^T X_{P_{:,j},:} \approx I_r\, \frac{m_j}{n_1}, \qquad m_j := |P_{:,j}|. \quad (5.30)$$
If the sought-for matrix $M = U_r \Sigma^{(M)}_r V_r^T$ has the same singular vectors as $A$ (see Remark 5.20), and $A$ is a fixed point, then under Eq. (5.30) we obtain
$$\Sigma_r V_r^T = \Big(I_r\,\frac{m_j}{n_1} + \omega^2 \Sigma_{r,\varepsilon}^{-2}\Big)^{-1} \frac{m_j}{n_1}\, \Sigma^{(M)}_r V_r^T \quad \Leftrightarrow \quad \Sigma_r = \Big(I_r + \frac{\omega^2 n_1}{m_j}\, \Sigma_{r,\varepsilon}^{-2}\Big)^{-1} \Sigma^{(M)}_r.$$
In this case, the singular values of $A$ are a modified version of those of $M$. We observe that the strength of regularization is different for each column $Y_{:,j}$. Furthermore, the analogous update of $X$ leads to a different modification of $\sigma^{(M)}$ per row. This strongly indicates that some sort of scaling is more natural. For a general operator $L$, a scaled update to $Y$ is given as
$$Y^+ = \operatorname*{argmin}_Y \|L\,\mathrm{vec}(U_r Y) - y\|_2^2 + \omega^2 \|\Sigma_{r,\varepsilon}^{-1}\, Y\, S\|_F^2,$$
where for now $S$ is set as some to-be-determined weight matrix depending only on $L$. The solution of this minimization problem, for $W := L^T L$, is given by (here, $\otimes$ is the matrix Kronecker product)
$$\mathrm{vec}(Y) = \big[(I_{n_2} \otimes U_r)^T W (I_{n_2} \otimes U_r) + SS^T \otimes \omega^2 \Sigma_{r,\varepsilon}^{-2}\big]^{-1} (I_{n_2} \otimes U_r)^T W\, \mathrm{vec}(M).$$

We do not know anything in general about the singular vectors $U_r$. So, as we search for a constant scaling matrix, it is not unreasonable (although certainly naive) to assume that each single multiplication with an orthogonal $U$ results in the corresponding expectancy value. We denote this assumption by (∗), such that for example, for all $B \in \mathbb{R}^{k \times k}$,
$$Q_r^T B Q \overset{*}{=} \mathbb{E}_{Q \in V}\, Q_r^T B Q = \frac{1}{|V|} \int_V Q_r^T B Q \, dQ \overset{\text{Cor. 5.11}}{=} \frac{\mathrm{trace}(B)}{k} \cdot \begin{pmatrix} I_r & 0 \end{pmatrix}, \quad (5.31)$$
where $V := \{Q \in \mathbb{R}^{k \times k} \mid Q \text{ is orthogonal}\}$. The variance of this expectancy value then depends on the suitability of the matrix $B$. A first implication is

$$F := (I_{n_2} \otimes U_r)^T W (I_{n_2} \otimes U_r) + SS^T \otimes \omega^2 \Sigma_{r,\varepsilon}^{-2} \overset{*}{=} \frac{\mathrm{tr}_2(W)}{n_1} \otimes I_r + SS^T \otimes \omega^2 \Sigma_{r,\varepsilon}^{-2}. \quad (5.32)$$
These two sums only fit together if $SS^T = \frac{\mathrm{tr}_2(W)}{n_1}$, which coincides with the scaling matrix acting on $Y$ in Eq. (5.24). In that case,
$$F \overset{*}{=} \frac{\mathrm{tr}_2(W)}{n_1} \otimes \big(I_r + \omega^2 \Sigma_{r,\varepsilon}^{-2}\big).$$
If $L$ is a sampling operator with suitably sorted rows, then
$$\mathrm{tr}_2(W) = \mathrm{diag}(m_1, \ldots, m_{n_2}),$$
which is the scaling given in Eq. (5.21).
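The averaging assumption (∗) can be checked numerically. The following numpy sketch draws random orthogonal matrices (via QR factorizations of Gaussian matrices, an assumed and common way to sample them) and confirms that the empirical mean of $Q_r^T B Q$ approaches $\frac{\mathrm{trace}(B)}{k}(I_r\ 0)$ as in Eq. (5.31).

```python
import numpy as np

def mean_rotation(B, r, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[Q_r^T B Q] over orthogonal Q, cf. Eq. (5.31)."""
    rng = np.random.default_rng(seed)
    k = B.shape[0]
    acc = np.zeros((r, k))
    for _ in range(n_samples):
        Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random orthogonal sample (assumed)
        acc += Q[:, :r].T @ B @ Q
    return acc / n_samples

B = np.random.default_rng(1).standard_normal((6, 6))
est = mean_rotation(B, r=3)
ref = np.trace(B) / 6 * np.hstack([np.eye(3), np.zeros((3, 3))])
print(np.linalg.norm(est - ref))   # small, and shrinking with more samples
```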

For orthogonal complements $U_\perp \in \mathbb{R}^{n_1 \times (n_1 - r)}$ and $V_\perp \in \mathbb{R}^{n_2 \times (n_2 - r)}$, $U_r^T U_\perp = 0$, $V_r^T V_\perp = 0$, let $K$ be such that
$$M = \begin{pmatrix} U_r & U_\perp \end{pmatrix} \begin{pmatrix} K_r & K_{12} \\ K_{21} & K_{22} \end{pmatrix} \begin{pmatrix} V_r & V_\perp \end{pmatrix}^T.$$
Assume further that $A = U_r \Sigma_r V_r^T = U_r Y$ is a fixed point of the micro-step $f_{(\mathcal{M}^{(2)}_L)^*}$ (as defined through Eq. (5.24) and Definition 5.1, for $\varepsilon$-perturbed singular value weights). Then
$$\mathrm{vec}(\Sigma_r V_r^T) = \mathrm{vec}(Y) = F^{-1}(I_{n_2} \otimes U_r)^T W (I_{n_2} \otimes U)(V \otimes I_{n_1})\,\mathrm{vec}(K)$$
$$\overset{*}{=} F^{-1}\Big(\frac{\mathrm{tr}_2(W)}{n_1} \otimes \begin{pmatrix} I_r & 0\end{pmatrix}\Big)(V \otimes I_{n_1})\,\mathrm{vec}(K)$$
$$\overset{*}{=} \Big(I_{n_2} \otimes \big((I_r + \omega^2 \Sigma_{r,\varepsilon}^{-2})^{-1}\ \ 0\big)\Big)(V \otimes I_{n_1})\,\mathrm{vec}(K)$$
$$= \Big(V \otimes \big((I_r + \omega^2 \Sigma_{r,\varepsilon}^{-2})^{-1}\ \ 0\big)\Big)\,\mathrm{vec}(K),$$
which is equivalent to
$$(I_r + \omega^2 \Sigma_{r,\varepsilon}^{-2})^{-1}\begin{pmatrix} K_r & K_{12} \end{pmatrix} V^T = \Sigma_r V_r^T.$$
It follows that $K_r = (I_r + \omega^2 \Sigma_{r,\varepsilon}^{-2})\,\Sigma_r$ and $K_{12} = 0$.

Analogously, assuming $A$ is a fixed point of $f_{(\mathcal{M}^{(1)}_L)^*}$, we can conclude $K_{21} = 0$. Note that we only obtain the same result for $K_r$ for both micro-steps due to the scaling matrices $S^{(1)}_L$ and $S^{(2)}_L$. As mentioned before, without such, we would apply different magnitudes of regularization to the different components of $Y$ (and $X$, respectively), leading to different results for $K_r$. The two micro-steps would in that sense not conform to each other, even in the setting of this idealized discussion.

Due to the above, $K_r$ is a diagonal matrix with entries given by the singular values $\sigma^{(M)}$ of $M$, and the columns of $U_r$, $V_r$ are corresponding singular vectors of $M$ as well. The fixed point $A$ hence contains modified versions of the singular values of $M$. We summarize this discussion in the following remark.

Remark 5.20 (Idealized fixed points). Let $A \in \mathbb{R}^{n_1 \times n_2}$ be a fixed point of both $f_{(\mathcal{M}^{(1)}_L)^*}$ and $f_{(\mathcal{M}^{(2)}_L)^*}$ (for $\varepsilon = 0$), as well as
$$M = U^{(M)}\, \mathrm{diag}(\sigma^{(M)})\, (V^{(M)})^T.$$
Then, under the (naive) assumption (∗), it follows that
$$A = U^{(M)}\, \mathrm{diag}(\sigma)\, (V^{(M)})^T, \qquad \sigma_i \in \big\{0,\ (1 + \omega^2 \sigma_i^{-2})^{-1} \sigma^{(M)}_i\big\},$$
for $i = 1, \ldots, \min(n_1, n_2)$.

Certainly, the assumption (∗) does not hold as an equality⁶ for realistic measurement operators $L$, except if in fact $L = I_{n_1 n_2}$. We end this simplified discussion with the remark that even if (∗) only holds approximately, the matrix $F$, defined in Eq. (5.32), may still be very similar to the one obtained under the strict assumption (∗), due to the regularizing term $\Sigma_{r,\varepsilon}^{-2}$.

⁶ It does however approximately hold true if the operator $L$ admits a low RIP constant, or, for matrix completion, if the sampling set is large. Likewise, the variance regarding the expectancy value in Eq. (5.31) becomes small.

The purpose of the previous discussion is to reveal a reason for the scaling matrices $S^{(1)}_L$ and $S^{(2)}_L$. It also shows that the fixed points of the map $\sigma \mapsto (1 + \omega^2 \sigma^{-2})^{-1} \sigma^{(M)}$ and the lower bound $(1 + \omega^2 \varepsilon^{-2})^{-1} \sigma^{(M)}$ are of interest. For each pair $(\sigma^{(M)}, \omega^2)$, the only attractive fixed point (if existent) is given by $f_{\mathrm{stab}} = \frac{1}{2}\sigma^{(M)} + \frac{1}{2}\sqrt{(\sigma^{(M)})^2 - 4\omega^2}$, and the repelling one by $f_{\mathrm{rep}} = \frac{1}{2}\sigma^{(M)} - \frac{1}{2}\sqrt{(\sigma^{(M)})^2 - 4\omega^2}$. Continuously complementing this map at $0$ with the function value $\sigma = 0$ reveals a third (attractive) fixed point. At the point where $f_{\mathrm{stab}} = f_{\mathrm{rep}}$, it holds $\sigma = \omega = \frac{1}{2}\sigma^{(M)}$ (see Fig. 5.4).

Figure 5.4: ([42]) Left: Plotted are the fixed points (continuous for attractive, dashed for repelling ones, in teal) of $\sigma \mapsto (1 + \omega^2\sigma^{-2})^{-1}\sigma^{(M)}$ for one fixed $\omega$ with respect to $\sigma^{(M)}$. Within the hatched area, singular values rise until they reach the upper boundary. The lower bound $(1 + \omega^2\varepsilon^{-2})^{-1}\sigma^{(M)}$ is indicated as a dotted, magenta line. Right: Different values of $\omega$ are considered. The turning points are given by $\sigma = \omega = \frac{1}{2}\sigma^{(M)}$.

When we apply the same analysis for the case $p = 1$ instead of $p = 0$, we obtain the same fixed point scheme, but for the map
$$\sigma \mapsto (1 + \omega\sigma^{-1})^{-1}\sigma^{(M)}.$$
The attractive fixed point of this map is $\sigma = \sigma^{(M)} - \omega$ for $\sigma^{(M)} \geq \omega$, and $\sigma = 0$ otherwise, as shown in Fig. 5.5. A lower limit $\varepsilon$ is hence not strictly necessary. Further, the fixed point, as a map depending on $\sigma^{(M)}$, is exactly the soft-thresholding operation with respect to the parameter $\omega$.

Figure 5.5: Left: Plotted is the attractive fixed point (in teal) of $\sigma \mapsto (1 + \omega\sigma^{-1})^{-1}\sigma^{(M)}$ for one fixed $\omega$ with respect to $\sigma^{(M)}$. Within the hatched area, singular values are increased by application of the map. Right: Different values of $\omega$ (equal to those in Fig. 5.4) are considered.
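The two fixed-point maps can also be explored numerically; the following small Python sketch evaluates $f_{\mathrm{stab}}$ and $f_{\mathrm{rep}}$ for $p = 0$ and the soft-thresholding fixed point for $p = 1$ (the function names are illustrative).

```python
import numpy as np

def fixed_points_p0(sigma_M, omega):
    """Attractive/repelling fixed points of sigma -> (1 + omega^2 sigma^-2)^-1 sigma_M."""
    disc = sigma_M**2 - 4 * omega**2
    if disc < 0:
        return 0.0, None, None              # only the fixed point at zero remains
    root = np.sqrt(disc)
    return 0.0, 0.5 * sigma_M - 0.5 * root, 0.5 * sigma_M + 0.5 * root

def fixed_point_p1(sigma_M, omega):
    """Fixed point of sigma -> (1 + omega sigma^-1)^-1 sigma_M: soft thresholding."""
    return max(sigma_M - omega, 0.0)

# quick check by iterating the p = 0 map
sigma_M, omega, sigma = 1.0, 0.3, 0.9
for _ in range(100):
    sigma = sigma_M / (1 + omega**2 / sigma**2)
print(sigma, fixed_points_p0(sigma_M, omega)[2])   # both close to f_stab
```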


5.5 Stable Alternating Least Squares Tensor Recovery and Completion

As laid out in Section 4.5, tensor recovery asks to find a tensor $M \in \mathbb{R}^{n_1 \times \ldots \times n_d}$ given only a set of linear measurements $y = L(M) \in \mathbb{R}^m$ (in the noise-free case), under low-rank assumptions associated to a family $K$. Its least squares formulation is as follows (cf. Eq. (4.20)):
$$\text{find} \quad \operatorname*{argmin}_{N \in \mathbb{R}^{n_1 \times \ldots \times n_d}} \|L(N) - y\|_F \quad \text{subject to } \mathrm{rank}_J(N) \leq r^{(J)},\ J \in K.$$
The nuclear norm of a tensor is NP-hard to calculate, such that a directly related approach is problematic. Instead, as we are faced with a set of matricizations, the basic idea can be transferred to this setting using multiple weight matrices. Again, constraint and objective change places:
$$\text{find} \quad \operatorname*{argmin}_{N \in \mathbb{R}^{n_1 \times \ldots \times n_d}} \sum_{J \in K} \|N^{(J)}\|_*, \quad \text{subject to } L(N) = y.$$
Accordingly, the reweighted least squares version, for $0 \leq p \leq 1$, takes the following form, and is subject to the same considerations as in Section 5.4.1:
$$N^+ = \operatorname*{argmin}_{N \in \mathbb{R}^{n_1 \times \ldots \times n_d}} \sum_{J \in K} \|W_J^{1-p/2}\, N^{(J)}\|_F^2, \quad \text{subject to } L(N) = y. \quad (5.33)$$
The weight matrices $W_J := U_J (\Sigma^{(J)})^{-1} U_J^T \in \mathbb{R}^{n_J \times n_J}$ are defined through the SVDs $U_J \Sigma^{(J)} V_J^T$ of the different matricizations $N^{(J)}$ (cf. Eq. (5.26)) of the prior iterate $N$.

A vanilla version of this approach is given in Algorithm 11 (which utilizes tensor node arithmetic). As a short numerical demonstration, we consider the completion of a small fourth order tensor $M \in \mathbb{R}^{5\times5\times5\times5}$ for the (not hierarchical) family
$$K = \{\{1\}, \ldots, \{4\}, \{1,2\}, \{1,3\}, \ldots, \{3,4\}\} = \{S \subset \{1, \ldots, 4\} \mid |S| \leq 2\}$$
and the value $p = 0$. The tensor $M$ is initialized randomly, but its singular values are iteratively fit until (approximately) $\mathrm{sv}(M^{(J)}) = (1/4, 1/16, 1/32, 1/64, 1/128, 0, \ldots)$ for all $J \in K$, using a method analogous to Algorithm 14 (proving that these singular values are feasible). The sampling set is likewise chosen randomly of size $m = |P| = 300$, such that slightly more than half of the entries of $M$ are missing. Similar to the matrix Irls$_0$ experiment, Fig. 5.3, we follow the development of the first few singular values of each of the two unfoldings $N^{(\{1\})}$ and $N^{(\{1,2\})}$ with respect to the iteration number.

One can observe the typical behavior in Fig. 5.6, in which the exponentially decaying singular values are approached one by one, each one once $\varepsilon$ is low enough (cf. Fig. 5.10). The slow decay of this parameter is essential to obtain a good approximation. Both a too small initialization and a too rapid decay will usually lead to worse results, as one can quickly observe.

This experiment importantly shows that reweighted least squares is in principle generalizable to tensors with respect to families of multilinear ranks, and, despite being slow, it works very well. We refrain from plotting results for the same experiment using $p = 1$, since the algorithm basically fails. The relaxation of the linear equality $L(N) = y$, on the other hand, only slightly reduces the approximation quality, and the corresponding plots look quite alike. While it is in principle easy to calculate the update $N^+$ as in the previous experiment (cf. Fig. 5.6), it becomes increasingly costly with larger mode sizes $n$ and dimension $d$.


Algorithm 11 Tensor iterative reweighted least squares (Irls)

Input: (not necessarily hierarchical) family $K$, measurements $y = y(\zeta)$ and tensor $L = L(\zeta, \alpha)$ representing $L$, initial guess $N = N(\alpha) \in \mathbb{R}^{n_1 \times \ldots \times n_d}$ (and parameter $p$, starting value $\varepsilon > 0$, ratio $f_\varepsilon < 1$)
Output: approximate recovery $N$ of $M$, $y = L \circ M$

1: procedure TIRLS($K, y, L, N$)
2:   calculate the kernel representation $\mathcal{K} = \mathcal{K}(\alpha, \tau)$ of $L$   ▷ $L \circ z = 0 \Leftrightarrow \exists x : z = \mathcal{K} \circ x$
3:   calculate one (unconstrained) solution $N_0 = N_0(\alpha)$, $y = L \circ N_0$
4:   for iter $= 1, 2, \ldots$ do
5:     for $J \in K$ do
6:       do an SVD of $N$ with respect to $\alpha_J$: $(U_J, \sigma^{(J)}, V_J) \leftarrow \mathrm{SVD}_{\alpha_J}(N)$, $\sigma^{(J)} = \sigma^{(J)}(\gamma)$
7:       perturb $(\sigma_\varepsilon)_j \leftarrow \max(\varepsilon, \sigma_j)$ for all $j = 1, \ldots, \min(n_J, n_{D\setminus J})$   ▷ or similar
8:       set $W_J \leftarrow U_J \circ_\gamma \mathrm{diag}_\gamma\big((\Sigma^{(J)})^{p/2-1}\big) \circ_\gamma U_J$
9:     end for
10:    set $H \leftarrow \sum_{J \in K} \mathcal{K} \circ_\alpha W_J \circ_\alpha \mathcal{K}$   ▷ $H = \mathcal{K} \circ_\alpha \big(\sum_{J\in K} W_J \circ_\emptyset I_{\alpha_{D\setminus J}}\big) \circ_\alpha \mathcal{K}$
11:    set $b \leftarrow -\sum_{J \in K} \mathcal{K} \circ_\alpha W_J \circ_\alpha N_0$   ▷ $b = -\mathcal{K} \circ_\alpha \big(\sum_{J\in K} W_J \circ_\emptyset I_{\alpha_{D\setminus J}}\big) \circ_\alpha N_0$
12:    solve $H \circ x = b$, $x = x(\tau)$   ▷ $H = H(\tau, \tau)$, $b = b(\tau)$
13:    update $N \leftarrow N_0 + \mathcal{K} \circ x$   ▷ still $L \circ N = y$
14:    lower $\varepsilon \leftarrow f_\varepsilon \cdot \varepsilon$
15:    if termination criterion is met then
16:      return $N$   ▷ if successful $N \approx M$, $y = L \circ M$
17:    end if
18:  end for
19: end procedure

Figure 5.6: The magnitudes of the 5 singular values of $N^{(\{1\})}$ (left) and the first 7 singular values of $N^{(\{1,2\})}$ (right) with respect to the iteration number (in blue), for $p = 0$. On each right side, the true singular values of the sought rank 5 recovery $M$ are shown (in black, dashed).

As we have seen, hierarchical families $K$ of matricizations correspond to tree tensor networks (cf. Section 3.3.1). In the following, we denote the associated graph with $G = (V, E)$ and its legs with $L$ (cf. Definition 3.2). We can then follow the same approach as for matrices and optimize on a low-rank representation. We now interpret $N$ as tensor node $N = N(\alpha_1, \ldots, \alpha_d)$ (as in Algorithm 11):
$$N = \tau_r(\mathbf{N}) := \bigcirc_{v \in V} N_v, \qquad \mathbf{N} = \{N_v\}_{v \in V} \in D_r, \quad (5.34)$$
where $D_r$ is the data space (cf. Remark 3.3, analogous to the one in Section 5.1.1), for a multilinear rank $r \in \mathbb{N}^K$ of $N$. The concept of stability, Definition 5.1, thereby extends to this setting. As in Chapter 4, we usually omit the representation map $\tau_r$ and use the node $N$ and network $\mathbf{N}$ synonymously.

As then $N$ is required to truly have low rank, it is, as before, not possible⁷ to meet the equation $L(N) = y$, such that this condition is again relaxed. A single micro-step $\mathcal{M}^{(v)}_L$ updates one node $N_v$, $v \in V$, while the other ones remain fixed (cf. Eq. (4.11)). Analogous to Section 4.5, the objective function in each step becomes
$$N_v^+ = \operatorname*{argmin}_{N_v} \|L(N) - y\|^2 + c\,\omega^2 \sum_{J \in K} \big\|W_J^{1-p/2}\, (\mathbf{N}_{\neq v} \circ N_v)^{(\alpha_J)}\big\|_F^2. \quad (5.35)$$
We have here introduced a redundant constant $c = \frac{m}{n_1 \cdots n_d}$ to compensate for the different magnitudes of $m$ and the domain of $N$ (cf. Lemma 6.5), which allows an easier comparison with subsequent considerations. Yet even in this case, the computational complexity per micro-step will grow at least linearly in $d$. To obtain a stable micro-step, we will see that it suffices to restrict $K$ to all sets that correspond to the neighboring singular values of $N_v$.

⁷ In the sense that this would already finish the recovery.

5.5.1 Restriction to Neighboring Singular Values and the Tensor Variational Residual Function

We introduce a suitable restriction of the family $K$ as well as its relation to the variational residual function for tensors (analogous to Definition 5.7).

Let $c \in V$ be (a) fixed (root) throughout this section. For any subset of nodes $\tilde V \subset V$, we continue to use the short notation
$$\mathbf{N}_{\tilde V} := \bigcirc_{v \in \tilde V} N_v, \qquad \mathbf{L}_{\tilde V} := \bigcirc^\zeta_{v \in \tilde V} L_v,$$
where hence $\mathbf{N}_V = N$ and $\mathbf{L}_V = L$. For $H \subset \mathrm{neighbor}(c)$, we define
$$\mathrm{branch}_c(H) := \bigcup_{h \in H} \mathrm{branch}_c(h),$$
$$J_H := \bigcup_{h \in H} J_{c,h} = \{j \in \{1, \ldots, d\} \mid \alpha_j \in \alpha \cap m(\mathbf{N}_{\mathrm{branch}_c(H)})\},$$
$$H^c := \mathrm{neighbor}(c) \setminus H.$$
Note that hence $\alpha_{J_H} = \bigcup_{h \in H} b_c(h)$ (cf. Section 3.3.1). We define a restricted set of matricizations $K^{(c)}$ as
$$K^{(c)} = \{J_{c,h} \mid h \in \mathrm{neighbor}(c)\} \subset \{J,\ D \setminus J \mid J \in K\}.$$
This set is chosen in such a way that the tuples of singular values of $N$, or edges in $E$, corresponding to $K^{(c)}$, given by
$$\sigma_{c,h} = \mathrm{sv}\big(N^{(\alpha_{J_{c,h}})}\big), \quad h \in \mathrm{neighbor}(c),$$
are those neighboring $N_c$. The edge $e = \{c, h\} \in E$ is also written as $e = e_{J_{c,h}}$ (cf. Section 3.3.1), and the singular values as $\sigma_{e_{J_{c,h}}} = \sigma^{(J_{c,h})}$.

Section 3.3.1) and the singular values as σeJc,h = σ(Jc,h).

Let the network N be orthogonal with respect to c, and let each Nbranchc(w) = UJc,w form

the left singular vectors of N (bc(w)) = N (αJc,w ), w ∈ neighbor(c). For a tree SVD N σ ofN = τr(N) with hypergraph G = (V, E), this is the case if

Nc = w∈neighbor(c)Σc,w Ncand (thereby necessarily)

Nbranchc(w) = ω∈branchc(w)⊂V N σω . (5.36)

This is likewise the case if N = normal(N, c). Now, due to orthogonality constraints,Eq. (5.35) simplifies to

N+c = argmin

Nc

‖L(N)− y‖2 + cω2∑

w∈neighbor(c)

‖Σp/2−1c,w Nc‖2F . (5.37)

The singular values σc,w are possibly modified by some small value ε > 0 beforehand, butwe do not denote this further for better readability.

The weighted regularizer in Eq. (5.37) certainly provides a good model. It can quite easily and efficiently be evaluated and is particularly simple. However, as for the matrix case, we will see that there are certain arguments that call for further scaling of the penalty terms. Independent of such, the unscaled model Eq. (5.37) and the to-be-derived, scaled version as in Theorem 5.24 yield stable micro-steps and updates for $N_c$ (for a lower limit $\varepsilon = 0$, cf. [42]).

The tensor version of the variational residual function as in Definition 5.7 is as follows. In [42], the tensor train version can be found.

Definition 5.21 (Variational residual function for tensors). Let $\omega \geq 0$, let $M$ be the target tensor, let $y$ be given measurements $y = L(M)$ and let $\mathbf{N}$ be the current iterate. We define the variational residual function in node $c$ as $C^{(c)} := C^{(c)}_{L,y,\mathbf{N}}$ by
$$C^{(c)}(N_c) := \int_{\mathcal{V}^{(c)}_\omega(\mathbf{N})} \Big\|L\Big(\bigcirc_{w \in \mathrm{neighbor}(c)} \big(\mathbf{N}_{\mathrm{branch}_c(w)} + s^{(c)}_w \Delta X_w\big) \circ N_c\Big) - y\Big\|^2 \, d\Delta X, \quad (5.38)$$
$$\mathcal{V}^{(c)}_\omega(\mathbf{N}) := \Big\{\Delta X = \{\Delta X_w\}_{w \in \mathrm{neighbor}(c)} \;\Big|\; \sum_{w \in \mathrm{neighbor}(c)} \big\|\Delta X_w \circ \mathbf{N}_{V \setminus \mathrm{branch}_c(w)}\big\|_F^2 \leq \omega^2 \Big\},$$
where $m(\Delta X_w) = m(\mathbf{N}_{\mathrm{branch}_c(w)})$ and $s^{(c)}_w > 0$ for each $w \in \mathrm{neighbor}(c)$. Note that $\mathcal{V}^{(c)}_\omega(\mathbf{N})$ does not depend on the unknown $N_c$, but on the previous iterate $\mathbf{N}$.

This function can be motivated following the same ideas as in the matrix case. The modified micro-steps are analogously defined by
$$(\mathcal{M}^{(c)}_L)^*(\mathbf{N}) := \{N_w\}_{w \neq c} \cup \{N_c^+\}, \qquad N_c^+ := \operatorname*{argmin}_{N_c} C^{(c)}_{L,y,\mathbf{N}}(N_c). \quad (5.39)$$
These micro-steps are discussed in detail in [42] for the tensor train format. It is also shown that they are representation independent, and the proof is easily generalized. It is likewise possible to calculate the minimizer, yet already for the tensor train format this is very technical. We will therefore not do so here, but instead motivate its generalization in the same way as we analyzed the scaling matrices for the matrix case (cf. Section 5.4.3).


5.5.2 About the Minimizer of the Tensor Variational Residual Function

The following lemma is a central tool in the analysis of the scaling factors (and also in the omitted derivation of the minimizer of the variational residual function). It is a generalization of the earlier matrix version, Lemma 5.10. For $k = 2$, this result is also stated in [42].

Lemma 5.22. Let $H = H(\delta, \delta, \tau) \in \mathbb{R}^{d(\delta) \times d(\delta) \times d(\tau)}$, for some $d(\delta), d(\tau) \in \mathbb{N}$ and $\delta = \{\delta_1, \ldots, \delta_k\}$, $k \in \mathbb{N}$. Further, let $\beta = \{\beta_1, \ldots, \beta_k\}$ and
$$V_i := \big\{X_i = X_i(\delta_i, \beta_i) \in \mathbb{R}^{d(\delta_i) \times d(\beta_i)} \mid \|X_i\|_F = \omega_i\big\}, \qquad \omega_i > 0,$$
for $i = 1, \ldots, k$. Then
$$\int_{V_1}\!\!\cdots\!\int_{V_k} \Big(\bigcirc_{i=1}^k X_i\Big) \circ H \circ \Big(\bigcirc_{i=1}^k X_i\Big)\, dX_1 \cdots dX_k = \prod_{i=1}^k \frac{\omega_i^2\, |V_i|}{d(\delta_i)\, d(\beta_i)} \cdot \mathrm{trace}_\delta(H) \circ I_\beta.$$

Proof. We may write $H$ as a sum of $t \in \mathbb{N}$ rank-one terms
$$H = \sum_{\ell=1}^t h_{1,\ell} \circ \ldots \circ h_{k,\ell} \circ h_{k+1,\ell}, \qquad h_{i,\ell} = h_{i,\ell}(\delta_i, \delta_i),\ i = 1, \ldots, k, \qquad h_{k+1,\ell} = h_{k+1,\ell}(\tau),$$
for $\ell = 1, \ldots, t$. Through multilinearity, the integral simplifies to
$$\sum_{\ell=1}^t \Big(\bigcirc_{i=1}^k \int_{V_i} X_i \circ_{\delta_i} h_{i,\ell} \circ_{\delta_i} X_i \, dX_i\Big) \circ h_{k+1,\ell} \overset{\text{Lemma 5.10}}{=} \sum_{\ell=1}^t \Big(\bigcirc_{i=1}^k \frac{\omega_i^2 |V_i|}{d(\delta_i) d(\beta_i)} \cdot \mathrm{trace}_{\delta_i}(h_{i,\ell}) \circ I_{\beta_i}\Big) \circ h_{k+1,\ell}$$
$$= \Big(\prod_{i=1}^k \frac{\omega_i^2 |V_i|}{d(\delta_i) d(\beta_i)}\Big) \cdot \Big(\sum_{\ell=1}^t \Big(\bigcirc_{i=1}^k \mathrm{trace}_{\delta_i}(h_{i,\ell})\Big) \circ h_{k+1,\ell}\Big) \circ I_\beta.$$
Due to the multilinearity of the trace, the middle term simplifies according to
$$\sum_{\ell=1}^t \Big(\bigcirc_{i=1}^k \mathrm{trace}_{\delta_i}(h_{i,\ell})\Big) \circ h_{k+1,\ell} = \sum_{\ell=1}^t \mathrm{trace}_\delta\Big(\bigcirc_{i=1}^k h_{i,\ell} \circ h_{k+1,\ell}\Big) = \mathrm{trace}_\delta(H).$$
This yields the right-hand side above.

Corollary 5.23. In the situation of Lemma 5.22, let instead
$$V_i = \big\{X_i = X_i(\delta_i, \beta_i) \in \mathbb{R}^{d(\delta_i) \times d(\beta_i)} \mid X_i \text{ is } \beta_i\text{-orthogonal}\big\}.$$
Then
$$\frac{1}{|V_1|}\int_{V_1}\!\!\cdots\frac{1}{|V_k|}\int_{V_k} \Big(\bigcirc_{i=1}^k X_i\Big) \circ H \circ \Big(\bigcirc_{i=1}^k X_i\Big)\, dX_1 \cdots dX_k = \frac{\mathrm{trace}_\delta(H)}{d(\delta)} \circ I_\beta.$$

Proof. The proof is analogous to the one of Lemma 5.22 and relies on Corollary 5.11.

Based on Lemma 5.22 for $k = 2$, [42] derives the minimizer of the variational residual function, Definition 5.21, for the tensor train format. As mentioned before, due to the technicalities involved, we do not explicitly repeat the proof for arbitrary tree tensor networks, but directly generalize the result in [42].


Theorem 5.24. Let the tree tensor network $\mathbf{N}$ be orthogonal with respect to $c$, and let each $\mathbf{N}_{\mathrm{branch}_c(h)}$ form the left singular vectors of $N^{(\alpha_{J_{c,h}})}$, $h \in \mathrm{neighbor}(c)$, as in Eq. (5.36). Further, let $N_c^+ := \operatorname{argmin}_{N_c} C^{(c)}_{L,y,\mathbf{N}}(N_c)$ as in Eq. (5.39). Then the minimizer equals
$$N_c^+ = \operatorname*{argmin}_{N_c} \|L(N) - y\|^2 + \sum_{\emptyset \neq H \subseteq \mathrm{neighbor}(c)} \frac{\omega^{2|H|} \zeta_H}{c_H}\, \Big\|L \circ \bigcirc_{h \in H^c} \mathbf{N}_{\mathrm{branch}_c(h)} \circ N_c \circ \bigcirc_{h \in H} \Sigma^{p/2-1}_{c,h}\Big\|_F^2,$$
for $c_H = d(\alpha_{J_H})$ and certain constants $\zeta_H > 0$⁸.

Proof. The theorem is a generalization of the tensor train case as in [42].

⁸ These constants are not related to the mode label $\zeta$.

The constants $\zeta_H$ are analogous to those in the matrix case (cf. Remark 5.13) and the tensor train case [42], and also depend on the choosable constants $s^{(c)}_w$, $w \in \mathrm{neighbor}(c)$, as in Definition 5.21. Up to machine precision (for moderately large mode sizes), these constants fulfill $\zeta_H = \prod_{h \in H} \zeta_h$. In the following, we enforce $1 = \zeta_h = \zeta_w$ for all $h, w$, and hence essentially ignore these values (by choosing the right constants $s^{(c)}_w$).

For $Z_{\mathrm{branch}_c(H^c)} := L \circ \bigcirc_{h \in H^c} \mathbf{N}_{\mathrm{branch}_c(h)}$ (analogous to Eq. (4.26)), we define
$$K_{\mathrm{branch}_c(H^c)} = K_{\mathrm{branch}_c(H^c)}(\eta, \eta) := Z_{\mathrm{branch}_c(H^c)} \circ_{\zeta \cup \alpha_{J_H}} Z_{\mathrm{branch}_c(H^c)}, \quad (5.40)$$
$$\eta = \bigcup_{h \in H^c} m(c, h) \cup m(c) = m(Z_{\mathrm{branch}_c(H^c)}) \cap m(N_c). \quad (5.41)$$
Due to the properties of the partial trace (cf. Remark 2.30), we have
$$K_{\mathrm{branch}_c(H^c)} = \mathrm{trace}_{\alpha_{J_H}}\big(Z_{\mathrm{branch}_c(H^c)} \circ_\zeta Z_{\mathrm{branch}_c(H^c)}\big) = \mathbf{N}_{\mathrm{branch}_c(H^c)} \circ_{\alpha_{J_{H^c}}} \mathrm{trace}_{\alpha_{J_H}}(L \circ_\zeta L) \circ_{\alpha_{J_{H^c}}} \mathbf{N}_{\mathrm{branch}_c(H^c)}. \quad (5.42)$$
These identities are merely subject to different interpretations and partitionings of the underlying network:

Figure 5.7: Network corresponding to one summand for $H \subset \mathrm{neighbor}(c)$ in Eq. (5.44) (cf. Eq. (5.41)) for a node $c = v^{(\mathrm{outer})}_j \in V^{(\mathrm{outer})}$, i.e. $\alpha_j = m(N_c) \cap \alpha = m(c)$.

Using the normal equation, the update $N_c^+$ can explicitly be stated as
$$W^\sigma_{\neq c} \circ N_c^+ = Z_{\neq c} \circ_\zeta y, \quad (5.43)$$
$$W^\sigma_{\neq c} := \sum_{H \subseteq \mathrm{neighbor}(c)} \frac{\omega^{2|H|}}{c_H}\, K_{\mathrm{branch}_c(H^c)} \circ \bigcirc_{h \in H} \Sigma^{p-2}_{c,h}. \quad (5.44)$$


Assuming $L$ (that is, $\mathbf{L}$) exhibits a low-rank structure itself (as in Section 4.5), it can be decomposed into a network $\mathbf{L} = \{L_v\}_{v \in V}$, denoting
$$L = \bigcirc^\zeta_{w \in V} L_w, \qquad \mathbf{L}_{\neq c} := \bigcirc^\zeta_{w \in V \setminus \{c\}} L_w.$$
Thereby, the node $W^\sigma_{\neq c}$ can be decomposed into a concordant product. It is again more convenient to use renamed nodes $L'$ and $N'$ as in Eq. (4.21). Let

convenient to use renamed nodes L′ and N ′ as in Eq. (4.21). Let

Tv(outer)j

:= traceαj (L′v(outer)j

ζ Lv(outer)j

), j = 1, . . . , d,

Tw := L′w ζ Lw, w /∈ V (outer), (5.45)

for V (outer) given by Definition 4.1, as well as

B(LN)

c,h := ζw∈branchc(h) S(LN)

w ,

S(LN)

w := Lw Nw.

as in Section 4.5.1. Recall that by Notation 2.39, we have

Wσ6=c′ = Wσ

6=c′(γ′,γ) := Wσ

6=c((γ,γ) 7→ (γ′,γ)),

γ = m(Nc). Other nodes with duplicate mode names have their first instance renamedanalogously.

Proposition 5.25 (Product representation). We have that
$$W^{\sigma\,\prime}_{\neq c} = \bigcirc^\zeta_{h \in \mathrm{neighbor}(c)} \Big(B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h} + \bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w \circ_\emptyset \frac{\omega^2}{c_h} \big(\Sigma^{p-2}_{c,h}\big)'\Big) \circ \big(L'_c \circ_\zeta L_c\big)$$
for $c \in V$.

The multiplication with $W^\sigma_{\neq c}$ does hence not necessarily involve $2^{|\mathrm{neighbor}(c)|}$ summands, but only $|\mathrm{neighbor}(c)|$ factors. The necessity to evaluate the product with $B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h}$ however increases the computational complexity compared to Section 4.5.1 (where it is most efficient to first multiply with all components in $Z_{\neq c}$). Note that $m(T_w) \cap m(\Sigma_{c,h}) = \emptyset$ for all $w, h \in V$, so the multiplication of such is essentially the ordinary tensor product.

Proof. The whole product $W^{\sigma\,\prime}_{\neq c}$ consists of $2^{|\mathrm{neighbor}(c)|}$ summands. Each single one corresponds to a set $H \subset \mathrm{neighbor}(c)$, and is given by
$$\frac{\omega^{2|H|}}{c_H}\, K'_{\mathrm{branch}_c(H^c)} \circ \bigcirc_{h \in H} \big(\Sigma^{p-2}_{c,h}\big)'.$$
The product on the right-hand side of the to-be-shown identity fulfills
$$\bigcirc^\zeta_{h \in \mathrm{neighbor}(c)} \Big(B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h} + \bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w \circ \frac{\omega^2}{c_h}\big(\Sigma^{p-2}_{c,h}\big)'\Big)$$
$$= \sum_{H \subset \mathrm{neighbor}(c)} \frac{\omega^{2|H|}}{c_H}\, \bigcirc^\zeta_{h \in H^c}\big(B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h}\big) \circ \bigcirc^\zeta_{h \in H}\Big(\bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w \circ \big(\Sigma^{p-2}_{c,h}\big)'\Big).$$
For
$$\widetilde{K}'_{\mathrm{branch}_c(H^c)} := \bigcirc^\zeta_{h \in H^c}\big(B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h}\big) \circ \bigcirc^\zeta_{h \in H}\Big(\bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w\Big) \circ \big(L'_c \circ_\zeta L_c\big),$$
we hence have to show that
$$\widetilde{K}'_{\mathrm{branch}_c(H^c)} \overset{!}{=} K'_{\mathrm{branch}_c(H^c)} = \big(Z_{\mathrm{branch}_c(H^c)} \circ_{\zeta \cup \alpha_{J_H}} Z_{\mathrm{branch}_c(H^c)}\big)\big((\eta, \eta) \mapsto (\eta', \eta)\big),$$
where $\eta$ is the duplicate mode label of $K_{\mathrm{branch}_c(H^c)}$, i.e. $K_{\mathrm{branch}_c(H^c)} = K_{\mathrm{branch}_c(H^c)}(\eta, \eta)$. It is
$$\bigcirc^\zeta_{h \in H^c}\big(B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h}\big) = \mathbf{N}'_{\mathrm{branch}_c(H^c)} \circ \mathbf{L}'_{\mathrm{branch}_c(H^c)} \circ_\zeta \mathbf{L}_{\mathrm{branch}_c(H^c)} \circ \mathbf{N}_{\mathrm{branch}_c(H^c)}.$$
Further,
$$\bigcirc^\zeta_{h \in H}\Big(\bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w\Big) = \mathrm{trace}_{\alpha_{J_H}}\big(\mathbf{L}'_{\mathrm{branch}_c(H)} \circ_\zeta \mathbf{L}_{\mathrm{branch}_c(H)}\big)$$
and
$$\mathrm{trace}_{\alpha_{J_H}}\big(\mathbf{L}'_{\mathrm{branch}_c(H)} \circ_\zeta \mathbf{L}_{\mathrm{branch}_c(H)}\big) \circ \big(\mathbf{L}'_{\mathrm{branch}_c(H^c)} \circ_\zeta \mathbf{L}_{\mathrm{branch}_c(H^c)}\big) \circ \big(L'_c \circ_\zeta L_c\big) = \mathrm{trace}_{\alpha_{J_H}}\big(L' \circ_\zeta L\big).$$
With the last two identities, it follows that
$$\widetilde{K}'_{\mathrm{branch}_c(H^c)} = \mathbf{N}'_{\mathrm{branch}_c(H^c)} \circ \mathrm{trace}_{\alpha_{J_H}}\big(L' \circ_\zeta L\big) \circ \mathbf{N}_{\mathrm{branch}_c(H^c)} \overset{\text{Eq. (5.42)}}{=} \big(Z_{\mathrm{branch}_c(H^c)} \circ_{\zeta \cup \alpha_{J_H}} Z_{\mathrm{branch}_c(H^c)}\big)',$$
which finishes the proof.

5.5.3 Simplifications for Tensor Completion

In the special case of tensor completion, Section 4.5.2, the network $\mathbf{L}$ has no inner mode names. We further have $T_w(\zeta = i) = 1$ for all $i = 1, \ldots, m$, $w \in V$ (cf. Eq. (5.45)) and $\bigcirc_{w \in V} T_w = m \in \mathbb{N}$ (the number of sampling points). Thereby, for $H^c \neq \emptyset$,
$$K_{\mathrm{branch}_v(H^c)} = Z_{\mathrm{branch}_v(H^c)} \circ_{\zeta \cup \alpha_{J_H}} Z_{\mathrm{branch}_v(H^c)} = \bigcirc^\zeta_{h \in H^c}\big(B^{(LN)}_{v,h} \circ^\zeta_\emptyset B^{(LN)}_{v,h}\big) \circ \big(L_v \circ^\zeta_\emptyset L_v\big) = \Big(L_v \circ \bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}\Big) \circ_\zeta \Big(L_v \circ \bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}\Big).$$
Furthermore, for $v = v^{(\mathrm{outer})}_j \in V^{(\mathrm{outer})}$ and $i \in [m]^{(j,\ell)}_P$ (cf. Eq. (4.32)), it holds
$$\Big(L_v \circ \bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}\Big)(\zeta = i, \alpha_j = \ell) = \bigcirc_{h \in H^c} B^{(LN)}_{v,h}(\zeta = i).$$
Another defining property of $L_v = L_v(\zeta, \alpha_j)$ is that
$$(L_v \circ_\zeta X)(\alpha_j = \ell) = \mathrm{sum}_\zeta\big(X(\zeta \in [m]^{(j,\ell)}_P)\big), \quad (5.46)$$
for any node $X = X(\zeta, \gamma)$, where $\zeta, \alpha_j \notin \gamma$. For $H^c \neq \emptyset$, we hence obtain
$$\big(K_{\mathrm{branch}_v(H^c)} \circ N_v^+\big)(\alpha_j = \ell) = \Big(L_v \circ_\zeta \Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h} \circ_\emptyset \big(L_v \circ \bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h} \circ N_v^+\big)\Big)\Big)(\alpha_j = \ell)$$
$$\overset{\text{Eq. (5.46)}}{=} \mathrm{sum}_\zeta\Big(\Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}\Big) \circ_\emptyset \Big(L_v \circ \bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h} \circ N_v^+\Big)\Big)\big(\zeta \in [m]^{(j,\ell)}_P\big)$$
$$\overset{\text{Eq. (2.34)}}{=} \Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}(\zeta \in [m]^{(j,\ell)}_P)\Big) \circ_\zeta \Big(L_v(\zeta \in [m]^{(j,\ell)}_P) \circ \bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}(\zeta \in [m]^{(j,\ell)}_P) \circ N_v^+\Big)$$
$$\overset{\text{Eq. (4.34)}}{=} \Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}(\zeta \in [m]^{(j,\ell)}_P)\Big) \circ_\zeta \Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}(\zeta \in [m]^{(j,\ell)}_P) \circ N_v^+(\alpha_j = \ell)\Big).$$
In the special case $H^c = \emptyset$ (cf. Theorem 5.12), we have
$$\big(K_{\mathrm{branch}_v(H^c)} \circ N_v^+\big)(\alpha_j = \ell) = \big(L_v \circ_\zeta L_v \circ N_v^+\big)(\alpha_j = \ell) = \big|[m]^{(j,\ell)}_P\big| \cdot N_v^+(\alpha_j = \ell).$$


If, on the other hand, $N_v$ does not contain an outer mode label, then $L_v \equiv 1$ can be omitted. In that case, for $H^c \neq \emptyset$,
$$K_{\mathrm{branch}_v(H^c)} \circ N_v^+ = \Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h}\Big) \circ_\zeta \Big(\bigcirc^\zeta_{h \in H^c} B^{(LN)}_{v,h} \circ N_v^+\Big).$$
Otherwise, $K_{\mathrm{branch}_v(H^c)} \circ N_v^+ = m \cdot N_v^+$. The branch products $B^{(LN)}_{v,h}$ and the multiplication with such terms can be performed efficiently as described in Section 4.5.2. The often more efficient multiplication using $W^{\sigma\,\prime}_{\neq v}$ as in Proposition 5.25 can be simplified similarly, and in case of $v = v^{(\mathrm{outer})}_j \in V^{(\mathrm{outer})}$ also be handled separately for each $\zeta \in [m]^{(j,\ell)}_P$, $\ell = 1, \ldots, n_j$, i.e.
$$\big(W^{\sigma\,\prime}_{\neq v} \circ N_v\big)(\alpha'_j = \ell) = \bigcirc_{h \in \mathrm{neighbor}(v)} \Big(B^{(LN)\,\prime}_{v,h}(\zeta \in [m]^{(j,\ell)}_P) \circ_\zeta B^{(LN)}_{v,h}(\zeta \in [m]^{(j,\ell)}_P) + \mathbb{1}_\zeta \circ_\emptyset \frac{\omega^2}{c_h}\big(\Sigma^{p-2}_{v,h}\big)'\Big) \circ N_v(\alpha_j = \ell),$$
where $\mathbb{1}_\zeta = \mathbb{1}_\zeta(\zeta \in [m]^{(j,\ell)}_P) \equiv 1$. The unscaled model Eq. (5.37) is significantly less technical, and computationally less demanding. There are, however, arguments for these scalings, in particular for tensor completion, as we discuss in the next section. Either model may be a preferable choice depending on the situation.

5.5.4 Fixed Points of Idealized Stable Alternating Least Squares for Tensors

As in Section 5.4.3, we may, albeit naively, assume that each multiplication of a matrix with orthogonal matrices $U$ results in the corresponding expectancy value (cf. Eq. (5.31)). We continue to denote the use of this simplification by (∗). If the network $\mathbf{N}$ is normalized such that the conditions in Theorem 5.24 are fulfilled, then, with regard to Corollary 5.23, we have that
$$B^{(LN)\,\prime}_{c,h} \circ_\zeta B^{(LN)}_{c,h} = \mathbf{N}'_{\mathrm{branch}_c(h)} \circ \big(\mathbf{L}'_{\mathrm{branch}_c(h)} \circ_\zeta \mathbf{L}_{\mathrm{branch}_c(h)}\big) \circ \mathbf{N}_{\mathrm{branch}_c(h)}$$
$$\overset{*}{=} \frac{1}{c_h}\,\mathrm{trace}_{\alpha_{J_{c,h}}}\big(\mathbf{L}'_{\mathrm{branch}_c(h)} \circ_\zeta \mathbf{L}_{\mathrm{branch}_c(h)}\big) \circ I'_{\beta_{i_h}} = \frac{1}{c_h}\, \bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w \circ I'_{\beta_{i_h}},$$
where $I'_{\beta_{i_h}} = I'_{\beta_{i_h}}(\beta'_{i_h}, \beta_{i_h}) \in \mathbb{R}^{r_{c,h} \times r_{c,h}}$ is the identity matrix with mode label $\beta_{i_h} = m(c, h)$ (it hence has the same mode names and size as $\Sigma'_{c,h}$). Inserting the upper identity into $W^{\sigma\,\prime}_{\neq c}$ yields
$$W^{\sigma\,\prime}_{\neq c} \overset{*}{=} \bigcirc^\zeta_{h \in \mathrm{neighbor}(c)}\Big(\frac{1}{c_h}\, \bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w \circ I'_{\beta_{i_h}} + \bigcirc^\zeta_{w \in \mathrm{branch}_c(h)} T_w \circ \frac{\omega^2}{c_h}\big(\Sigma^{p-2}_{c,h}\big)'\Big) \circ \big(L'_c \circ_\zeta L_c\big)$$
$$= \frac{1}{c_{\mathrm{neighbor}(c)}}\Big(\Big(\bigcirc^\zeta_{w \neq c} T_w\Big) \circ \big(L'_c \circ_\zeta L_c\big)\Big) \circ \bigcirc_{h \in \mathrm{neighbor}(c)}\Big(I'_{\beta_{i_h}} + \omega^2\big(\Sigma^{p-2}_{c,h}\big)'\Big)$$
$$= \frac{1}{c_{\mathrm{neighbor}(c)}}\,\mathrm{trace}_{\alpha_{J_{\mathrm{neighbor}(c)}}}\big(L' \circ_\zeta L\big) \circ \bigcirc_{h \in \mathrm{neighbor}(c)}\Big(I'_{\beta_{i_h}} + \omega^2\big(\Sigma^{p-2}_{c,h}\big)'\Big).$$

Let now $\mathbf{M}^{\sigma^{(M)}} = \{M_w\}_{w \in V} \cup \{\sigma^{(M)}_e\}_{e \in E}$ be the tree SVD of the sought tensor $M$. We assume further (see Remark 5.20 for the reasons behind this assumption) that the sought tensor $M$ only differs from $N$ by the node $N_c$, in the sense that $M_w = N_w$ for all $w \in V \setminus \{c\}$ and $\sigma^{(M)}_e = \sigma_e$ for all $e \in E$ with $c \notin e$. Then the node of $M$ at $c$ in the corresponding $c$-orthogonal gauge reads $\bigcirc_{h \in \mathrm{neighbor}(c)} \Sigma^{(M)}_{c,h} \circ M_c$, and
$$Z_{\neq c} \circ_\zeta y = \big(L \circ \mathbf{N}_{\neq c}\big) \circ_\zeta \Big(L \circ \mathbf{N}_{\neq c} \circ \bigcirc_{h \in \mathrm{neighbor}(c)} \Sigma^{(M)}_{c,h} \circ M_c\Big)$$
$$\overset{*}{=} \frac{1}{c_{\mathrm{neighbor}(c)}}\,\mathrm{trace}_{\alpha_{J_{\mathrm{neighbor}(c)}}}\big(L' \circ_\zeta L\big) \circ \bigcirc_{h \in \mathrm{neighbor}(c)} \Sigma^{(M)}_{c,h} \circ M_c.$$
We summarize the situation in the following remark.

Remark 5.26 (Idealized fixed points). Under the naive assumption made in the above discussion, the solution to the equation $W^\sigma_{\neq c} \circ N_c^+ = Z_{\neq c} \circ_\zeta y$ is given by
$$N_c^+ \overset{*}{=} \bigcirc_{h \in \mathrm{neighbor}(c)}\Big(\big(I_{\beta_{i_h}} + \omega^2 \Sigma^{p-2}_{c,h}\big)^{-1} \circ \Sigma^{(M)}_{c,h}\Big) \circ M_c,$$
for (∗) as in Eq. (5.31).

Since the previous iterate satisfied $N_c = \bigcirc_{h \in \mathrm{neighbor}(c)} \Sigma_{c,h} \circ N^\sigma_c$, the singular values of $N$ after the update are hence modified versions of those of the target tensor $M$ (at least under naive assumptions). This adaption follows the same scheme as in the matrix case, separately for each one of the neighboring singular values of $N_c$ (cf. Fig. 5.4 for $p = 0$). These scalings, as opposed to the unscaled model Eq. (5.37), thus at least allow for such particular modifications.

5.6 The Stable ALS Method for Tree Tensor Networks

The stable alternating least squares approximation, abbreviated Salsa, proceeds like ordinary ALS, using micro-steps $(\mathcal{M}^{(v)}_L)^*$, $v \in V$. We redefine the updates through the specific representations in Theorem 5.24 by
$$(\mathcal{M}^{(v)}_L)^*(\mathbf{N}) := \{N_w\}_{w \neq v} \cup \{N_v^+\}.$$
Now, however, separately for each $v \in V$, we allow the constants $\zeta_H$, $H \subset \mathrm{neighbor}(v)$, to be chosen freely. Then, $(\mathcal{M}^{(v)}_L)^*$ is uniquely extended to all equivalent representations $\mathbf{N}$, i.e. also to those which first have to be normalized in order to fulfill the requirements in Theorem 5.24. Under suitable choices of constants $s^{(v)}_w$, we thereby (up to machine precision, as explained below Theorem 5.24) obtain the same micro-steps as given through the solutions of the variational residual function introduced in Definition 5.21. Note that it is also possible to (re)define the micro-steps directly on all representations, but this involves further technicalities. Analogous to the matrix case (cf. Theorem 5.15) and the tensor train format (cf. [42]), each single one of these micro-steps is representation independent and indeed stable in the sense of Definition 5.1 if the positive values $\zeta_H$, $H \subset \mathrm{neighbor}(v)$, are chosen as constants which are independent of the rank (e.g. $\zeta_H \equiv 1$).

Here, for each rank $r \in \mathbb{N}^K$, the data space $D_r$ is the set of representations $\mathbf{N}$ with rank $r$, and the representation map is $\tau_r(\mathbf{N}) = \bigcirc_{v \in V} N_v$ as introduced earlier in Eq. (5.34).

For each fixed $c$, if $\mathbf{N}$ is $c$-orthogonal, the singular values $\sigma_{c,h} = \sigma_{c,h}(\beta_{i_h})$ neighboring the node $c$ can, due to Theorem 3.11, easily be calculated through appropriate SVDs of $N_c$ only, i.e.
$$\sigma_{c,h} = \mathrm{sv}\big(N_c^{(\beta_{i_h})}\big) = \mathrm{sv}\big(N^{(\alpha_{J_{c,h}})}\big). \quad (5.47)$$
It is hence not necessary to calculate the entire tree SVD, or normal form, of $\mathbf{N}$ before each update. As in the matrix case, every single singular value is modified (e.g. bounded from below) by a value $\sigma_{\min}$ (cf. Section 5.3.2). In Theorem 5.24, one therefore does not use the true singular values, but modifies them beforehand via $\sigma_{c,h}(\beta_{i_h} = j) \leftarrow \max(\sigma_{c,h}(\beta_{i_h} = j), \sigma_{\min})$ for all $j = 1, \ldots, r_{c,h}$.
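As a concrete illustration of Eq. (5.47), the following numpy sketch computes the singular values associated with one neighboring edge by reshaping the ($c$-orthogonal) root core so that the corresponding bond mode is separated; the axis convention is an assumption for the example.

```python
import numpy as np

def neighboring_singular_values(core, bond_axis):
    """Singular values sigma_{c,h} from the c-orthogonal core alone, cf. Eq. (5.47).

    core: ndarray of the root node N_c; bond_axis: position of the mode beta_{i_h}
    shared with neighbor h. Requires that the rest of the network is c-orthogonal.
    """
    moved = np.moveaxis(core, bond_axis, 0)          # bring the bond mode to the front
    mat = moved.reshape(moved.shape[0], -1)          # matricization N_c^{(beta_{i_h})}
    return np.linalg.svd(mat, compute_uv=False)

# example: a random order-3 core with bond dimensions (3, 4) and mode size 5
core = np.random.default_rng(0).standard_normal((3, 5, 4))
print(neighboring_singular_values(core, bond_axis=2))   # sigma_{c,h} for the second bond
```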

As $\mathbf{L}$ is assumed to be given in the same network structure as $\mathbf{N}$ (both corresponding to the graph $G$), all branch-wise evaluations can be performed efficiently as discussed in Chapter 4.

5.6.1 Preconditioned, Coarse, Alternating CG

We discuss how to iteratively solve the system Eq. (5.43), based on the product representation in Proposition 5.25, by a (coarse) conjugate gradient method, similar to Section 4.4. Since we repeatedly have to multiply with the matrix $W^{\sigma\,\prime}_{\neq v}$, each factor
$$w'_{v,h} := B^{(LN)\,\prime}_{v,h} \circ_\zeta B^{(LN)}_{v,h} + \bigcirc^\zeta_{w \in \mathrm{branch}_v(h)} T_w \circ \frac{\omega^2}{c_h}\big(\Sigma^{p-2}_{v,h}\big)', \qquad W^{\sigma\,\prime}_{\neq v} = \bigcirc^\zeta_{h \in \mathrm{neighbor}(v)} w'_{v,h} \circ \big(L'_v \circ_\zeta L_v\big),$$
is evaluated before the CG iteration. Furthermore, the terms $\bigcirc^\zeta_{w \in \mathrm{branch}_v(h)} T_w$ are independent of $\mathbf{N}$, and hence only need to be calculated once along with the representation $\mathbf{L}$. The branch products $B^{(LN)}_{v,h}$ are handled efficiently as described in Chapter 4. Alternatively, depending on the number of neighbors of the node $v$, using the sum representation Eq. (5.44) can be more efficient (as, for example, for the tensor train format). As in Section 4.4.2, the diagonal of $W^{\sigma\,\prime}_{\neq v} = W^{\sigma\,\prime}_{\neq v}(\gamma', \gamma)$, $\gamma = m(N_v)$, is a good preconditioner and can be calculated with negligible computational cost. If $N_v$ has an outer mode label $\alpha_j$, that is $v = v^{(\mathrm{outer})}_j$ (cf. Definition 4.1), then
$$\mathrm{diag}_\gamma\big(W^{\sigma\,\prime}_{\neq v}\big) = \bigcirc^\zeta_{h \in \mathrm{neighbor}(v)} \mathrm{diag}_{\beta_{i_h}}(w'_{v,h}) \circ \mathrm{diag}_{\alpha_j}\big(L'_v \circ_\zeta L_v\big) = \bigcirc^\zeta_{h \in \mathrm{neighbor}(v)} \mathrm{diag}_{\beta_{i_h}}(w'_{v,h}) \circ \big(L'_v(\alpha'_j \mapsto \alpha_j) \circ_{\zeta, \alpha_j} L_v\big).$$
Otherwise,
$$\mathrm{diag}_\gamma\big(W^{\sigma\,\prime}_{\neq v}\big) = \bigcirc^\zeta_{h \in \mathrm{neighbor}(v)} \mathrm{diag}_{\beta_{i_h}}(w'_{v,h}) \circ \big(L'_v \circ_\zeta L_v\big).$$
The diagonals of $w'_{v,h}$, $h \in \mathrm{neighbor}(v)$, can be calculated explicitly, or via
$$\mathrm{diag}_{\beta_{i_h}}(w'_{v,h}) = B^{(LN)\,\prime}_{v,h}(\beta'_{i_h} \mapsto \beta_{i_h}) \circ_{\zeta, \beta_{i_h}} B^{(LN)}_{v,h} + \bigcirc^\zeta_{w \in \mathrm{branch}_v(h)} T_w \circ \frac{\omega^2}{c_h}\, \sigma^{p-2}_{v,h},$$
where the exponent in $\sigma^{p-2}_{v,h}$ is applied entrywise. The unscaled model Eq. (5.37) can likewise be evaluated using a coarse CG method, and with lower computational complexity, in particular if the node $v$ has more than two neighbors.
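The inner solver is a standard Jacobi-preconditioned conjugate gradient iteration, truncated after a coarse, fixed number of steps. The following generic numpy sketch (with a user-supplied matrix-vector product standing in for the contraction with $W^{\sigma\,\prime}_{\neq v}$) illustrates the scheme.

```python
import numpy as np

def coarse_pcg(apply_W, b, diag_W, x0, n_steps=10, tol=1e-10):
    """Few steps of preconditioned CG for W x = b with Jacobi preconditioner diag_W.

    apply_W: callable returning W @ x for a vector x (e.g. the contraction with the
    product representation of Proposition 5.25); diag_W: diagonal of W.
    """
    x = x0.copy()
    r = b - apply_W(x)
    z = r / diag_W
    p = z.copy()
    rz = r @ z
    for _ in range(n_steps):
        Wp = apply_W(p)
        alpha = rz / (p @ Wp)
        x += alpha * p
        r -= alpha * Wp
        if np.linalg.norm(r) < tol:
            break
        z = r / diag_W
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# toy usage with an SPD matrix standing in for W
A = np.diag(np.arange(1.0, 11.0)) + 0.1 * np.ones((10, 10))
b = np.ones(10)
x = coarse_pcg(lambda v: A @ v, b, np.diag(A), np.zeros(10))
```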

5.6.2 Semi-Implicit Rank Adaption and Practical Aspects

The same ideas based on the stability of the micro-steps that we used for matrix completion can analogously be applied to this more general setting. In the following, we discuss four main aspects which provide a close to complete description of the remaining algorithmic aspects of Salsa, and in particular rank adaption.


• Semi-implicit rank adaption: The rank adaption follows the same scheme as discussed in [42] for the tensor train case. The number of minor values (cf. Definition 5.17) for each of the tuples of singular values is kept constant. For certain constants $f_{\mathrm{minor}} \leq 1$ and $k_{\mathrm{minor}} \in \mathbb{N}$, we desire
$$\big|\{j = 1, \ldots, r_e \mid 0 < (\sigma_e)_j < f_{\mathrm{minor}} \cdot \omega\}\big| \overset{!}{=} k_{\mathrm{minor}}, \quad (5.48)$$
for each edge $e = \{v, h\} \in E$ (each corresponding to a set $J_e \in K$). During the optimization, independently for each edge, the representation $\mathbf{N}$ is truncated or enriched after each update such that Eq. (5.48) holds as an equality (if the theoretical bounds on the rank $r_e$ allow such⁹, cf. Eq. (3.22)). The behavior of these singular values during a run of Salsa is shown in Fig. 5.10 for the tensor train format. Furthermore, to bound the computational complexity, a common upper limit on all ranks is enforced, $r_e \leq r_{\mathrm{lim}}$. This limit should however be chosen large enough, ensuring that the reconstruction quality is not reduced.

The factor $f_{\mathrm{minor}}$ is ultimately chosen empirically, but is also related to the analysis in Section 5.4.3: for example, for $f_{\mathrm{minor}} = 1$, a value $\sigma$ is minor if it is smaller than $\omega$. Inserting $\sigma = \omega$ into the fixed point map $\sigma \mapsto (1 + \omega^2\sigma^{-2})^{-1}\sigma^{(M)}$ yields the value $(1 + \omega^2\sigma^{-2})^{-1}\sigma^{(M)} = \frac{1}{2}\sigma^{(M)}$. This is the point where the stable and repelling fixed points coincide, $f_{\mathrm{stab}} = f_{\mathrm{rep}}$.

⁹ For example, a binary HT representation has to be initialized with rank $r \equiv 2$. Otherwise, some single rank may not be increased without increasing another one as well.

• Adapting the lower limit $\sigma_{\min}$: The lower limit on the singular values, $\sigma_{\min}$, is set as a fraction $f_{\sigma_{\min}} \ll 1$ of the current residual,
$$\sigma_{\min} := f_{\sigma_{\min}} \cdot \frac{\sqrt{d(\alpha)}}{\sqrt{m}}\, \|L(N) - y\|, \quad (5.49)$$
which assumes that the $\ell_2$-norm of the operator $L$ is close to one, as is naturally the case for tensor completion. The fraction under the square root is the quotient of the total number of entries of the tensor $N$ (or $M$), $d(\alpha) = d(\alpha_1) \cdot \ldots \cdot d(\alpha_d) = n_1 \cdot \ldots \cdot n_d \in \mathbb{N}$, and the number of measurements or sampling points $m = d(\zeta) \in \mathbb{N}$ (as $y = y(\zeta) \in \mathbb{R}^m$).

• Adapting the parameter ω: Similarly to matrix completion, we assume that there is a small set of measurements or samples y_val ∈ R^{m_val} which can be used as validation set, where y_val = L_val(M) for the target tensor M (in practice, this set as well as the corresponding part of the operator can be split off from the given data beforehand). We define the two residuals R and R_val in iteration number iter as
\[
R^{(\mathrm{iter})} := \|L(N^{(\mathrm{iter})}) - y\|_2, \qquad R^{(\mathrm{iter})}_{\mathrm{val}} := \|L_{\mathrm{val}}(N^{(\mathrm{iter})}) - y_{\mathrm{val}}\|_2,
\]
where N^{(iter)} = τ_{r^{(iter)}}(N^{(iter)}) = ⊙_{v∈V} N_v^{(iter)} is the current iterate. As described in [42], the parameter ω > 0 is reduced by a factor f_ω after each iteration. This factor 1 < f_ω ∈ (f_ω^{(min)}, f_ω^{(max)}) is adapted after each iteration through a simple heuristic, aiming at
\[
\max\Bigl(\frac{R^{(\mathrm{iter})}}{R^{(\mathrm{iter}-1)}},\; \frac{R^{(\mathrm{iter})}_{\mathrm{val}}}{R^{(\mathrm{iter}-1)}_{\mathrm{val}}}\Bigr) \;\overset{!}{=}\; 1 + \varepsilon_{\mathrm{progr}}, \tag{5.50}
\]
for some fixed value ε_progr > 0. This adaption ensures that f_ω is neither so large as to impair the approximation quality, nor so small that the required runtime becomes unreasonably large.

⁹ For example, a binary HT representation has to be initialized with rank r ≡ 2. Otherwise, some single rank may not be increased without increasing another one as well.


• Termination criteria: The algorithm terminates if one of the following stopping criteria is fulfilled:
  • stagnation: ω ≪ σ_min and f_ω = f_ω^{(max)}
  • convergence: ω → 0 or R^{(iter)}/‖y‖₂ → 0
  • early stop: R_val^{(iter)} ≫ min_{i<iter} R_val^{(i)}

For all implementation details and practical tweaks, we refer to the Matlab codes for the tensor train format that are publicly available as the git repository salsa-implementation, or directly at

https://git.rwth-aachen.de/sebastian.kraemer1/salsa-implementation.
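To make the interplay of these heuristics concrete, the following sketch (Python, illustrative only and not the referenced Matlab code; the thresholds follow the text, while the concrete f_ω update rule shown here is merely one possible realization of the heuristic aiming at Eq. (5.50)) gathers the minor-value count of Eq. (5.48), the σ_min update of Eq. (5.49) and the f_ω adaption:

```python
import numpy as np

def count_minor(sigma_e, f_minor, omega):
    """Number of minor singular values on one edge, cf. Eq. (5.48)."""
    return int(np.sum((sigma_e > 0) & (sigma_e < f_minor * omega)))

def target_rank(sigma_e, f_minor, k_minor, omega, r_lim):
    """Edge rank for which Eq. (5.48) holds as equality (capped by r_lim)."""
    n_minor = count_minor(sigma_e, f_minor, omega)
    return max(1, min(len(sigma_e) - n_minor + k_minor, r_lim))

def update_sigma_min(f_sigma_min, d_alpha, m, residual_norm):
    """Lower limit for the singular values, cf. Eq. (5.49)."""
    return f_sigma_min * np.sqrt(d_alpha) / np.sqrt(m) * residual_norm

def update_f_omega(f_omega, R_ratio, R_val_ratio, eps_progr,
                   f_min=1.0005, f_max=1.1):
    """One possible heuristic steering f_omega towards Eq. (5.50)."""
    worst = max(R_ratio, R_val_ratio)       # residual change of last sweep
    if worst > 1 + eps_progr:               # quality deteriorates too fast:
        f_omega = max(f_min, f_omega * 0.9)   # reduce omega more slowly
    else:                                   # sufficient progress:
        f_omega = min(f_max, f_omega * 1.1)   # reduce omega faster
    return f_omega
```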

5.6.3 SALSA Sweep and Algorithm

We summarize the previously carried out recursion of branches (Section 4.5.1), the normalization of the network (Eq. (5.36)), the calculation and perturbation of neighboring singular values (Eq. (5.47)), as well as the application of the CG method (Sections 4.4 and 5.6.1) in Algorithm 12. The algorithm Salsa (Algorithm 13) repeatedly applies the sweeps described in Algorithm 12. Therein, we summarize the aspects described in Section 5.6.2. Note that slight differences to the actual implementation nonetheless remain, and we again refer to the Matlab implementation for further details.


Algorithm 12 Salsa Sweep

Input: limit σ_min, parameter ω, initial guess N such that N is orthogonal with respect to the root node c, as well as network L
Output: updated tree tensor network N and branch evaluations B after application of the micro-steps (M_L^{(v)})^*, v ∈ V

1: procedure tsalsasweep(N, L, B, c, o, σ_min, ω)
2:   set v ← c
3:   for µ = 1, . . . , |V| do
4:     let p = (p_1, . . . , p_k) be the path in G from v to o_µ            ▷ analogous to Algorithm 2
5:     for i = 1, . . . , k − 1 do
6:       set γ ← m(p_i, p_{i+1})                                           ▷ m(p_i, p_{i+1}) = m(p_i) ∩ m(p_{i+1})
7:       do an SVD of N_{p_i} with respect to m(N_{p_i}) \ γ:
             N_{p_i} = (U, s, V_t),  s = s(γ),  V_t = V_t(γ, γ)
8:       set N_{p_i} ← U                                                   ▷ N_{p_i} is now γ-orthogonal
9:       set N_{p_{i+1}} ← diag(s) ∘ V_t ∘ N_{p_{i+1}}
10:      set B^{(LN)}_{p_{i+1},p_i} ← (L_{p_i} ∘ N_{p_i}) ∘_ζ ⊙_{h ∈ neighbor(p_i)\{p_{i+1}}} B^{(LN)}_{p_i,h}     ▷ cf. Section 4.5.1
11:    end for
12:    set v ← o_µ                                                         ▷ network N is again v-orthogonal
13:    for h ∈ neighbor(v) do          ▷ calculate singular values and establish gauge constraints for Theorem 5.24
14:      set γ ← m(v, h)                                                   ▷ m(v, h) = m(v) ∩ m(h)
15:      calculate the SVD of N_v with respect to m(N_v) \ γ:              ▷ cf. Eq. (5.47)
             N_v = (U, s, V_t),  s = s(γ),  V_t = V_t(γ, γ)
16:      set N_v ← U ∘ diag(s)
17:      set B^{(LN)}_{v,h} ← V_t ∘ B^{(LN)}_{v,h}
18:      set σ_{v,h}(γ = i) ← max(σ_min, s(γ = i)) for i = 1, . . . , r_{v,h}    ▷ cf. Section 5.6.2
19:    end for
20:    perform the coarse CG method to apply the micro-step (M_L^{(v)})^*:   ▷ cf. Section 5.6.1
             N_v ← N_v^+
21:  end for
22:  return updated N and B
23: end procedure


Algorithm 13 Salsa Algorithm

Input: low-rank operator representation L, measurements y ∈ R^m and hierarchical family K (as well as a root node c ∈ V and a fixed order of nodes o = (o_1, . . . , o_{|V|}), V = {o_i}_{i=1}^{|V|}, o_{|V|} = c)
Output: approximate recovery N of M, y = L(M), in form of a network N

1: procedure tensorsalsa(K, L, y)
2:   initialize the network N s.t. N ≡ ‖y‖_1/m for r ≡ 2, corresponding to K
3:   set ω ← ½ · ‖N‖_F
4:   orthogonalize N ← ortho(N, c)                                         ▷ N is now c-orthogonal
5:   initialize the branch evaluations B^{(LN)}_{c,h} = ⊙^ζ_{w ∈ branch_c(h)} (L_w ∘ N_w) for h ∈ neighbor(c)
6:   split off a small validation set                                      ▷ denoted using L_val and y_val
7:   for iter = 1, 2, . . . do
8:     renew the lower limit σ_min ← f_{σ_min} · (√d(α)/√m) ‖L(N) − y‖     ▷ cf. Eq. (5.49)
9:     (N, B) ← tsalsasweep(N, B, c, o, σ_min, ω)                          ▷ Algorithm 12
10:    decrease ω ← ω/f_ω                                                  ▷ cf. Eq. (5.50)
11:    adapt the ranks w.r.t. the number of minor singular values          ▷ according to Eq. (5.48)
12:    if a stopping criterion applies then                                ▷ cf. Section 5.6.2
13:      return the iterate N for which ‖L_val(N) − y_val‖ was lowest      ▷ cf. Section 5.6.2
14:    end if
15:  end for
16: end procedure


5.7 Numerical Experiments

We compare the stable ALS approximation (Salsa) introduced in Section 5.6.2 with the unregularized ALS method as discussed in Section 4.5.2 for various toy problems. As the latter algorithm is not adaptive, we equip it with a heuristic, greedy rank adaption strategy (cf. [42]). In one example (cf. Fig. 5.9), we additionally consider the unscaled version of Salsa, which works analogously but with micro-steps as in Eq. (5.37). We call this slight modification Rwals (due to its relation to reweighted (alternating) least squares). The article [42] also considers a comparison to Rttc, a Riemannian optimization algorithm (cf. [94]). In all cases, we choose the tensor train format as tree network as well as uniform mode sizes n_1 = . . . = n_d = n ∈ N.

5.7.1 Data Acquisition and Implementational Details

Discrete tensor completion relies on a sufficient amount of sampling $|[m]_P^{(j,\ell)}|$ for each slice M(α_j = ℓ), ℓ = 1, . . . , n_j, j = 1, . . . , d. We hence generate P in a quasi-random way, for a to be chosen value r_P ∈ N, as follows: for each µ = 1, . . . , d and each i_µ = 1, . . . , n_µ, we pick the remaining indices i_1, . . . , i_{µ−1}, i_{µ+1}, . . . , i_d at random (uniformly) c_sf · r_P² times. This gives in total m = |P| = c_sf · d n r_P² samples (excluding duplicate samples). The value r_P is artificially chosen, but interpreted as the to be expected rank of the recovery N. The factor c_sf is hence an oversampling factor, since the number of degrees of freedom of a tensor N with uniform TT-ranks r ∈ N is slightly less than d n r².
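A minimal sketch of this quasi-random sampling strategy (Python; uniform mode size n and all names are illustrative) could read:

```python
import numpy as np

def generate_sampling(d, n, r_P, c_sf, seed=0):
    """Sampling set P in which every slice (mu, i_mu) is hit
    c_sf * r_P**2 times; duplicates are removed afterwards."""
    rng = np.random.default_rng(seed)
    P = set()
    for mu in range(d):                        # fixed mode
        for i_mu in range(n):                  # fixed slice index
            for _ in range(c_sf * r_P ** 2):
                idx = rng.integers(0, n, size=d)   # remaining indices uniform
                idx[mu] = i_mu
                P.add(tuple(idx))
    return sorted(P)    # |P| <= c_sf * d * n * r_P**2 (duplicates excluded)

P = generate_sampling(d=6, n=12, r_P=6, c_sf=2)
print(len(P), "samples for a tensor with", 12 ** 6, "entries")
```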

In each trial, we generate a disjoint test set C = {c_1, . . . , c_m} ⊂ Ω_α in the same way, and of equal cardinality as P, together with measurements y_test ∈ R^m, (y_test)_i = M(α = c_i), i = 1, . . . , m. These parts are not used during the optimization, but solely to evaluate the results.

The sweep order is chosen as o = (1, 2, . . . , h, d, d − 1, . . . , h), h = ⌊d/2⌋, although other reasonable sweep orders yield very similar results.

We perform each task 20 times, solely for different sampling and test sets. After each run of the respective algorithms, we save the residual R_test and the relative residual
\[
R_{\mathrm{test,rel}} = R_{\mathrm{test}}/\|y_{\mathrm{test}}\|_2, \qquad R_{\mathrm{test}}^2 := \|L_{\mathrm{test}}(N) - y_{\mathrm{test}}\|_2^2 = \sum_{i=1}^{m} \bigl(N(\boldsymbol\alpha = c_i) - (y_{\mathrm{test}})_i\bigr)^2,
\]
where N is the chosen iterate with lowest validation residual. These normalized residuals of all 20 trials are then averaged with respect to the geometric mean, and yield the overall result for that task. The parameter choices for Salsa are given by f_minor = 1/2, k_minor = 2, f_{σ_min} = 1/10, m_val = (15/100)·m, (f_ω^{(min)}, f_ω^{(max)}) = (1.0005, 1.1), ε_progr = 0.005. These values have in parts been chosen empirically and are recommendable for other problems. The limit r_lim varies depending on the magnitude m of the sampling set P.

5.7.2 Numerical Results

• Approximation of a tensor with near uniformly declining singular values: We consider the completion of the tensor D = D(α) ∈ R^{n×...×n} defined through
\[
D(\boldsymbol\alpha = i) := \Bigl(1 + \sum_{\mu=1}^{d-1} \frac{i_\mu}{i_{\mu+1}}\Bigr)^{-1}, \qquad i = (i_1,\ldots,i_d),\ i_\mu = 1,\ldots,n,\ \mu = 1,\ldots,d. \tag{5.51}
\]


The modes are naturally ordered and the tensor has uniformly, exponentially decaying singular values. It can therefore be approximated very well with uniform ranks, but this is not trivial for an adaptive algorithm to recognize. We thus expect that plain (greedy adaptive) ALS can barely be outperformed. The results are plotted in Fig. 5.8.
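For reference, a direct entrywise construction of D as reconstructed in Eq. (5.51) could look as follows (Python sketch; forming the full tensor is of course only feasible for small d and n, while the completion algorithms only ever sample single entries):

```python
import numpy as np

def build_D(d, n):
    """Full tensor D following the formula of Eq. (5.51); indices i_mu = 1..n."""
    grids = np.meshgrid(*[np.arange(1, n + 1)] * d, indexing="ij")
    s = sum(grids[mu] / grids[mu + 1] for mu in range(d - 1))
    return 1.0 / (1.0 + s)

D = build_D(d=4, n=12)          # small instance only
print(D.shape, D[0, 0, 0, 0])
```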

Figure 5.8: (d = 6, 9, 15, r_P = 6, r_lim = 14, n = 12, 20, c_sf = 2, 4, 6) Plotted are, for the tensor D (Eq. (5.51)), for varying dimension and mode size, the averaged relative test residuals as functions of the sampling size m = |P| as result of each 20 trials, for ALS (black) and Salsa (blue, filled symbols). The markers are exact; the intermediate lines are shape-preserving piecewise cubic Hermite interpolations of such.

Despite the comparatively easy task, Salsa still slightly outperforms plain ALS. The additional regularization introduced in Salsa might thus have benefits beyond rank adaption alone.

• Approximation of three generic tensors with non-uniform sets of singular values: The function-based tensors we consider here are not subject to an ordering of modes as desired for the tensor train format. We define the multivariate functions

\begin{align*}
f^{(1)}(i_1,\ldots,i_8) &:= \frac{i_1}{4}\cos(i_3 - i_8) + \frac{i_2^2}{i_1 + i_6 + i_7} + \frac{i_5}{3}\sin(i_6 + i_3),\\
f^{(2)}(i_1,\ldots,i_7) &:= \Bigl(\frac{i_4}{i_2 + i_6} + i_1 + i_3 - i_5 - i_7\Bigr)^2,\\
f^{(3)}(i_1,\ldots,i_{11}) &:= \sqrt{i_2 + i_3} + \frac{1}{10}(i_4 + i_5 + i_7 + i_8 + i_9) + \frac{1}{20}(i_1 - i_6 - i_{10} + i_{11})^2.
\end{align*}

These in turn define tensors F^{(k)} = F^{(k)}(α) ∈ R^{n×...×n} given through F^{(k)}(α = i) = f^{(k)}(i), i_µ = 1, . . . , n, µ = 1, . . . , d_k, k = 1, 2, 3, for varying dimensions d_1 = 8, d_2 = 7 and d_3 = 11. The results are shown in Fig. 5.9.
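The three functions, as reconstructed above, can be evaluated pointwise; a small sketch (Python) that serves as the entrywise sampling oracle F^{(k)}(α = i) = f^{(k)}(i):

```python
import numpy as np

def f1(i):   # i = (i_1, ..., i_8), 1-based index values
    return (i[0] / 4 * np.cos(i[2] - i[7])
            + i[1] ** 2 / (i[0] + i[5] + i[6])
            + i[4] / 3 * np.sin(i[5] + i[2]))

def f2(i):   # i = (i_1, ..., i_7)
    return (i[3] / (i[1] + i[5]) + i[0] + i[2] - i[4] - i[6]) ** 2

def f3(i):   # i = (i_1, ..., i_11)
    return (np.sqrt(i[1] + i[2])
            + 0.1 * (i[3] + i[4] + i[6] + i[7] + i[8])
            + 0.05 * (i[0] - i[5] - i[9] + i[10]) ** 2)

print(f1((1, 2, 3, 4, 5, 6, 7, 8)))   # one sample entry of F^(1)
```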


Figure 5.9: (d_1 = 8, d_2 = 7, d_3 = 11, r_P = 6, r_lim = 10, n = 8, c_sf = 2, 4, 6) Plotted are, for the tensors F^{(1)} (left), F^{(2)} (middle) and F^{(3)} (right), the averaged relative test residuals and shadings proportional to the standard deviations as functions of the sampling size m = |P| as result of each 20 trials, for ALS (black), Salsa (blue, filled symbols) and Rwals (blue, crosses). The markers are exact; the intermediate lines are shape-preserving piecewise cubic Hermite interpolations of such.

Figure 5.10: Plot of the TT-singular values for the completion of the tensor F^{(1)}, in iteration iter = 76 (left), iter = 93 (middle) and iter = 110 (right). Displayed in each column are the entries of σ^{(µ)}_TT (cf. Section 3.4.1), µ = 1, . . . , d_1 − 1, on a logarithmic scale, separated into stable singular values (blue, filled) and minor ones (teal). Note that the parameter ω slowly declines.

The quality of reconstruction for the first and third function differs by orders of magnitude. A closer inspection of the approximation reveals that plain ALS often gets stuck at a point where it frequently, but ineffectively, increases ranks. The difference for the second function is smaller, yet the variance of the results is reduced drastically. The unscaled version Rwals is marginally worse than Salsa, which is not surprising due to the homogeneity of the sampling operator. In Section 6.4, we observe larger deviations between the continuous versions of these two algorithms.

• Completion of random low-rank tensors: We next consider the recovery of quasi-random tensors with (exact) low ranks. The TT-ranks r = (r_1, . . . , r_{d−1}) are generated randomly (uniformly), but it is ensured that (1/(d−1)) Σ_{µ=1}^{d−1} r_µ ≥ (2/3)k and max(r) ≤ k (cf. Definition 3.18) for varying k ∈ N. Each tensor is generated via a TT representation G = ⊙_{µ=1}^{d} G_µ, where we assign to each entry of each node G_1, . . . , G_d a uniformly distributed value in [−0.5, 0.5]. Subsequently, we apply the alternating projection method (Algorithm 14) until each entry of the TT-singular values σ^{(1)}, . . . , σ^{(d−1)} equals a random value in [0, 1] (up to a common scaling factor). As results, we plot the number of successful recoveries, R_test,rel < 10^{−5}, for varying mode sizes n, dimensions d and maximal rank k of the target tensors (Figs. 5.11 and 5.12).
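The generation of the quasi-random target tensors can be sketched as follows (Python; the subsequent adjustment of the TT-singular values via the alternating projection method, Algorithm 14, is omitted here):

```python
import numpy as np

def random_tt_ranks(d, k, rng):
    """Uniformly random TT-ranks with mean >= 2k/3 and max <= k."""
    while True:
        r = rng.integers(1, k + 1, size=d - 1)
        if r.mean() >= 2 * k / 3:
            return r

def random_tt_tensor(d, n, k, seed=0):
    """TT cores G_1, ..., G_d with entries uniform in [-0.5, 0.5]."""
    rng = np.random.default_rng(seed)
    r = np.concatenate(([1], random_tt_ranks(d, k, rng), [1]))
    return [rng.uniform(-0.5, 0.5, size=(r[mu], n, r[mu + 1]))
            for mu in range(d)]

def tt_entry(cores, idx):
    """Evaluate a single entry M(i_1, ..., i_d) of the TT tensor."""
    v = cores[0][:, idx[0], :]
    for mu in range(1, len(cores)):
        v = v @ cores[mu][:, idx[mu], :]
    return float(v[0, 0])

cores = random_tt_tensor(d=5, n=12, k=6)
print(tt_entry(cores, (0, 1, 2, 3, 4)))
```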


Figure 5.11: (d = 5, 6, 7, r_P = 6, r_lim = 9, n = 8, 12, 16, 20, c_sf = 2, 4, 8, 16, 32, 64) Displayed as 20 shades of blue (black (0) to white (all 20)) are the numbers of successful reconstructions for random tensors with maximal rank k = 6 for ALS and Salsa.

Figure 5.12: (d = 5, 6, 7, r_P = 8, r_lim = 11, n = 12, 16, 20, c_sf = 2, 4, 8, 16, 32, 64) Displayed as 20 shades of blue (black (0) to white (all 20)) are the numbers of successful reconstructions for random tensors with maximal rank k = 8 for ALS and Salsa.

While in both cases higher dimensions require more sampling, Salsa overall needs about 4 to 8 times fewer samples than plain ALS to achieve the same results.

• Completion of the rank adaption test tensor: We consider the recovery of the tensor defined in Example 5.4, which is specifically designed to make rank adaption difficult. There are certain freely selectable parts of it, which we choose randomly in each trial. The results are displayed in Fig. 5.13.

Figure 5.13: (d = 6, r_P = 2k, r_lim = 2k + 3, k = 2, 3, 4, n = 12, 20, c_sf = 2, 4, 8, 16, 32, 64) Displayed as 20 shades of blue (black (0) to white (all 20)) are the numbers of successful reconstructions for the rank adaption test tensor with rank r = (k, k, k, 1, 2k) for ALS and Salsa.

With increasing maximal rank k (and sampling size m, since r_P = 2k), the difference between Salsa and plain ALS becomes more noticeable, as rank adaption becomes much harder for larger k.


Chapter 6

Approximate Interpolation of High-Dimensional, Scattered Data

We have so far only considered situations in which a discrete tensor is to be recovered, where there may not be a topological interpretation behind the data. In many applications however (as in Section 6.3), in particular those related to physics, we are handed measurements that stem from variations of continuous parameters, which are most likely not contained in a regular grid. Mathematically speaking, one desires the recovery of multivariate functions

u = u(α) ∈ Hα ⊂ L2(Ωα), Ωα := Ωα1 × . . .× Ωαd ,

for bounded intervals Ω_{α_µ} ⊂ R, µ = 1, . . . , d, and m ∈ N measurements y = y(ζ) = L(u) ∈ H_ζ := R^m. In particular, we discuss the situation in which L is a sampling operator (cf. Section 4.5.2) and the sought solution is continuous, H_α ⊂ C⁰(Ω_α).

One approach, followed in [44], is to find the best least squares fit to the sampling points within a finite-dimensional tensor subspace
\[
H^{(h)}_{\boldsymbol\alpha} := H^{(h_1)}_{\alpha_1} \otimes \ldots \otimes H^{(h_d)}_{\alpha_d} \subset H_{\boldsymbol\alpha}, \qquad H^{(h_\mu)}_{\alpha_\mu} \subset L^2(\Omega_{\alpha_\mu}) \cap C^0(\Omega_{\alpha_\mu}),
\]
for µ = 1, . . . , d, subject to low-rank constraints. Thereby, the problem setting becomes the same as in Section 4.5. One drawback however is that at least d discrete parameters h_µ = dim(H^{(h_µ)}_{α_µ}) ∈ N, µ = 1, . . . , d, have to be determined. This is because the quality of approximation, without further regularization, depends strictly on the suitability and in particular the dimensions of these finite subspaces.

In [39] it is shown how to efficiently compute a high-dimensional, tensorized Chebyshev interpolation. This method however requires access to the function values of u on a subset of points within a specific grid, which might not be available. In that sense, it belongs to the same class of algorithms as cross approximation.

We first discuss the given completion problem within the infinite-dimensional setting, for H_α ⊂ W^{2,2}(Ω_α) (cf. Section 6.2), where we assume both low rank and smoothness of the solution. The second order Sobolev space W^{2,2}(S) ⊂ C¹(S) ⊂ L²(S) ⊂ R^S, S ⊂ R, is the Hilbert space of twice weakly differentiable, square-integrable functions.


6.1 Thin-Plate Splines

We consider a particular form of smoothness related to so-called thin-plate splines. Each tensor f ∈ H_α is treated as a function depending on the variables α_1, . . . , α_d (cf. Section 2.4). We define
\[
R_{\mathrm{TP}} : H_{\boldsymbol\alpha} \to \mathbb{R}_{\ge 0}, \qquad R_{\mathrm{TP}}(f)^2 := \sum_{\mu,\nu=1}^{d} \int_{\Omega_{\boldsymbol\alpha}} \Bigl(\frac{\partial^2}{\partial\alpha_\mu\,\partial\alpha_\nu} f\Bigr)^{\!2}. \tag{6.1}
\]

This map is invariant under rotation, and denoted as the thin-plate regularizer. One of the earliest works considering an approximation with respect to such is [25]. We no longer assume that the sampling P = {p_1, . . . , p_m} ⊂ Ω_α is contained in a regular grid as previously for discrete tensors, but that it is (in principle) arbitrarily scattered across the domain. The sampling operator L := (·) ↦ (·)|_P evaluates a function on these points,
\[
(f|_P) \in \mathbb{R}^m, \qquad (f|_P)_i := f(p_i) = f(\boldsymbol\alpha = p_i).
\]

For dimensions d > 3, this operator is however not well defined, since then H_α ⊄ C⁰(Ω_α). In the other cases, d ≤ 3, if Ω_{α_µ} = R, µ = 1, . . . , d, the minimizer
\[
f^* = \operatorname*{argmin}_{f\in H_{\boldsymbol\alpha}} \sum_{i=1}^{m} (f(p_i) - y_i)^2 + \lambda^2 \cdot R_{\mathrm{TP}}(f)^2,
\]
for fixed λ > 0, can in fact be computed by solving only one linear system (as for example described in [26]). We are however particularly interested in the higher-dimensional cases, and in bounded intervals Ω_{α_µ} ⊊ R, µ = 1, . . . , d. The operator R_TP is further discussed in Section 6.1.2.
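For orientation, a sketch of the classical d = 2 construction (Python) is given below. It uses the well-known thin-plate kernel U(r) = r² log r and solves a single dense block system; note that the exact way the smoothing weight enters the diagonal depends on the kernel normalization and is only indicative here, and this is not the construction of the cited reference but the standard radial-basis formulation.

```python
import numpy as np

def tps_fit_2d(P, y, lam=1e-2):
    """Smoothing thin-plate spline fit in 2D (illustrative sketch).

    P   : (m, 2) array of scattered sample points
    y   : (m,)  measurements
    lam : smoothing weight (scaling convention not reproduced exactly)
    """
    m = P.shape[0]
    r2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        K = np.where(r2 > 0, 0.5 * r2 * np.log(r2), 0.0)   # r^2 log r
    Q = np.hstack([np.ones((m, 1)), P])                    # affine part
    A = np.block([[K + lam * np.eye(m), Q],
                  [Q.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(3)]))
    w, a = sol[:m], sol[m:]

    def f(x):
        d2 = np.sum((P - x) ** 2, axis=-1)
        u = np.where(d2 > 0, 0.5 * d2 * np.log(d2), 0.0)
        return u @ w + a[0] + a[1] * x[0] + a[2] * x[1]
    return f

# usage: f = tps_fit_2d(P, y); value = f(np.array([0.3, -0.1]))
```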

6.1.1 Approximation under Low-Rank Constraints

We only briefly outline the properties of the given problem regarding the involved norms, uniqueness and low-rank constraints. Further details on the (tensor) singular value decomposition in infinite-dimensional spaces, in particular Sobolev spaces, can be found in [1, 96] and of course [46].

As for discrete tensors, we assume that the solution u is (approximately) of low rank,
\[
u \in T_{r,K}(H_{\boldsymbol\alpha}) \subset W := H_{\alpha_1} \otimes_a \ldots \otimes_a H_{\alpha_d}, \qquad H_{\alpha_\mu} := W^{2,2}(\Omega_{\alpha_\mu}), \quad \mu = 1,\ldots,d,
\]
for a family K which yields a tree G = (V, E, L) with singleton legs {α_j} = m(v_j^{(outer)}) and a rank r ∈ N^K (cf. Section 3.3.1). Similar to Section 4.5, the problem setting changes to the task to find
\[
f^* = \operatorname*{argmin}_{f\in H_{\boldsymbol\alpha}} \sum_{i=1}^{m} (f(p_i) - y_i)^2 + \lambda^2 \cdot R_{\mathrm{TP}}(f)^2, \quad \text{subject to } \operatorname{rank}_{\alpha_J}(f) \le r(J),\ J \in K. \tag{6.2}
\]
The sampling points are well defined values since
\[
\bigcup_{r\in\mathbb{N}^K} T_{r,K}(H_{\boldsymbol\alpha}) = W \subset C^0(\Omega_{\boldsymbol\alpha}).
\]
In addition, the sampling operator L : H_α → R^m is a linear and continuous map. The induced norm ‖·‖_α on W is however stronger than the following norm,
\[
\|\cdot\|_{2,2} := \|\cdot\|_{W^{2,2}(\Omega_{\boldsymbol\alpha})} = \|\cdot\|_{L^2} + |\cdot|_{W^{1,2}} + |\cdot|_{W^{2,2}}.
\]


As, furthermore, the space W is dense in W^{2,2}(Ω_α) (with respect to ‖·‖_{2,2}), we have
\[
\overline{W}^{\|\cdot\|_{\boldsymbol\alpha}} \subsetneq \overline{W}^{\|\cdot\|_{2,2}} = W^{2,2}(\Omega_{\boldsymbol\alpha}).
\]
Thus, for any function g ∈ W^{2,2}(Ω_α) and ε > 0, there is a rank r ∈ N^K and a function f ∈ T_{r,K}(H_α) such that ‖f − g‖_{2,2} < ε. Even though W is not closed under the norm ‖·‖_{2,2}, each space T_{r,K}(H_α) is for fixed r ∈ N^K. Furthermore, the function R_TP(·) is a norm on the quotient space of functions that have constant difference. Additionally, on that space, R_TP(·) is equivalent to the norm ‖·‖_{2,2}.

The problem setting Eq. (6.2) is thereby well-defined, and a minimizer f^* indeed exists, although its uniqueness cannot be guaranteed based on the previous considerations, and neither can the question be answered whether f^* = u.

6.1.2 Decomposition of the Thin-Plate Regularizer

The regularizer can be restated as a symmetric, continuous, bilinear form B ∈ bil⁰(H_α, H_α),
\[
R_{\mathrm{TP}}(f)^2 = B(f, f), \qquad B : H_{\boldsymbol\alpha} \times H_{\boldsymbol\alpha} \to \mathbb{R}, \tag{6.3}
\]
\[
B(h, g) := \sum_{\mu,\nu=1}^{d} \int_{\Omega_{\boldsymbol\alpha}} \frac{\partial^2}{\partial\alpha_\mu\,\partial\alpha_\nu} h \,\cdot\, \frac{\partial^2}{\partial\alpha_\mu\,\partial\alpha_\nu} g,
\]

which is uniquely extendable to the domain W^{2,2}(Ω_α) × W^{2,2}(Ω_α) (cf. Lemma 2.3). We define three bilinear forms I_{α_µ}, d_{α_µ}, d^{(2)}_{α_µ} ∈ bil⁰(H_{α_µ}, H_{α_µ}) for each univariate space H_{α_µ} = W^{2,2}(Ω_{α_µ}), µ = 1, . . . , d,
\[
I_{\alpha_\mu}(h, g) := \langle g, h\rangle_{L^2} = \int_{\Omega_{\alpha_\mu}} g\cdot h, \qquad
d_{\alpha_\mu}(h, g) := \langle g', h'\rangle_{L^2} = \int_{\Omega_{\alpha_\mu}} g'\cdot h', \qquad
d^{(2)}_{\alpha_\mu}(h, g) := \langle g'', h''\rangle_{L^2} = \int_{\Omega_{\alpha_\mu}} g''\cdot h''.
\]

The bilinear form B is an element of an algebraic tensor product space itself,
\[
B \in \operatorname{Bil}^0(H_{\alpha_1}, H_{\alpha_1}) \otimes_a \ldots \otimes_a \operatorname{Bil}^0(H_{\alpha_d}, H_{\alpha_d}),
\]
since it can be written as the finite sum of elementary products
\[
B = d^{(2)}_{\alpha_1} \otimes I_{\alpha_2} \otimes \ldots \otimes I_{\alpha_d} + d^{(1)}_{\alpha_1} \otimes d^{(1)}_{\alpha_2} \otimes \ldots \otimes I_{\alpha_d} + \ldots + I_{\alpha_1} \otimes \ldots \otimes I_{\alpha_{d-1}} \otimes d^{(2)}_{\alpha_d}.
\]

With the notation discussed in Section 2.8.1, B may also be decomposed into a tensor network B = {B_v}_{v∈V}. In particular, its tensor train decomposition structure is as follows:
\[
B = B(\boldsymbol\alpha, \boldsymbol\alpha) = B_1 \circ \ldots \circ B_d, \tag{6.4}
\]
\[
B_1 = B_1(\alpha_1, \alpha_1, \varepsilon_1) \in \operatorname{Bil}^0(H_{\alpha_1}, H_{\alpha_1}) \otimes_a \mathbb{R}^3, \qquad
B_\mu = B_\mu(\varepsilon_{\mu-1}, \alpha_\mu, \alpha_\mu, \varepsilon_\mu) \in \mathbb{R}^3 \otimes_a \operatorname{Bil}^0(H_{\alpha_\mu}, H_{\alpha_\mu}) \otimes_a \mathbb{R}^3,
\]
\[
B_d = B_d(\varepsilon_{d-1}, \alpha_d, \alpha_d) \in \mathbb{R}^3 \otimes_a \operatorname{Bil}^0(H_{\alpha_d}, H_{\alpha_d}).
\]


The entries of the nodes B_1, . . . , B_d are (under slight abuse of notation for better readability) given by
\[
B_1(\varepsilon_1 \in \{1,2,3\}) = \begin{pmatrix} I_{\alpha_1} & \sqrt{2}\,d_{\alpha_1} & d^{(2)}_{\alpha_1} \end{pmatrix},
\]
\[
B_\mu\bigl((\varepsilon_{\mu-1},\varepsilon_\mu) \in \{1,2,3\}\times\{1,2,3\}\bigr) =
\begin{pmatrix} I_{\alpha_\mu} & \sqrt{2}\,d_{\alpha_\mu} & d^{(2)}_{\alpha_\mu} \\ 0 & I_{\alpha_\mu} & \sqrt{2}\,d_{\alpha_\mu} \\ 0 & 0 & I_{\alpha_\mu} \end{pmatrix}, \qquad
B_d(\varepsilon_{d-1} \in \{1,2,3\}) = \begin{pmatrix} d^{(2)}_{\alpha_d} \\ \sqrt{2}\,d_{\alpha_d} \\ I_{\alpha_d} \end{pmatrix}.
\]

This decomposition is very similar to the one of the multidimensional Laplace operator
\[
\Delta_d = \Delta_1 \otimes \ldots \otimes \operatorname{id} + \ldots + \operatorname{id} \otimes \ldots \otimes \Delta_1 \in \hom^0(H_{\boldsymbol\alpha}, L^2(\Omega_{\boldsymbol\alpha})),
\]
where for simplicity Δ_s denotes the Laplacian on any tensor product space of order s. Its tensor train decomposition is given as
\[
\Delta_d = \varphi_1 \circ \ldots \circ \varphi_d, \qquad \varphi_1 = \varphi_1(\alpha_1, \alpha_1, \varepsilon_1) \in \hom^0(H_{\alpha_1}, L^2(\Omega_{\alpha_1})) \otimes_a \mathbb{R}^2,
\]
\[
\varphi_\mu = \varphi_\mu(\varepsilon_{\mu-1}, \alpha_\mu, \alpha_\mu, \varepsilon_\mu) \in \mathbb{R}^2 \otimes_a \hom^0(H_{\alpha_\mu}, L^2(\Omega_{\alpha_\mu})) \otimes_a \mathbb{R}^2, \qquad
\varphi_d = \varphi_d(\varepsilon_{d-1}, \alpha_d, \alpha_d) \in \mathbb{R}^2 \otimes_a \hom^0(H_{\alpha_d}, L^2(\Omega_{\alpha_d})),
\]
for
\[
\varphi_1(\varepsilon_1 \in \{1,2\}) = \begin{pmatrix} \operatorname{id} & \Delta_1 \end{pmatrix}, \qquad
\varphi_\mu\bigl((\varepsilon_{\mu-1},\varepsilon_\mu) \in \{1,2\}\times\{1,2\}\bigr) = \begin{pmatrix} \operatorname{id} & \Delta_1 \\ 0 & \operatorname{id} \end{pmatrix}, \qquad
\varphi_d(\varepsilon_{d-1} \in \{1,2\}) = \begin{pmatrix} \Delta_1 \\ \operatorname{id} \end{pmatrix}.
\]
Just like the TT-ranks of the Laplacian are hence all 2, the TT-ranks of B equal 3, i.e. rank_{α_J ⊎ α_J}(B) = 3 for J = {1, . . . , µ}, µ = 1, . . . , d − 1. It follows that this in fact holds for all sets:

Theorem 6.1 (Ranks of the thin-plate regularizer). Let B = B(α, α) be the thin-plate regularizer as defined by Eq. (6.3). It holds that
\[
\operatorname{rank}_{\alpha_J \uplus \alpha_J}(B) = 3,
\]
for all non-empty subsets ∅ ≠ J ⊊ {1, . . . , d} (cf. Definition 2.31).

Proof. Follows directly from the symmetry of the bilinear form B and its rank 3 tensor train decomposition as in Eq. (6.4).

For any tree tensor network corresponding to a hierarchical family K, the exact decomposition can be calculated explicitly. We however skip this part, as it only becomes increasingly technical, and instead refer to the Matlab implementation for further details.
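As a small consistency check of this rank-3 structure, the tensor train (operator) cores of Eq. (6.4) can be assembled from generic discretized Gram matrices D0, D1, D2 (standing in for the discretizations of I_{α_µ}, d_{α_µ}, d^{(2)}_{α_µ}, which are only introduced in Section 6.2.2) and contracted against the dense sum of elementary products for small d. The following Python sketch uses placeholder matrices and is not the thesis' implementation:

```python
import numpy as np

def tt_cores_thinplate(D0, D1, D2, d):
    """Rank-3 TT operator cores mirroring the block structure of Eq. (6.4)."""
    h = D0.shape[0]
    first = np.stack([D0, np.sqrt(2) * D1, D2], axis=-1)        # shape (h,h,3)
    mid = np.zeros((3, h, h, 3))
    mid[0] = np.stack([D0, np.sqrt(2) * D1, D2], axis=-1)
    mid[1, :, :, 1:] = np.stack([D0, np.sqrt(2) * D1], axis=-1)
    mid[2, :, :, 2] = D0
    last = np.stack([D2, np.sqrt(2) * D1, D0], axis=0)          # shape (3,h,h)
    return [first] + [mid] * (d - 2) + [last]

def mpo_to_dense(cores):
    """Contract the rank-3 cores into a dense (h^d x h^d) matrix."""
    acc = [cores[0][:, :, a] for a in range(3)]
    for C in cores[1:-1]:
        acc = [sum(np.kron(acc[a], C[a, :, :, b]) for a in range(3))
               for b in range(3)]
    return sum(np.kron(acc[a], cores[-1][a]) for a in range(3))

def dense_from_sum(D0, D1, D2, d):
    """Direct sum over all second-order terms: D2 once, or D1 at two modes."""
    B = 0
    for mu in range(d):
        for nu in range(d):
            facs = [D0] * d
            if mu == nu:
                facs[mu] = D2
            else:
                facs[mu], facs[nu] = D1, D1
            term = facs[0]
            for f in facs[1:]:
                term = np.kron(term, f)
            B = B + term
    return B

rng = np.random.default_rng(0)
h, d = 3, 3
D0, D1, D2 = (rng.standard_normal((h, h)) for _ in range(3))
cores = tt_cores_thinplate(D0, D1, D2, d)
print(np.allclose(mpo_to_dense(cores), dense_from_sum(D0, D1, D2, d)))  # True
```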


6.2 Discretization and Alternating Least Squares

We can rephrase the problem setting using the bilinear operator B,
\[
\text{find} \quad \operatorname*{argmin}_{f\in H_{\boldsymbol\alpha}} \sum_{i=1}^{m} (f(p_i) - y_i)^2 + \lambda^2 \cdot B(f, f), \quad \text{subject to } \operatorname{rank}_{\alpha_J}(f) \le r(J),\ J \in K.
\]
In theory, one may solve this problem using an infinite-dimensional version of the algorithmic process discussed in Chapter 4 (where, instead of f, the unknown is named N), based on a tree decomposition for a graph G = (V, E) corresponding to K,
\[
f = \bigcirc_{v\in V} f_v.
\]
In particular, Eq. (4.16) shows how to handle the regularizer part B(·, ·) (therein denoted as A), while Section 4.5.2 describes how to fit the function to the sampling data. Certainly, this is not possible in practice, so some form of discretization is required. In order to utilize the decomposition of f, we have to discretize each univariate space H_{α_µ} = W^{2,2}(Ω_{α_µ}), µ = 1, . . . , d. While Legendre polynomials are a proper choice, we use subspaces which are specifically adapted to a regularization of the second derivative.

6.2.1 Monovariate Kolmogorov Subspaces

As the bilinear form B penalizes high second order derivatives, reasonable choices are the univariate subspaces that, for each fixed µ = 1, . . . , d, realize the Kolmogorov widths of
\[
\{g \in W^{2,2}(\Omega_{\alpha_\mu}) \mid \|g''\|_{L^2} \le 1\},
\]
as carried out in [67]. We derive a basis for such in a slightly different way and then prove in Corollary 6.4 that these span the subspaces which we seek.

Definition 6.2 (Kolmogorov subspaces). Let S = [−1, 1] ⊂ R. We successively define the sequence f_k ∈ W^{2,2}(S), k ∈ N, by
\[
\|f_k''\|_{L^2(S)} = \inf_{g\in W^{2,2}(S)} \|g''\|_{L^2(S)} \quad \text{subject to: } g \perp \operatorname{span}(f_1,\ldots,f_{k-1}),\ \|g\|_{L^2(S)} = 1,
\]
for the standard L² scalar product, such that
\[
\int_S f_i f_j = \delta_{ij} := \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{otherwise.} \end{cases}
\]
We denote {f_k}_{k=1}^{K} as K, the Kolmogorov basis of (by context given) dimension K.

The functions f_1 and f_2 are not unique, as they span the set of affine functions. With regard to all subsequent ones, we define these as the first symmetric and skew-symmetric Legendre functions,
\[
f_1(x) = \frac{1}{\sqrt{2}}, \qquad f_2(x) = \sqrt{\tfrac{3}{2}}\, x.
\]

All subsequent functions are unique, and solutions to a simple differential equation (cf. Eq. (6.7)). Explicitly, they are sums of trigonometric and hyperbolic functions:


Theorem 6.3. The K functions, for k > 2, are given (in correct order and normalized) by
\[
\sqrt{1 + \tfrac{\cos(\nu_{1+2i})^2}{\cosh(\nu_{1+2i})^2}}\; f_{1+2i}(x) = \cos(\nu_{1+2i}\, x) + \tfrac{\cos(\nu_{1+2i})}{\cosh(\nu_{1+2i})}\cosh(\nu_{1+2i}\, x),
\]
\[
\sqrt{1 - \tfrac{\sin(\nu_{2+2i})^2}{\sinh(\nu_{2+2i})^2}}\; f_{2+2i}(x) = \sin(\nu_{2+2i}\, x) + \tfrac{\sin(\nu_{2+2i})}{\sinh(\nu_{2+2i})}\sinh(\nu_{2+2i}\, x),
\]
for i ∈ N. Each value ν_{1+2i} and ν_{2+2i} is the i-th lowest (positive) root of
\[
t_c := \cos(\nu)\sinh(\nu) + \cosh(\nu)\sin(\nu), \qquad t_s := \cos(\nu)\sinh(\nu) - \cosh(\nu)\sin(\nu),
\]
respectively. Furthermore, the second derivatives are orthogonal as well,
\[
\int_{-1}^{1} f_i'' f_j'' = \delta_{ij}\,\nu_i^4,
\]
for δ_{ij} as defined above.

The K functions are hence infinitely smooth, i.e. f_k ∈ C^∞(S) (cf. Fig. 6.1), and the second derivatives are orthogonal, but not orthonormal.

Figure 6.1: Displayed are the symmetric functions f_3, f_5, f_7 (left) and the skew-symmetric functions f_4, f_6, f_8 (right) of the K basis.

Proof. The initial part of the proof is based on [59], which suggests to apply Lagrange multipliers and partial integration. So in order to obtain f_k, k ≥ 3, we search for a stationary point of the Lagrangian
\[
\mathcal{L}^{(k)}(g,\lambda^{(k)},\mu^{(k)}) = \int_{-1}^{1} |g''|^2 + \sum_{i=1}^{k-1} \lambda^{(k)}_i \int_{-1}^{1} g f_i + \mu^{(k)}\Bigl(\int_{-1}^{1} |g|^2 - 1\Bigr),
\]
for multipliers λ^{(k)}_1, . . . , λ^{(k)}_{k−1} and µ^{(k)}. The derivatives with respect to λ_1, . . . , λ_{k−1} give the orthogonality conditions, whereas µ ensures the normalization. Variation of the function g yields the optimality condition
\[
2\int_{-1}^{1} g'' h'' + \sum_{i=1}^{k-1} \lambda^{(k)}_i \int_{-1}^{1} h f_i + 2\mu^{(k)} \int_{-1}^{1} g h = 0
\]


for all h ∈ C^∞. Using partial integration twice¹, we obtain
\[
2\int_{-1}^{1} g^{(4)} h + 2\, g^{(2)} h'\big|_{-1}^{1} - 2\, g^{(3)} h\big|_{-1}^{1} + \sum_{i=1}^{k-1} \lambda^{(k)}_i \int_{-1}^{1} h f_i + 2\mu^{(k)} \int_{-1}^{1} g h = 0.
\]
The variational principle now yields the pointwise identity
\[
\mathcal{P}^{(k)}(g) := 2 g^{(4)} + \sum_{i=1}^{k-1} \lambda^{(k)}_i f_i + 2\mu^{(k)} g = 0, \tag{6.5}
\]
as well as the boundary conditions
\[
g^{(2)}(\pm 1) = 0, \qquad g^{(3)}(\pm 1) = 0.
\]

We now prove by induction that λ^{(k)}_i ≡ 0, i = 1, . . . , k − 1, for k ≥ 3. For j < k, by the orthonormality conditions, we can conclude
\[
0 = \int_{-1}^{1} \mathcal{P}^{(k)}(g)\, f_j = 2\int_{-1}^{1} g^{(4)} f_j + \sum_{i=1}^{k-1} \lambda^{(k)}_i \int_{-1}^{1} f_i f_j + 2\mu^{(k)}\int_{-1}^{1} g f_j = 2\int_{-1}^{1} g^{(4)} f_j + \lambda^{(k)}_j.
\]
Using the boundary conditions for both f and g, this is equivalent to
\[
0 = \int_{-1}^{1} g\, f_j^{(4)} + \lambda^{(k)}_j. \tag{6.6}
\]
In the case of k = 3, both f_1^{(4)} ≡ f_2^{(4)} ≡ 0. Hence, λ^{(3)}_j = 0, j = 1, 2, and Eq. (6.5) simplifies to
\[
g^{(4)} + \mu^{(k)} g = 0.
\]
As further induction hypothesis, we now assume that f_j^{(4)} = −µ^{(j)} f_j, j = 1, . . . , k − 1. Then, again by the orthogonality constraints on g and f_j, Eq. (6.6) simplifies to
\[
0 = \int_{-1}^{1} g\, f_j^{(4)} + \lambda^{(k)}_j = -\mu^{(j)} \int_{-1}^{1} g f_j + \lambda^{(k)}_j = \lambda^{(k)}_j.
\]
We can hence conclude that λ^{(k)}_j = 0 for all j = 1, . . . , k − 1 and all k ≥ 3. Thereby, for such k ≥ 3, f_k is the one normalized solution of the simple, linear differential equation
\[
f_k^{(4)} + \mu^{(k)} f_k = 0, \qquad f_k^{(2)}(\pm 1) = 0, \qquad f_k^{(3)}(\pm 1) = 0. \tag{6.7}
\]
This finishes the induction. Searching for its solutions, f_k can be assumed to be either symmetric or skew-symmetric. If neither were the case, then due to the symmetry of the problem, g(x) := f_k(−x) fulfills the same differential equation (for the same µ^{(k)}). Hence, g + f_k and g − f_k are symmetric and skew-symmetric, respectively, fulfill the differential equation and span the same space as g and f_k. For simplicity, we substitute ν = (−µ^{(k)})^{1/4}. The unscaled symmetric and skew-symmetric solutions are given by
\[
\cos(\nu x) + a \cosh(\nu x), \qquad \sin(\nu x) + b \sinh(\nu x),
\]

¹ Strictly speaking, at this point we need to assume f to be sufficiently smooth. Corollary 6.4 however subsequently confirms that the functions f_i, i ∈ N, are those which we seek.


respectively. The boundary conditions yield the matrix equations
\[
\begin{pmatrix} -\cos(\nu) & \cosh(\nu) \\ \sin(\nu) & \sinh(\nu) \end{pmatrix}\begin{pmatrix} 1 \\ a \end{pmatrix} = 0, \qquad
\begin{pmatrix} -\sin(\nu) & \sinh(\nu) \\ -\cos(\nu) & \cosh(\nu) \end{pmatrix}\begin{pmatrix} 1 \\ b \end{pmatrix} = 0.
\]
The points at which either matrix becomes singular are the distinct roots r_c^{(i)} and r_s^{(i)} of
\[
t_c := \cos(\nu)\sinh(\nu) + \cosh(\nu)\sin(\nu), \qquad t_s := \cos(\nu)\sinh(\nu) - \cosh(\nu)\sin(\nu),
\]
respectively, i ∈ N (both have infinitely many (single) roots). We therefore define ν_1 = ν_2 = 0, ν_{1+2i} = r_c^{(i)} and ν_{2+2i} = r_s^{(i)}. After some calculus, we obtain
\[
\sqrt{1 + \tfrac{\cos(\nu_{1+2i})^2}{\cosh(\nu_{1+2i})^2}}\, f_{1+2i} = \cos(\nu_{1+2i}x) + \tfrac{\cos(\nu_{1+2i})}{\cosh(\nu_{1+2i})}\cosh(\nu_{1+2i}x),
\]
\[
\sqrt{1 - \tfrac{\sin(\nu_{2+2i})^2}{\sinh(\nu_{2+2i})^2}}\, f_{2+2i} = \sin(\nu_{2+2i}x) + \tfrac{\sin(\nu_{2+2i})}{\sinh(\nu_{2+2i})}\sinh(\nu_{2+2i}x), \qquad i \in \mathbb{N}.
\]

The orthogonality conditions are easy to verify, considering
\[
\int_{-1}^{1} f_i f_j = \frac{1}{\nu_i^4}\int_{-1}^{1} f_i^{(4)} f_j = \frac{1}{\nu_i^4}\int_{-1}^{1} f_i'' f_j'' = \frac{1}{\nu_i^4}\int_{-1}^{1} f_i f_j^{(4)} = \frac{\nu_j^4}{\nu_i^4}\int_{-1}^{1} f_i f_j,
\]
since either ν_j < ν_i or ν_j = 0, j = 1, 2, for j < i. Furthermore, this yields
\[
\int_{-1}^{1} f_i'' f_j'' = \delta_{ij}\,\nu_i^4. \tag{6.8}
\]
The functions f_k are hence also in correct order.

The roots ν_{1+2i} and ν_{2+2i} can be approximated by
\[
-\pi/4 + i\pi \quad \text{and} \quad \pi/4 + i\pi, \tag{6.9}
\]
respectively, since for k > 7 these equal the true i-th roots up to 20 decimal places. Earlier values are easily found numerically, and can be saved once computed. For all j ∈ N, it further holds ν_{j+1} > (π/2)(j − 1).
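The roots and basis functions of Theorem 6.3 are straightforward to tabulate numerically; the following sketch (Python, using scipy.optimize.brentq with Eq. (6.9) as bracket guess, and normalizing numerically rather than via the closed-form factor) illustrates this:

```python
import numpy as np
from scipy.optimize import brentq

t_c = lambda v: np.cos(v) * np.sinh(v) + np.cosh(v) * np.sin(v)
t_s = lambda v: np.cos(v) * np.sinh(v) - np.cosh(v) * np.sin(v)

def kolmogorov_roots(K):
    """nu_1, ..., nu_K with nu_1 = nu_2 = 0, bracketed around Eq. (6.9)."""
    nu, i = [0.0, 0.0], 1
    while len(nu) < K:
        gc = -np.pi / 4 + i * np.pi            # root of t_c -> f_{1+2i}
        gs = np.pi / 4 + i * np.pi             # root of t_s -> f_{2+2i}
        nu.append(brentq(t_c, gc - 0.5, gc + 0.5))
        nu.append(brentq(t_s, gs - 0.5, gs + 0.5))
        i += 1
    return np.array(nu[:K])

def kolmogorov_basis(K, x):
    """Evaluate f_1, ..., f_K on points x in [-1, 1] (rows = functions)."""
    nu = kolmogorov_roots(K)
    F = [np.full_like(x, 1 / np.sqrt(2)), np.sqrt(3 / 2) * x]
    for k in range(2, K):
        v = nu[k]
        if k % 2 == 0:      # f_{1+2i}: symmetric
            f = np.cos(v * x) + np.cos(v) / np.cosh(v) * np.cosh(v * x)
        else:               # f_{2+2i}: skew-symmetric
            f = np.sin(v * x) + np.sin(v) / np.sinh(v) * np.sinh(v * x)
        f = f / np.sqrt(np.trapz(f * f, x))    # numerical L2 normalization
        F.append(f)
    return np.array(F)

x = np.linspace(-1, 1, 2001)
F = kolmogorov_basis(8, x)
print(np.round(np.trapz(F[:, None, :] * F[None, :, :], x), 3))  # ~ identity
```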

Corollary 6.4 (Optimality). Let g ∈ W^{2,2}(S) and let {f_n}_{n∈N} be the K functions. We define the Fourier coefficients a_n = ⟨g, f_n⟩_{L²}. Then
\[
\|g - g_k\|_{2,2} \xrightarrow[k\to\infty]{} 0, \qquad \text{for } g_k := \sum_{n=1}^{k} a_n f_n.
\]
Furthermore,
\[
\|g - g_k\|_{L^2} \le \frac{1}{\nu_{k+1}^2}\,\|g''\|_{L^2}. \tag{6.10}
\]
Each constant c_k := 1/ν²_{k+1} equals the Kolmogorov n-width of the space {f ∈ W^{2,2}(S) | ‖f''‖_{L²} ≤ 1}, such that the K basis is, in that sense, optimal.

Note that this also implies uniform, pointwise convergence.


Proof. Because of the orthogonality condition Eq. (6.8), {ν_n^{-2} f_n''}_{n∈N} is a set of orthonormal functions. Due to the differential equation Eq. (6.7), which each f_n fulfills, we have
\[
\langle g'', \tfrac{1}{\nu_n^2} f_n'' \rangle_{L^2} = \langle g, \tfrac{1}{\nu_n^2} f_n^{(4)} \rangle_{L^2} = \langle g, \nu_n^2 f_n \rangle_{L^2} = \nu_n^2\, a_n. \tag{6.11}
\]
It thereby follows that
\[
g_k'' = \sum_{n=1}^{k} a_n f_n'' = \sum_{n=1}^{k} \langle g'', \tfrac{1}{\nu_n^2} f_n'' \rangle_{L^2}\; \tfrac{1}{\nu_n^2} f_n''.
\]
So ν_n² a_n = ⟨g'', ν_n^{-2} f_n''⟩_{L²} are the Fourier coefficients of g'' with respect to {ν_n^{-2} f_n''}_{n∈N}. Hence, ν_n² a_n → 0 and both g_k and g_k'' converge in L². Let now h := g − g_∞, g_∞ := lim_{k→∞} g_k (in the L² norm). Then ⟨h, f_n⟩_{L²} = 0 for all n ∈ N by construction. Since ν_n > ‖h''‖_{L²}/‖h‖_{L²} for some n ∈ N, the normalized function h/‖h‖_{L²} would, by definition of the K basis, have to be identical to some f_j, j ∈ N, which is a contradiction. Therefore, h ≡ 0 almost everywhere. As shown above, the second derivatives g_k'' converge in L². Hence, the sequence g_k converges to g also in the W^{2,2}(S) norm ‖·‖_{2,2}.

To prove the approximation property Eq. (6.10), we consider that
\[
\|g''\|_{L^2}^2 \ge \sum_{n=k+1}^{\infty} \nu_n^4\, a_n^2 \ge \nu_{k+1}^4 \sum_{n=k+1}^{\infty} a_n^2.
\]
Hence
\[
\|g - g_k\|_{L^2}^2 = \sum_{n=k+1}^{\infty} a_n^2 \le \frac{1}{\nu_{k+1}^4}\,\|g''\|_{L^2}^2.
\]

In order to show the optimality of c_k, let S_k ⊂ W^{2,2}([−1, 1]) be any k-dimensional subspace other than span(f_1, . . . , f_k). Then there is a function g ≠ f_{k+1}, ‖g‖_{L²} = 1, such that
\[
g \in \operatorname{span}(f_1, \ldots, f_{k+1}) \cap S_k^{\perp_{L^2}}.
\]
For every g_k ∈ S_k, as g = Σ_{n=1}^{k+1} a_n f_n ∈ S_k^{⊥_{L²}}, we have
\[
\|g - g_k\|_{L^2}^2 = \|g\|_{L^2}^2 + \|g_k\|_{L^2}^2 \ge 1.
\]
Further, due to Eq. (6.11), it is
\[
\|g''\|_{L^2}^2 = \sum_{n=1}^{k+1} a_n^2\, \nu_n^4 < \nu_{k+1}^4,
\]
since Σ_{n=1}^{k+1} a_n² ≤ 1 and a_{k+1} < 1. Hence,
\[
\|g - g_k\|_{L^2} \ge 1 > \frac{1}{\nu_{k+1}^2}\,\|g''\|_{L^2},
\]
which shows that the subspace S_k is not optimal.


6.2.2 Discretized Operators and Problem Setting

Let
\[
H^{(h_\mu)}_{\alpha_\mu} := \operatorname{span}\bigl(\{f_k\}_{k=1}^{h_\mu}\bigr) \subset W^{2,2}(\Omega_{\alpha_\mu}), \qquad \mu = 1,\ldots,d,
\]
be the span of the Kolmogorov basis K_{h_µ} of dimension h_µ ∈ N. We have only derived these functions for Ω_{α_µ} = [−1, 1], but either the underlying data can be transformed to fit this interval, or the domain of each f_k : [−1, 1] → R, k ∈ N, can be adapted. This discretization yields the tensor product space of finite dimension h_1 · . . . · h_d,
\[
H^{(h)}_{\boldsymbol\alpha} := H^{(h_1)}_{\alpha_1} \otimes \ldots \otimes H^{(h_d)}_{\alpha_d} \subset W.
\]

These spaces are isomorphic to standard Euclidean spaces,
\[
H_{\boldsymbol\alpha^{(h)}} = \mathbb{R}^{h_1\times\ldots\times h_d} \cong H^{(h)}_{\boldsymbol\alpha}, \qquad H_{\alpha^{(h)}_\mu} := \mathbb{R}^{h_\mu} \cong H^{(h_\mu)}_{\alpha_\mu}, \quad \mu = 1,\ldots,d,
\]
for which we introduce the mode label (set) α^{(h)} = {α^{(h)}_1, . . . , α^{(h)}_d}. As indicated in Section 2.8.1, the operator L and the bilinear form B can be restricted to these finite subspaces and their isomorphic Euclidean spaces, while retaining their network structure. Let φ_µ : H_{α^{(h)}_µ} → H^{(h_µ)}_{α_µ}, µ = 1, . . . , d, be isometries with respect to a suitable norm. These induce the isometry, with respect to an induced norm,
\[
\phi : H_{\boldsymbol\alpha^{(h)}} \to H^{(h)}_{\boldsymbol\alpha}, \qquad \phi := \phi_1 \otimes \ldots \otimes \phi_d.
\]

As introduced in Section 2.8.1, we may as well assign mode labels to functions, such as φ_µ = φ_µ(α_µ, α^{(h)}_µ). Let now L = L(ζ, α) = ⊙_{v∈V} L_v be the decomposition of the sampling operator. Then its discretized version
\[
L^{(h)} = L^{(h)}(\zeta,\boldsymbol\alpha^{(h)}) := L \circ \phi : H_{\boldsymbol\alpha^{(h)}} \to \mathbb{R}^m
\]
has an equivalent decomposition, L^{(h)} = ⊙_{v∈V} L^{(h)}_v, for
\[
L^{(h)}_{v^{(\mathrm{outer})}_j} := L_{v^{(\mathrm{outer})}_j} \circ \phi_j, \qquad j = 1,\ldots,d,
\]
and L^{(h)}_w := L_w for w ∉ V^{(outer)}. The operator L is a (continuous) Hilbert–Schmidt operator, and its action may be represented by L ∈ R^m ⊗ H_α. While L^{(h)} can likewise be represented by a tensor node
\[
L^{(h)} = L^{(h)}(\zeta,\boldsymbol\alpha^{(h)}) = \bigcirc_{v\in V} L^{(h)}_v \in \mathbb{R}^{m\times d(\boldsymbol\alpha^{(h)})}, \qquad d(\boldsymbol\alpha^{(h)}) = \prod_{j\in D} h_j,
\]
it depends on the norm assigned to H^{(h)}_α whether its singular values remain bounded for increasing h. The single nodes of this representation (for the sampling operator) are given through simple point evaluations of single, univariate functions, given the unit vectors e_ℓ, ℓ ∈ N:
\[
L^{(h)}_{v^{(\mathrm{outer})}_j}(\zeta = i, \alpha^{(h)}_j = \ell) = \phi_j(e_\ell)\bigl((p_i)_j\bigr) \in \mathbb{R}, \qquad i = 1,\ldots,m,\ \ell = 1,\ldots,h_j,
\]
for sampling points P = {p_1, . . . , p_m} ⊂ Ω_α. Although the network has rank 1, these objects are, in contrast to the situation for discrete tensor completion, not sparse.


The same process can be applied to the bilinear thin-plate form B = B(α, α) in order to obtain a discretized version
\[
B^{(h)} = B^{(h)}(\boldsymbol\alpha^{(h)},\boldsymbol\alpha^{(h)}) := \phi \circ_{\boldsymbol\alpha} B \circ_{\boldsymbol\alpha} \phi = B(\phi(\cdot), \phi(\cdot)) : H_{\boldsymbol\alpha^{(h)}} \times H_{\boldsymbol\alpha^{(h)}} \to \mathbb{R}.
\]
The decomposition B = ⊙_{v∈V} B_v yields an equivalent network B^{(h)} = ⊙_{v∈V} B^{(h)}_v, for
\[
B^{(h)}_{v^{(\mathrm{outer})}_j} := \phi_j \circ_{\alpha_j} B_{v^{(\mathrm{outer})}_j} \circ_{\alpha_j} \phi_j = B_{v^{(\mathrm{outer})}_j}(\phi_j(\cdot), \phi_j(\cdot)), \qquad j = 1,\ldots,d,
\]
and B^{(h)}_w = B_w for w ∉ V^{(outer)}. The discretized, univariate bilinear forms contained in each B^{(h)}_{v^{(outer)}_j}, j = 1, . . . , d, can be represented by matrices D^{(0)}_{h_j}, D_{h_j} and D^{(2)}_{h_j}, such that
\[
I_{\alpha_j}(\phi_j(x), \phi_j(z)) = x^{\mathsf T} D^{(0)}_{h_j} z, \qquad
d_{\alpha_j}(\phi_j(x), \phi_j(z)) = x^{\mathsf T} D_{h_j} z, \qquad
d^{(2)}_{\alpha_j}(\phi_j(x), \phi_j(z)) = x^{\mathsf T} D^{(2)}_{h_j} z,
\]
for all x, z ∈ R^{h_j}. To each of these three types of matrices, we assign the mode labels {α^{(h)}_j, α^{(h)}_j{}^#}. The matrix representation B^{(h)} = B^{(h)}(α^{(h)}, α^{(h)}) ∈ R^{d(α^{(h)})×d(α^{(h)})} of B^{(h)} again has the same decompositions as B and B^{(h)} (cf. Section 6.1.2), but all bilinear forms contained in such need to be replaced by their discretized matrix versions. The ranks of all its matricizations hence remain 3 also in the discretized version.

The matrices D^{(0)}_{h_j}, D_{h_j} and D^{(2)}_{h_j}, j = 1, . . . , d, themselves can be computed fast. Exactly as the scalar products between the single functions f_k, k ∈ N, and their derivatives, they depend on simple integrals of products of trigonometric functions. These in turn can be evaluated once symbolically. However, while B is bounded with respect to ‖·‖_{2,2}, its trace is not finite, so the trace of B^{(h)} is likewise not bounded for increasing h (for reasonable φ). It can even grow rapidly, as d^{(2)}_{α_µ}(f_k, f_k) = ν_k⁴ for the linearly growing ν_k, k ∈ N (cf. Eq. (6.9)), for every µ = 1, . . . , d. This leads to some complications for rank adaption, as we will discuss in Section 6.2.3.

For fixed rank, the discretized problem can be restated as
\[
\text{find} \quad \operatorname*{argmin}_{N \in H_{\boldsymbol\alpha^{(h)}}} \|L^{(h)} \circ N - y\|^2 + \lambda^2 \cdot N \circ B^{(h)} \circ N, \quad \text{subject to } \operatorname{rank}_{\alpha_J}(N) \le r(J),\ J \in K.
\]
This objective function consists of two terms, for both of which we have discussed in Chapter 4 how to apply alternating least squares on tree tensor networks. The solution N_v^+ of each node-wise, linear problem is then given by the normal equation
\[
\bigl(N_{\neq v}' \circ L^{(h)}{}' \circ L^{(h)} \circ N_{\neq v} + N_{\neq v}' \circ B^{(h)} \circ N_{\neq v}\bigr) \circ N_v^+ = N_{\neq v}' \circ L^{(h)}{}' \circ y,
\]
and defines the micro-step M^{(v)}, v ∈ V. Due to the low ranks of the networks L^{(h)} and B^{(h)}, we can apply the procedures discussed in Sections 4.3.2 and 4.5.1, albeit without the simplifications for discrete tensor completion, as the nodes L^{(h)}_v, v ∈ V, are no longer sparse. Note that if we replace L with any continuous, linear operator (with low tensor ranks), the same algorithmic aspects still apply.


6.2.3 Rank Adaption and Practical Aspects

Rank adaption for the discretized problem setting introduced in Section 6.2.2 remains an important aspect. There are however some complications to consider. Let W_J = ∘_{α^{(h)}_J}(U_J, σ^{(J)}, U_J), J ∈ K, be the same² iteratively adapted weight matrices as defined for reweighted least squares matrix recovery, Eq. (5.33), for some family K. The same approach as for the unscaled regularization discussed in Section 5.5 then leads to the problem setting

\[
\text{find} \quad \operatorname*{argmin}_{N\in H_{\boldsymbol\alpha^{(h)}}} \|L^{(h)} \circ N - y\|^2 + \lambda^2 \cdot N \circ B^{(h)} \circ N + c\cdot\omega^2 \sum_{J\in K} \|W_J^{1-p/2} \circ N\|_F^2 \quad \text{subject to } \operatorname{rank}_{\alpha_J}(N) \le r(J),\ J \in K,
\]
where we also introduced an additional, to be chosen scaling constant c > 0. In the micro-step that updates node v ∈ V, the family K is however restricted to those sets corresponding to neighboring singular values, as in Section 5.5.1, and we again denote this version of Salsa as Rwals (reweighted alternating least squares).

For the isometry φ = φ(α, α^{(h)}) as in Section 6.2.2, we have that
\[
\|W_J^{1-p/2} \circ N\|_F = \|\mathcal{W}_J^{1-p/2} \circ \phi(N)\|_{L^2},
\]
for $\mathcal{W}_J := \circ_{\alpha_J}(\mathcal{U}_J, \sigma^{(J)}, \mathcal{U}_J)$, where $\mathcal{U}_J = \mathcal{U}_J(\alpha_J, \gamma)$ corresponds to the left singular functions of φ(N). In particular,
\[
\mathcal{U}_J(\gamma = i) = \phi_J \circ U_J(\gamma = i), \qquad \phi_J = \bigcirc_{j\in J} \phi_j,
\]
while the singular values σ^{(J)} remain identical. The scaling c is chosen as
\[
c := \frac{m}{|\Omega_{\boldsymbol\alpha}|}, \qquad |\Omega_{\boldsymbol\alpha}| = \int_{\Omega_{\boldsymbol\alpha}} 1. \tag{6.12}
\]
This constant adapts the two differently scaled terms ‖L^{(h)} ∘ N‖ and ‖N‖_F = ‖φ(N)‖_{L²} to each other. A more theoretical justification is given by the following lemma.

Lemma 6.5 (Uniform scaling constant). Let H^{(h)}_α be a finite-dimensional space isometric to H_{α^{(h)}} = R^{h_1×...×h_d} as in Section 6.2.2. Further, let L^{(h)} : R^{h_1×...×h_d} → R^m be the discretized sampling operator represented by the tensor node L^{(h)}. We have that
\[
\mathbb{E}_{f\in H^{(h)}_{\boldsymbol\alpha}:\ \|f\|_{L^2}=1}\Bigl(\frac{1}{m}\sum_{i=1}^{m} f(p_i)^2\Bigr) \;=\; \frac{\|L^{(h)}\|_F^2}{m\, h_D} \;\xrightarrow[m\to\infty]{}\; \frac{1}{|\Omega_{\boldsymbol\alpha}|},
\]
where h_D = h_1 · . . . · h_d.

We recall that for L^{(h)} = L^{(h)}(ζ, α^{(h)}), the terms
\[
\|L^{(h)}\|_F^2 = L^{(h)} \circ_{\boldsymbol\alpha^{(h)},\zeta} L^{(h)} = \operatorname{trace}_{\boldsymbol\alpha^{(h)}}\bigl(L^{(h)} \circ_\zeta L^{(h)}\bigr)
\]
can be easily calculated. For operators L other than the sampling operator L(f)_i := f(p_i), i = 1, . . . , m, it still holds true that
\[
\mathbb{E}_{f\in H^{(h)}_{\boldsymbol\alpha}:\ \|f\|_{L^2}=1}\bigl(\|L(f)\|_{L^2}^2\bigr) = \frac{\|L^{(h)}\|_F^2}{h_D},
\]
so c := ‖L^{(h)}‖_F²/h_D is a reasonable choice in such cases.

² Now written as a tensor node.


Proof. Interpreting f = f(α), the sum over the sampling point values can be rewritten as
\[
\sum_{i=1}^{m} f(p_i)^2 = \|L^{(h)} \circ \phi^{-1}(f)\|_F^2 = \phi^{-1}(f) \circ \bigl(L^{(h)} \circ_\zeta L^{(h)}\bigr) \circ \phi^{-1}(f),
\]
for an L²-isometry φ : H_{α^{(h)}} → H^{(h)}_α as in Section 6.2.2. Since further
\[
\{f \in H^{(h)}_{\boldsymbol\alpha} : \|f\|_{L^2} = 1\} = \{\phi(N) \mid N \in H_{\boldsymbol\alpha^{(h)}} : \|N\|_F = 1\},
\]
we have that
\[
\mathbb{E}_{f\in H^{(h)}_{\boldsymbol\alpha}:\,\|f\|_{L^2}=1}\Bigl(\sum_{i=1}^{m} f(p_i)^2\Bigr) = \mathbb{E}_{N\in H_{\boldsymbol\alpha^{(h)}}:\,\|N\|_F=1}\Bigl(N \circ \bigl(L^{(h)} \circ_\zeta L^{(h)}\bigr) \circ N\Bigr) = \frac{1}{h_D}\operatorname{trace}_{\boldsymbol\alpha^{(h)}}\bigl(L^{(h)} \circ_\zeta L^{(h)}\bigr) = \frac{1}{h_D}\|L^{(h)}\|_F^2,
\]
where we used Lemma 5.22 for ∅-orthogonal N = N(α^{(h)}) (or in other words, for β = ∅, d(β) = 1) and the fact that here d(α^{(h)}) = h_D. As Monte Carlo integrals are unbiased estimators,
\[
\frac{1}{m}\sum_{i=1}^{m} f(p_i)^2 \;\to\; \frac{1}{|\Omega_{\boldsymbol\alpha}|}\int_{\Omega_{\boldsymbol\alpha}} f^2 = \frac{1}{|\Omega_{\boldsymbol\alpha}|}\|f\|_{L^2}^2
\]
holds true for every function f ∈ C⁰(Ω_α). Thus, also
\[
\mathbb{E}_{f\in H^{(h)}_{\boldsymbol\alpha}:\,\|f\|_{L^2}=1}\Bigl(\frac{1}{m}\sum_{i=1}^{m} f(p_i)^2\Bigr) \;\xrightarrow[m\to\infty]{}\; \frac{1}{|\Omega_{\boldsymbol\alpha}|}.
\]
This was to be shown.

A more elaborate scaling scheme is provided by Section 5.5. However, we have to exclude the thin-plate term in the derivation of the Salsa regularizer in Theorem 5.24, since for every µ = 1, . . . , d,
\[
\frac{1}{h_\mu}\operatorname{trace}\bigl(d^{(2)}_{\alpha_\mu}\bigr) := \frac{1}{h_\mu}\sum_{k=1}^{h_\mu} d^{(2)}_{\alpha_\mu}(f_k, f_k) \;\xrightarrow[h_\mu\to\infty]{}\; \infty,
\]
where {f_k}_{k∈N} may be any orthonormal basis of H_{α_µ}. The sum even grows rapidly, as discussed in Section 6.2.2. The origin of this problem is that we use the wrong norm both for this and for the reweighted least squares approach discussed above. The approximation is thought of with respect to the norm ‖·‖_{B,λ},
\[
\|\cdot\|_{B,\lambda}^2 := \|\cdot\|_{L^2}^2 + \lambda^2 R_{\mathrm{TP}}(\cdot)^2,
\]
and instead we seem to have no other choice than to use the weaker cross-norm ‖·‖_{L²} for any rank adaption related procedures. In principle, the motivation in Section 5.2.1 may be adapted, suggesting to average each micro-step with respect to
\[
V_{A,\omega} = \{H \in T_r \mid \|H - A\|_{B,\lambda} \le \omega\},
\]
but apart from the fact that this is likewise too complicated to evaluate, it is not clear where this approach leads. An optimal solution to this dilemma so far remains hidden.


So instead, as mentioned above, we ignore the thin-plate regularization when it comes to rank adaption. The update of a single node N_v, and thereby the micro-step M^{(v)}, is then given by (cf. Eqs. (4.26) and (5.43))
\[
W^{\neq v}_{\sigma} \circ N_v^+ = Z_{\neq v} \circ_\zeta y, \tag{6.13}
\]
\[
Z_{\neq v} = Z_{\neq v}(\zeta,\boldsymbol\gamma) := L^{(h)} \circ N_{\neq v} = L^{(h)}_v \circ \bigcirc^{\zeta}_{h\in\operatorname{neighbor}(v)} B^{(L^{(h)}N)}_{v,h},
\]
for the adapted product representation
\[
W^{\neq v}_{\sigma}{}' = \bigcirc^{\zeta}_{h\in\operatorname{neighbor}(v)} \Bigl( B^{(L^{(h)}N)}_{v,h}{}' \circ_\zeta B^{(L^{(h)}N)}_{v,h} + \Bigl(\bigcirc^{\zeta}_{w\in\operatorname{branch}_v(h)} T_w\Bigr) \circ_\emptyset \omega^2 c_h \bigl(\Sigma^{p-2}_{v,h}\bigr)'\Bigr) \circ \bigl(L^{(h)}_v{}' \circ_\zeta L^{(h)}_v\bigr) + \lambda^2 \bigcirc_{h\in\operatorname{neighbor}(v)} B^{(N' B^{(h)}{}' N)}_{v,h} \circ B^{(h)}_v{}',
\]
and (as in Eq. (5.45))
\[
T_{v^{(\mathrm{outer})}_j} := \operatorname{trace}_{\alpha^{(h)}_j}\bigl(L^{(h)}_{v^{(\mathrm{outer})}_j}{}' \circ_\zeta L^{(h)}_{v^{(\mathrm{outer})}_j}\bigr), \quad j = 1,\ldots,d, \qquad T_w := L^{(h)}_w{}' \circ_\zeta L^{(h)}_w, \quad w \notin V^{(\mathrm{outer})}.
\]
This is a sum of two terms, for both of which we have discussed solution strategies in Section 5.6. The same procedures can be performed together without further issues. Note that the scaling terms that appear in Eq. (6.13) are closely related to the situation in Lemma 6.5, as they involve partial traces of L ∘_ζ L (cf. Section 5.5.2). In brief, the unscaled, reweighted least squares approach can also be derived as a simplification of the scaled Salsa approach, and will yield the same constant c > 0 as in Eq. (6.12).

The same semi-implicit rank adaption as in Section 5.6.2 can be applied, both for the unscaled, reweighted least squares approach as well as for the scaled Salsa approach. The least squares problems can analogously be solved through coarse CG as in Section 5.6.1, albeit without the simplifications discussed in Section 5.5.3, as the sampling operator does not yield sparse nodes L^{(h)}_v. In practice, one only has to precompute the nodes L^{(h)}_v and B^{(h)}_v, v ∈ V, depending on the number h_j of basis functions, j = 1, . . . , d, which can be done with negligible computational cost as discussed in Section 6.2.2. The smoothness parameter λ can further be adapted offline, based for example on the results on the validation set P_2 (cf. Definition 5.18).

6.3 Demonstration via Rolling Press Data

In the following, we demonstrate the capabilities of the previously discussed algorithms on data sets originating from pass schedules. These describe sets of parameters involved in the rolling of metallic work pieces. Each constellation determines in which way the process is performed. One is particularly interested in optimizing these parameters in order to obtain desired material properties, or to influence other relevant factors, which are usually denoted as quantities of interest (qoi). In industry, series of experiments have been performed, for which both inputs as well as outputs have been recorded.

It is of great interest to be able to numerically predict the results that other parameter constellations lead to, a task that would otherwise be time consuming and expensive. In other words, one wants to learn a multivariate function that approximates the underlying physical model, and thereby interpolates known measurements. The already performed experiments then constitute the training, or sampling, set. This task hence belongs to the classical framework of supervised learning.


Naturally, the existing experimental data is of high financial value, such that it is unfortunately not available to us. The Institute of Metal Forming (IBF) at RWTH Aachen University however has an elaborate numerical model at its disposal (cf. [74]), which has proven itself to be reliable. They kindly provided the data on which the experiments in this section are based.

In practice, there are many secondary parameters which may only marginally contribute to the outcome and which are usually omitted from the model. We here focus on a limited collection of the most important ones. In each (numerical) experiment, a metal work piece of fixed starting height h_0 and length ℓ_0 is first heated up to a temperature T_0. Afterwards, the actual rolling starts. As the width of the work piece remains (approximately) constant during the process, and the volume is preserved, only the height remains as independent parameter.

The i-th so-called pass, i = 1, . . . , N, starts at time t_i after the heating-up phase. Then, with a speed of v_i, the metallic work piece is rolled between two rotating presses in order to reduce its height to h_i. After each of these passes, the other end of the work piece then faces the presses.

After the N-th pass, at which point the work piece has been rolled N times, the experiment ends. Output quantities of interest include the required rolling force, the required energy, the rolling torque, the temperature of the work piece and, most importantly, the grain size of the material during or after pass N. How well a quantity can be interpolated strongly depends on the quantity itself. The physical model underlying the temperature development is rather simple, whereas the grain size is the result of a complicated process that varies quickly if subjected to moderate changes of the input parameters.

In all numerical experiments, the only parameters that are varied are the starting temperature T_0 and the triplets of values (h_i, t_i, v_i), i = 1, . . . , N. We treat the different quantities of interest separately, such that each one yields the task to recover a (d = 1 + 3N)-dimensional, expectedly smooth function. We further assume that the sought mapping also exhibits a low tensor rank. The global physical model is hence believed not to be subject to chaotic behavior under change of one or a small group of parameters.

6.3.1 Implementational Details and Preprocessing

The provided data needs to be preprocessed in order to fit into the scheme of a multivariate, low-rank approximation. In particular, all constellations of parameters should be compatible with each other. Instead of the height h_i after pass i, we use the relative reduction of height
\[
Qh_i := \frac{h_i}{h_{i-1}} \in [0, 1]
\]
as input parameter. The start t_i of pass i is replaced by
\[
\Delta t_1 := t_1 \ge 0, \qquad \Delta t_i := t_i - t_{i-1} - \Delta t_{\min} - \frac{\ell_{i-1}}{v_{i-1}} \ge 0, \quad i \ge 2.
\]
The small value Δt_min > 0 is a certain minimal time dilation necessary due to the machine design. Thereby, Δt_i = 0 corresponds to no additional time dilation between passes, and is hence always a physically interpretable parameter. The length of the metallic work piece


before pass i is simply given by ℓ_{i−1} = h_0·ℓ_0/h_{i−1}, where h_0 and ℓ_0 are common to all experiments, hence throughout the provided data. The temperature T_0 and the insertion speeds v_i, i = 1, . . . , N, do not require modification.
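A sketch of this preprocessing for a single pass schedule (Python; the argument names and the Δt_min value are placeholders and do not reflect the IBF data format):

```python
import numpy as np

def preprocess_schedule(T0, h, t, v, h0, l0, dt_min=0.5):
    """Map one pass schedule to the model parameters (T0, Qh_i, dt_i, v_i).

    h, t, v : per-pass heights h_i, start times t_i and speeds v_i, i = 1..N
    h0, l0  : common initial height and length of the work piece
    dt_min  : minimal time dilation (machine dependent, placeholder value)
    """
    N = len(h)
    heights = np.concatenate(([h0], np.asarray(h)))   # h_0, h_1, ..., h_N
    lengths = h0 * l0 / heights                       # l_j = h0 * l0 / h_j
    Qh = heights[1:] / heights[:-1]                   # relative height reduction
    dt = [t[0]]                                       # dt_1 = t_1
    for m in range(2, N + 1):                         # passes 2..N
        dt.append(t[m - 1] - t[m - 2] - dt_min - lengths[m - 1] / v[m - 2])
    params = [T0]
    for i in range(N):
        params += [Qh[i], dt[i], v[i]]
    return np.array(params)                           # length d = 1 + 3N
```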

The choice of the family K, or in other words the graph G underlying the representation N, is of great importance. Here, the knowledge about the physical interpretation of the data can be used to design a proper tree. In particular, the parameter T_0 and the triplets (Qh_i, Δt_i, v_i), i = 1, . . . , N, are chronologically ordered. Our first choice is what could be regarded as a hybrid of the tensor train and binary HT format, as depicted for N = 3 in Fig. 6.2.

We therefore treat the variables and their discretized versions (previously α and α^{(h)}) as mode labels with ordering T_0 < Qh_1 < Δt_1 < v_1 < Qh_2 < . . . < v_N, d = 1 + 3N. The corresponding family expressed in terms of these mode labels is

\[
K_1 = \{\{T_0\}, \{Qh_1\}, \ldots, \{v_N\}\} \cup \{\{\Delta t_i, v_i\}\}_{i=1,\ldots,N} \cup \{\{Qh_i, \Delta t_i, v_i\}\}_{i=1,\ldots,N} \cup \Bigl\{\{T_0\} \cup \textstyle\bigcup_{i=1}^{k} \{Qh_i, \Delta t_i, v_i\}\Bigr\}_{k=1,\ldots,N-1},
\]
such that |K_1| = d + N + N + N − 1 = 6N. By assigning the numbers 1 for T_0 up to d for v_N, we would obtain the same family, but consisting of subsets of {1, . . . , d} as originally defined.
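For concreteness, the family K_1 can be generated programmatically from the mode labels; a small sketch (Python) that also checks |K_1| = 6N:

```python
def family_K1(N):
    """Hierarchical family K_1 for the TT/HT hybrid tree (mode labels)."""
    labels = ["T0"] + [f"{p}{i}" for i in range(1, N + 1)
                       for p in ("Qh", "dt", "v")]
    K = [frozenset({lab}) for lab in labels]                      # singletons
    K += [frozenset({f"dt{i}", f"v{i}"}) for i in range(1, N + 1)]
    K += [frozenset({f"Qh{i}", f"dt{i}", f"v{i}"}) for i in range(1, N + 1)]
    prefix = {"T0"}
    for k in range(1, N):                                         # k = 1..N-1
        prefix |= {f"Qh{k}", f"dt{k}", f"v{k}"}
        K.append(frozenset(prefix))
    return K

K1 = family_K1(3)
assert len(K1) == len(set(K1)) == 6 * 3    # |K_1| = 6N, here d = 10
```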

Figure 6.2: The network {N_v}_{v∈V} corresponding to the family K_1, for N_{v_j^{(outer)}} = g_j, j = 1, . . . , d, and {N_w}_{w∉V^{(outer)}} = {B_w}_{w=1}^{|V∖V^{(outer)}|}. Each triplet (Qh_i, Δt_i, v_i) characterizes the pass i. Here, the case N = 3, d = 10, is shown.

The tuning parameters appearing in Section 5.6.2 are chosen as f_minor = 1/2, k_minor = 2, r_lim = 10, f_{σ_min} = 1/10, m_val = (15/100)·m, (f_ω^{(min)}, f_ω^{(max)}) = (1.001, 1.2), ε_progr = 0.05. Other, similar values yield similar qualities of approximation. We further use finite subspaces of dimension h_µ = 200, and construct φ_µ, µ = 1, . . . , d, as isometries with respect to the L² norm, such that φ_µ(e_k) = f_k, k ∈ N. The smoothness parameter λ is optimized based on the results for the validation set P_val, over 5 trials with different values each.

Alternative choices for the tree are the ordinary TT-format as well as a TT-like format as depicted in Fig. 6.3, for the family
\[
K_2 = \{\{T_0\}, \{Qh_1\}, \ldots, \{v_N\}\} \cup \bigl\{\{T_0, Qh_1\}, \{T_0, Qh_1, \Delta t_1\}, \ldots, \{T_0, \ldots, v_{N-1}, Qh_N\}\bigr\},
\]


such that |K_2| = d + d − 2 = 6N. As subsets of {1, . . . , d}, the family is simply K_2 = {{1}, {2}, . . . , {d}, {1, 2}, {1, 2, 3}, . . . , {1, . . . , d − 2}}, so it differs from the ordinary tensor train format by the additional sets J = {µ}, µ = 2, . . . , d.

Figure 6.3: The network {N_v}_{v∈V} corresponding to the family K_2, for N_{v_j^{(outer)}} = g_j, j = 1, . . . , d, and {N_w}_{w∉V^{(outer)}} = {B_w}_{w=1}^{|V∖V^{(outer)}|}. Each triplet (Qh_i, Δt_i, v_i) characterizes the pass i. Here, the case N = 3, d = 10, is shown.

6.3.2 Numerical Results

For each qoi (quantity of interest), numbered with $k \in \mathbb{N}$, the data set of IBF provides a total of $m_{\max} = 50\,000$ sample points $\{(x_i, z_i^{(k)})\}_{i=1,\ldots,m_{\max}} \subset \Omega_\alpha \times \mathbb{R}$. For no noise and exact data, we would expect $z_i^{(k)} = u^{(k)}(x_i)$, where $u^{(k)}$ is the true physical model, that is, the sought solution for the specific qoi. The measurements $z_i^{(k)}$ are however only given up to a certain accuracy, which naturally causes a limit of possible reconstruction accuracy (as can be observed in the results below).

In each trial, we pick $m = |P|$ points randomly from this collection and run Salsa, Algorithm 13, adapted to micro-steps given through Eq. (6.13). For fixed $k$, the vector of measurements is given by $y \in \mathbb{R}^m$, $y_i = z_j^{(k)}$ for $j$ such that $p_i = x_j$, $i = 1, \ldots, m$, $P = \{p_1, \ldots, p_m\}$. The choice of the tree network and tuning parameters is discussed in the previous Section 6.3.1.

For each of the five quantities of interest, for each of $m = 500, 1000, 2000, \ldots, 16000$ sampling points and $N = 1, \ldots, 6$ (that is, $d = 4, 7, \ldots, 19$), we perform 5 trials$^3$. After each run, we save the residual $R_{\mathrm{test}}$ and normalized residual
\[
R_{\mathrm{test,norm}} = R_{\mathrm{test}} / \|\mathrm{mean}(y_{\mathrm{test}}) - y_{\mathrm{test}}\|_2, \qquad
R_{\mathrm{test}}^2 := \|L_{\mathrm{test}}(\phi(N)) - y_{\mathrm{test}}\|_2^2 = \sum_{i=1}^{m} \bigl(\phi(N)(c_i) - (y_{\mathrm{test}})_i\bigr)^2,
\]
for a (randomly selected) disjoint test set $\{c_1, \ldots, c_m\} \subset \{x_i\}_{i=1}^{m_{\max}} \setminus P$ of the same magnitude as $P$. The normalized residuals of all 5 trials are then averaged with respect to the geometric mean.
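The following sketch shows how the test residual, its normalized version and the geometric-mean averaging could be computed; `phi_N` stands for an evaluation routine of the reconstructed function and is a placeholder, not part of the actual implementation.

```python
import numpy as np

def normalized_test_residual(phi_N, c_test, y_test):
    """Test residual R_test and its normalized version as defined above."""
    pred = np.array([phi_N(c) for c in c_test])
    R_test = np.linalg.norm(pred - y_test)
    R_norm = R_test / np.linalg.norm(np.mean(y_test) - y_test)
    return R_test, R_norm

def geometric_mean(values):
    """Average the normalized residuals of several trials with respect to the geometric mean."""
    values = np.asarray(values, dtype=float)
    return float(np.exp(np.mean(np.log(values))))
```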

As comparison, we repeat the previously carried out numerical experiment, but for the TT-like network corresponding to $K_2$ (cf. Fig. 6.3) as well as the ordinary TT-format, instead of the (TT and) HT hybrid (cf. Fig. 6.2). The results for the TT-like and HT hybrid formats are shown on the following pages in Figs. 6.4 to 6.8. Depending on the qoi, the data set itself

$^3$This makes a total of $5 \cdot 5 \cdot 6 \cdot 6$ trials, times 5 if one counts the repeated runs in order to optimize $\lambda$.


is only given up to a certain precision which cannot be surpassed. The reconstructions of the required energy and temperature come close to this optimal approximation quality, while the results for the rolling force and torque are reasonably accurate. The grain size poses a problem in the sense that the data sets possibly only allow for an optimal approximation accuracy of roughly $10^{-0.6}$. At least in this case, the applied algorithm yields only a rough interpolation of the given data.
The approximation with the TT-like network is worse, although not substantially, since the set $K_2$ still captures the chronology of the physical background. The results for the ordinary tensor train format are not shown in the plots as they are consistently inferior (up to half an order of magnitude) to those for the hybrid network. As that format provides a weaker regularization (that is, $|K_{\mathrm{TT}}| < |K_1|$), this suggests that the to-be-recovered physical model is not only smooth up to a certain degree, but also subject to low-rank constraints and should be approached accordingly.


[Figure 6.4: plot of the normalized residual (roughly $10^{-2.4}$ to $10^{-1.1}$) over the sampling size (0.5k to 16k) for $N = 1, \ldots, 6$, i.e. $d = 4, 7, \ldots, 19$; shown are the HT hybrid results, the HT deviation, and the regions where the TT-like format performs better or worse.]

Figure 6.4: The averaged (with respect to the geometric mean), normalized residuals on test sets, for the reconstruction of the energy required for pass $N$, $N = 1, \ldots, 6$, for the HT hybrid ($K_1$, in blue) compared to the TT-like format ($K_2$, in bordeaux and turquoise). The accuracy of the data sets allows an optimal approximation accuracy of roughly $10^{-2.4}$.


[Figure 6.5: plot of the normalized residual over the sampling size (0.5k to 16k) for $N = 1, \ldots, 6$; same legend as Fig. 6.4.]

Figure 6.5: The averaged (with respect to the geometric mean), normalized residuals on test sets, for the reconstruction of the rolling force applied during pass $N$, $N = 1, \ldots, 6$, for the HT hybrid ($K_1$, in blue) compared to the TT-like format ($K_2$, in bordeaux and turquoise). The accuracy of the data sets allows an optimal approximation accuracy of roughly $10^{-2.3}$.


[Figure 6.6: plot of the normalized residual over the sampling size (0.5k to 16k) for $N = 1, \ldots, 6$; same legend as Fig. 6.4.]

Figure 6.6: The averaged (with respect to the geometric mean), normalized residuals on test sets, for the reconstruction of the temperature measured after pass $N$, $N = 1, \ldots, 6$, for the HT hybrid ($K_1$, in blue) compared to the TT-like format ($K_2$, in bordeaux and turquoise). The accuracy of the data sets allows an optimal approximation accuracy of roughly $10^{-4.6}$.


[Figure 6.7: plot of the normalized residual over the sampling size (0.5k to 16k) for $N = 1, \ldots, 6$; same legend as Fig. 6.4.]

Figure 6.7: The averaged (with respect to the geometric mean), normalized residuals on test sets, for the reconstruction of the rolling torque measured during pass $N$, $N = 1, \ldots, 6$, for the HT hybrid ($K_1$, in blue) compared to the TT-like format ($K_2$, in bordeaux and turquoise). The accuracy of the data sets allows an optimal approximation accuracy of roughly $10^{-3.7}$.


[Figure 6.8: plot of the normalized residual (roughly $10^{-0.6}$ to $10^{-0.2}$) over the sampling size (0.5k to 16k) for $N = 1, \ldots, 6$; same legend as Fig. 6.4.]

Figure 6.8: The averaged (with respect to the geometric mean), normalized residuals on test sets, for the reconstruction of the grain size obtained after pass $N$, $N = 1, \ldots, 6$, for the HT hybrid ($K_1$, in blue) compared to the TT-like format ($K_2$, in bordeaux and turquoise).


6.4 Comparison with Discrete Tensor Completion

We repeat the numerical experiment regarding the approximation of three generic tensors with non-uniform sets of singular values (see Section 5.7.2), in which the sampling is given on a discrete grid $\{1, \ldots, n\}^{d_i}$, $n = 8$, $d_1 = 8$, $d_2 = 7$, $d_3 = 11$. We do not change the sampling size $m$, but the training set $P$ now consists of $m$ uniformly random points in $[1, n]^{d_i}$, $i = 1, 2, 3$ (similarly so for the validation and test set). Further, we reduce the number of trials from 20 to 5 each, but search five times for an optimal parameter $\lambda$, as described in Section 6.3.1. The dimension of the finite subspace is chosen as $h_\mu = 100$, $\mu = 1, \ldots, d_i$, and the maximal rank as $r_{\lim} = 8$, while all other tuning parameters remain as in Section 6.3.1. We consider the binary hierarchical Tucker format with corresponding family (for $d = 8$)
\[
K_{\mathrm{HT}} = \{\{1\}, \ldots, \{8\}, \{1, 2\}, \{3, 4\}, \ldots, \{7, 8\}, \{1, 2, 3, 4\}\},
\]
as well as the tensor train format with family $K_{\mathrm{TT}} = \{\{1\}, \{1, 2\}, \ldots, \{1, \ldots, d-1\}\}$, for each of Salsa and Rwals (cf. Section 5.7). The results are shown in Fig. 6.9.
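As a small illustration, the two families can be generated and compared as follows (plain Python sets; the helper names are our own):

```python
def K_TT(d):
    """Tensor train family {{1}, {1,2}, ..., {1,...,d-1}}."""
    return {frozenset(range(1, k + 1)) for k in range(1, d)}

def K_HT_binary(d=8):
    """Binary hierarchical Tucker family for d = 8 as used here:
    all singletons, the pairs {1,2}, ..., {7,8} and the block {1,2,3,4}."""
    singles = {frozenset({i}) for i in range(1, d + 1)}
    pairs = {frozenset({2 * j - 1, 2 * j}) for j in range(1, d // 2 + 1)}
    return singles | pairs | {frozenset(range(1, 5))}

# the binary HT family contains more sets (edges) than the TT family:
assert len(K_HT_binary(8)) == 13 and len(K_TT(8)) == 7
```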

[Figure 6.9: three panels $f^{(1)}$, $f^{(2)}$, $f^{(3)}$ showing the relative test residual over the sampling size (of the order $10^3$ to $2 \cdot 10^4$) for the binary HT and tensor train formats.]

Figure 6.9: Averaged relative test residuals and shadings proportional to the standard deviations as functions of the sampling size $m = |P|$ as results of 5 trials each, for a binary HT (turquoise) and tensor train (blue, filled symbols) format, for Salsa (round markers) and Rwals (crosses). The markers are exact, the intermediate lines are shape-preserving piecewise cubic Hermite interpolations of such.

For $f^{(1)}$ and $f^{(2)}$, the results for the TT-format are considerably worse than in the discrete case (cf. Fig. 5.9). For $f^{(3)}$, the difference is less notable. There may be several reasons for this. While the continuity assumptions provide more regularity, the off-grid sampling is not easy to interpret in regions where function values change rapidly, and the rank adaption may still be considerably improved. Further, it is not clear how well each function $f^{(i)}$ can be approximated on the entire hypercube $[1, n]^{d_i}$, $i = 1, \ldots, 3$, within the discretized space $H^{(h)}_\alpha$ (cf. Section 6.2.2). A polynomial basis might be more successful for such artificially defined functions.

The binary HT format, unsurprisingly, outperforms the TT-format, as it contains more edges and hence regularizes a larger variety of ranks. Even though it is not necessarily the case, the functions here are chosen generically, such that one does not expect any partition of modes to yield a significantly larger rank. Even then, this poses no algorithmic problem if one assumes the algorithm to be perfectly rank-adaptive.

The unscaled version Rwals performs slightly worse than Salsa, in particular for $f^{(2)}$. Interestingly, as the benchmarks suggest, Rwals requires at least twice as many CG steps in each micro-step, despite identical parameters and tolerances. This fact may be relatable to the so-called stable, internal tensor restricted isometry property (cf. [42]), but requires further analysis.


Part III

Feasibility of Tensor Singular Values


Chapter 7

The Quantum Marginal and Tensor Feasibility Problem

Chapters 4 to 6 have been subject largely to practical and algorithmic considerations. This Chapter 7 and the subsequent Chapter 8 are of a more theoretical nature and analyze the behavior of singular values of tensors associated to families of matricizations, in particular hierarchical ones corresponding to tree tensor networks. We here also consider complex, Euclidean spaces $\mathbb{C}^m$, $m \in \mathbb{N}$, as discussed in Section 2.8.2.

7.1 The Tensor Feasibility Problem (TFP)

Feasibility carries a particular meaning for tree tensor networks and all specific formats contained in that class, as discussed below. Likewise, there are specific methods such as the tree SVD that can be applied to this setting. Nonetheless, feasibility is defined in a more general context, as we introduce in Section 7.1.2.

7.1.1 Introduction regarding Tree Tensor Formats and Quantum Physics

As high-dimensional generalizations of the matrix SVD, tree tensor formats are associated to tuples of singular values. These values yield distinguished levels of complexity, allow for more elaborate rank adaption methods and are a key tool to many other theoretical and practical considerations. Yet since they depend on matricizations of the same object, not for every constellation can a tensor be found which realizes such given singular values.

For one thing, we are hence concerned with obtaining a better intuition for the behavior of such constellations as well as with providing strict mathematical bounds. While it was for example not clear whether singular values decline with similar exponential rates, it has since been shown that this is not necessarily the case. This setting has to be differentiated from estimates on singular values based on knowledge about one specific tensor. We are here concerned with the general possibility of constellations as it is intrinsic to the tensor spaces themselves.

On the other hand, one is interested in fast methods to construct tensors with prescribed singular values, for example in order to use such in numerical experiments or to demonstrate their feasibility for further theoretical purposes. Last but not least, it turns out that the entire tensor feasibility problem is equivalent to a version of the quantum marginal problem, which plays a significant role in several fields within physics.


7.1.2 Formal Definition of the Tensor Feasibility Problem

Let $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$ and $d \in \mathbb{N}$. For a (not necessarily hierarchical) family (cf. Definition 3.14)
\[
K \subset \{J \subset I \mid J \neq \emptyset\}, \qquad I := \{1, \ldots, d-1\},
\]
each tensor $A \in \mathbb{K}^{n_1 \times \ldots \times n_d}$ yields a corresponding family of singular values $\{\sigma^{(J)}\}_{J \in K}$,
\[
\sigma^{(J)} := \mathrm{sv}(A^{(J)}), \qquad J \in K,
\]
where $\mathrm{sv}$ maps to the singular values of a matrix$^1$. The reshapings
\[
A^{(J)} \in \mathbb{K}^{n_J \times n_{D \setminus J}}, \qquad n_J := \prod_{j \in J} n_j, \qquad J \subset I \subset D := \{1, \ldots, d\},
\]
are introduced in Eq. (2.30) (for $\mathbb{K} = \mathbb{R}$), and we have $A^{(J)} = A^{(\alpha_J)}$ once we treat $A = A(\alpha)$ as a tensor node. We require these singular values $\sigma^{(J)}$ to be formally independent of the mode size $n$, so here each is considered a weakly decreasing, infinite sequence with nonnegative and finitely many nonzero entries, hence an element of the cone which we define as $\mathcal{D}^\infty_{\geq 0}$:

Definition 7.1 (Set of weakly decreasing tuples/sequences [68]). For $n \in \mathbb{N}$, let $\mathcal{D}^n \subset \mathbb{R}^n$ be the cone of weakly decreasing $n$-tuples and let $\mathcal{D}^n_{\geq 0} := \mathcal{D}^n \cap \mathbb{R}^n_{\geq 0}$ be its restriction to nonnegative numbers. Further, let $\mathcal{D}^\infty_{\geq 0} \subset \mathbb{R}^{\mathbb{N}}$ be the cone of weakly decreasing, nonnegative sequences with finitely many nonzero entries. The positive part $v_+ \in \mathcal{D}^{\deg(v)}_{>0}$ is defined as the positive elements of $v$, where $\deg(v) := \max_{i : v_i > 0} i$ is its degree.

For example, for $\gamma = (4, 2, 2, 0, 0, \ldots) \in \mathcal{D}^\infty_{\geq 0}$, we have $\deg(\gamma) = 3$ and $\gamma_+ = (4, 2, 2) \in \mathcal{D}^3_{>0}$. Similar to before, we denote
\[
\Gamma := \mathrm{diag}(\gamma_+) = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.
\]

In the initial part of this chapter, there is not yet reason to assign mode labels. For a fixed family $K$, the tensor feasibility problem now asks for the range of the map
\[
\mathrm{sv}_K : \mathbb{K}^{n_1 \times \ldots \times n_d} \to (\mathcal{D}^\infty_{\geq 0})^K, \qquad \mathrm{sv}_K(A) = \{\mathrm{sv}(A^{(J)})\}_{J \in K}.
\]
In other words, a specific constellation of potential singular values is called feasible if there exists a tensor which realizes such:

Definition 7.2 (Tensor feasibility problem (TFP) [68]). For each $J \in K$, $J \subset I$, let $\sigma^{(J)} \in \mathcal{D}^\infty_{\geq 0}$ (potential singular values). Then the collection $\{\sigma^{(J)}\}_{J \in K}$ is called feasible for $n$ if there exists a tensor $A \in \mathbb{K}^{n_1 \times \ldots \times n_d}$ such that
\[
\mathrm{sv}(A^{(J)}) = \sigma^{(J)}, \qquad A^{(J)} \in \mathbb{K}^{n_J \times n_{D \setminus J}}, \tag{7.1}
\]
for all $J \in K$.

Sets for which $d \in J$ are not included in the definition since simply $\mathrm{sv}(A^{(J)}) = \mathrm{sv}(A^{(D \setminus J)})$, where then $d \notin D \setminus J$. Among all possible families $K$, there are many which yield equivalent problems, considering for example permutations of $I$. The family of singular values can also be viewed as a map $J \mapsto \sigma^{(J)}$, and we will use according indices to clarify. One condition for feasibility, independent of the specific $K$, is the trace property. For any family to be feasible, we must have
\[
\|\sigma^{(J)}\|_2 = \|\sigma^{(\tilde J)}\|_2, \qquad \forall J, \tilde J \in K, \tag{7.2}
\]
since simply $\|\sigma^{(J)}\|_2 = \|A\|_F$ if Eq. (7.1) holds true.

$^1$The superscript $(J)$ of $\sigma^{(J)}$ is not an unfolding in this case, but just an index.
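A short NumPy check of the map $\mathrm{sv}_K$ and the trace property for a small random tensor (0-based mode indices; `sv_J` is our own helper, not notation from the text):

```python
import numpy as np
from itertools import combinations

def sv_J(A, J):
    """Singular values of the matricization A^(J): rows indexed by the modes in J."""
    J = sorted(J)
    rest = [mu for mu in range(A.ndim) if mu not in J]
    M = np.transpose(A, J + rest).reshape(int(np.prod([A.shape[mu] for mu in J])), -1)
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 2, 5))                      # d = 4
# all nonempty J contained in I = {1, ..., d-1} (here 0-based {0, 1, 2})
K = [J for k in range(1, A.ndim) for J in combinations(range(A.ndim - 1), k)]
for J in K:   # trace property: ||sigma^(J)||_2 = ||A||_F for every J
    assert np.isclose(np.linalg.norm(sv_J(A, list(J))), np.linalg.norm(A))
```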


7.2 The Quantum Marginal Problem (QMP)

The quantum marginal problem (more specifically, a certain version of it) has been known in physics for multiple decades (cf. [18, 63, 89]). Its equivalence to the TFP however has only recently come to be known by the respective mathematical community, as a result of the review process of [68]. The reason for this is not a complicated relation, but the different ways in which the two problems are originally phrased$^2$. For easier reference, we redefine the partial trace for unlabeled tensors below as an analogue to $\mathrm{trace}_{\alpha_{I \setminus J}}$ (cf. Remark 2.30).

Definition 7.3 (Partial trace). For $J \subset I$, we define the partial trace
\[
\mathrm{trace}_{I \setminus J} : \mathbb{K}^{n_I \times n_I} \to \mathbb{K}^{n_J \times n_J}
\]
as the linear map induced by
\[
\mathrm{trace}_{I \setminus J}(A_1 \otimes \ldots \otimes A_{d-1}) = \prod_{i \notin J} \mathrm{trace}(A_i) \cdot \bigotimes_{i \in J} A_i \in \mathbb{K}^{n_J \times n_J}
\]
for all elementary Kronecker products of matrices $A_i \in \mathbb{K}^{n_i \times n_i}$.

Instead of feasible singular values, the interest lies in so-called compatible eigenvalues.

Definition 7.4 (Quantum marginal problem (QMP) [68]). For each $J \in K$, $J \subset I$, let $\lambda^{(J)} \in \mathcal{D}^\infty_{\geq 0}$ (potential eigenvalues). Then the collection $\{\lambda^{(J)}\}_{J \in K}$ is called compatible for $(n_1, \ldots, n_{d-1})$ if there exists a Hermitian, positive semidefinite matrix $\rho_I \in \mathbb{C}^{n_I \times n_I}$ such that
\[
\mathrm{ev}(\rho_J) = \lambda^{(J)}, \qquad \rho_J = \mathrm{trace}_{I \setminus J}(\rho_I) \in \mathbb{K}^{n_J \times n_J}, \tag{7.3}
\]
for all $J \in K$, where $\mathrm{ev}$ maps to the eigenvalues (interpreted as an infinite sequence) of a matrix.

Theorem 7.5 (Equivalence of TFP and QMP [68]). The feasibility of $\{\sigma^{(J)}\}_{J \in K}$ is equivalent to the compatibility of the entrywise squared values $\{(\sigma^{(J)})^2\}_{J \in K}$ through a matrix with $\mathrm{rank}(\rho_I) \leq n_d$.

If conversely $\mathrm{rank}(\rho_I)$ is not specified, then $n_d$ may be chosen as large as necessary in the TFP in order to provide feasibility.

Proof. (cf. [68]) The equivalence of feasibility and compatibility is based on the correspondence
\[
A^{(I)} A^{(I)H} = \rho_I, \tag{7.4}
\]
where $\cdot^H$ is the conjugate, or Hermitian, transpose. Given $\rho_I$, the tensor
\[
A = A(\alpha) \in \mathbb{K}^{n_1 \times \ldots \times n_{d-1} \times \mathrm{rank}(\rho_I)},
\]
here interpreted as a tensor node, is uniquely defined via the Cholesky decomposition. Further,
\[
\mathrm{trace}_{I \setminus J}(A^{(I)} A^{(I)H}) = \mathrm{trace}_{I \setminus J}\bigl((A \odot_{\alpha_d} A^H)^{(\alpha_I),(\alpha_I)}\bigr) = \bigl(\mathrm{trace}_{\alpha_{I \setminus J}}(A \odot_{\alpha_d} A^H)\bigr)^{(\alpha_J),(\alpha_J)} = (A \odot_{\alpha_d \cup \alpha_{I \setminus J}} A^H)^{(\alpha_J),(\alpha_J)} = A^{(J)} A^{(J)H}.
\]
Thereby
\[
\lambda^{(J)} = \mathrm{eig}(\rho_J) = \mathrm{eig}\bigl(\mathrm{trace}_{I \setminus J}(A^{(I)} A^{(I)H})\bigr) = \mathrm{sv}(A^{(J)})^2 = (\sigma^{(J)})^2 \tag{7.5}
\]
for all $J \subset D$. It thus follows that Eq. (7.3) holds true for $\rho_I$ given $\lambda^{(J)} = (\sigma^{(J)})^2$, $J \in K$, if and only if Eq. (7.1) holds true for $A$, provided $n_d \geq \mathrm{rank}(\rho_I)$.

$^2$We here present the mathematical version.
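The correspondence (7.4)–(7.5) can be verified numerically for a small example with $d = 3$, $I = \{1, 2\}$; the partial trace is implemented directly via an `einsum` contraction (a sketch under these conventions, not code from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 3, 4, 5                       # d = 3, so I = {1, 2}
A = rng.standard_normal((n1, n2, n3)) + 1j * rng.standard_normal((n1, n2, n3))

AI = A.reshape(n1 * n2, n3)                # matricization A^(I), I = {1, 2}
rho_I = AI @ AI.conj().T                   # correspondence (7.4): rho_I = A^(I) A^(I)^H

# partial trace over mode 2: rho_{1}[a, a'] = sum_b rho_I[(a, b), (a', b)]
rho_1 = np.einsum("abcb->ac", rho_I.reshape(n1, n2, n1, n2))

# Eq. (7.5): the eigenvalues of rho_{1} are the squared singular values of A^({1})
lam = np.sort(np.linalg.eigvalsh(rho_1))[::-1]
sv1 = np.linalg.svd(A.reshape(n1, n2 * n3), compute_uv=False)
assert np.allclose(lam, sv1**2)
```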


7.2.1 The Pure QMP

As we have seen, the value $n_d$ does not appear directly in the QMP. Through the relation Eq. (7.4), it can be interpreted as the rank of $\rho_I$. Another version of the QMP is as follows:

Definition 7.6 (Pure quantum marginal problem). The pure quantum marginal problem is the QMP as in Definition 7.4 with the additional constraint $\mathrm{rank}(\rho_I) = 1$.

An equivalent way to define the pure QMP is to postulate $I \in K$ as well as $\lambda^{(I)} = (\mathrm{trace}(\rho_I), 0, 0, \ldots)$. The TFP becomes equivalent with the additional condition $n_d = 1$.

In the simplest version $d - 1 = 2$ of the pure QMP, two tuples of potential eigenvalues $\lambda^{(\{1\})}, \lambda^{(\{2\})} \in \mathcal{D}^\infty_{\geq 0}$ are given. It asks whether there is a Hermitian, positive semidefinite, rank one matrix $\rho_I$ with $\mathrm{ev}(\mathrm{trace}_{\{2\}}(\rho_I)) = \lambda^{(\{1\})}$ and $\mathrm{ev}(\mathrm{trace}_{\{1\}}(\rho_I)) = \lambda^{(\{2\})}$. The equivalent TFP, due to $n_3 = 1$, asks if there is a matrix $A \in \mathbb{K}^{n_1 \times n_2}$ with $\mathrm{sv}(A) = (\lambda^{(\{1\})})^{\frac{1}{2}}$ and $\mathrm{sv}(A^{(\{2\})}) = \mathrm{sv}((A^{(\{1\})})^T) = (\lambda^{(\{2\})})^{\frac{1}{2}}$. Hence, $\lambda^{(\{1\})}$ and $\lambda^{(\{2\})}$ must be equal in order to be compatible.

The pure QMP can always be related to an ordinary QMP, in the sense that compatibility in the first case can be decided through the second one:

Proposition 7.7 (Relation between pure and ordinary QMP). Let the families $K \subset \{J \subset I \mid J \neq \emptyset\}$ and $\{\lambda^{(J)}\}_{J \in K}$ be given. Further, let
\[
f(J) := \begin{cases} \{1, \ldots, d-1\} \setminus J & \text{if } d-1 \in J, \\ J & \text{if } d-1 \notin J. \end{cases}
\]
Then $\{\lambda^{(J)}\}_{J \in K}$ is compatible for $n = (n_1, \ldots, n_{d-1}) \in \mathbb{N}^{d-1}$ in terms of the pure QMP and $K$ if and only if $\lambda^{(f(J))} = \lambda^{(J)}$ for all $J \in K$, and $\{\lambda^{(f(J))}\}_{J \in K}$ is compatible for $\tilde n = (n_1, \ldots, n_{d-2})$ in terms of the (ordinary) QMP and the family
\[
\tilde K := \{f(J) \mid J \in K\} \subset \{J \mid J \subset \tilde I\}, \qquad \tilde I := \{1, \ldots, d-2\},
\]
through a matrix $\rho_{\tilde I} \in \mathbb{K}^{n_{\tilde I} \times n_{\tilde I}}$ with $\mathrm{rank}(\rho_{\tilde I}) \leq n_{d-1}$.

Proof. The family $\{\lambda^{(J)}\}_{J \in K}$ is compatible in the case of the pure QMP if and only if $\{\sigma^{(J)}\}_{J \in K}$, $(\sigma^{(J)})^2 = \lambda^{(J)}$, $J \in K$, is feasible in terms of the TFP given $n_d = 1$. Since $\mathbb{K}^{n_1 \times \ldots \times n_d} \cong \mathbb{K}^{n_1 \times \ldots \times n_{d-1}}$, the tensor dimension is effectively reduced. Further, because $\mathrm{sv}(A^{(J)}) = \mathrm{sv}(A^{(\{1,\ldots,d-1\} \setminus J)})$, we must also have $\lambda^{(J)} = \lambda^{(f(J))}$ for all $J \in K$. It follows that this feasibility is equivalent to the one for $\tilde d = d-1$ and the above defined $\tilde K$. This in turn shows the equivalence to the compatibility of $\{\lambda^{(f(J))}\}_{J \in K}$ for $\mathrm{rank}(\rho_{\tilde I}) \leq n_{d-1}$.

For example, for $d - 1 = 3$, the pure QMP for $K = \{\{1\}, \{2\}, \{3\}\}$ is equivalent to the (ordinary) QMP for $\tilde K = \{\{1\}, \{2\}, \{1, 2\}\}$. In this case, the relation is well known and for example mentioned in [63] (although usually the additional rank constraint is left ambiguous). Usually, the partial traces in the setting of the pure QMP are denoted as $\rho_A, \rho_B, \rho_C$, and those for the ordinary QMP as $\rho_A, \rho_B, \rho_{AB}$.

7.2.2 Results for the Quantum Marginal Problem

Usual tools for the QMP stem from algebraic geometry, and all results so far suggest that sets of compatible values form convex, closed, polyhedral cones. Among the most important results are those which are directly related to the Tucker (cf. Sections 2.5.2 and 7.4.3) and tensor train format (cf. Section 2.5.1 and Chapter 8):


Pure QMP for $K = \{\{1\}, \ldots, \{d-1\}\}$ (Tucker feasibility):
For $n_i = 2$, $i = 1, \ldots, d-1$, the physical interpretation of the pure QMP is related to an array of qubits. As proven by [54], the family $\{\lambda^{(J)}\}_{J \in K}$ is compatible if and only if
\[
\lambda^{(i)}_2 \;\leq\; \sum_{j (\neq i)} \lambda^{(j)}_2, \qquad i \in I = \{1, \ldots, d-1\}. \tag{7.6}
\]
These inequalities are at the same time the H-description of the corresponding cone.
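A direct check of the criterion (7.6) might look as follows (a sketch; the pairs are assumed ordered as $\lambda^{(i)}_1 \geq \lambda^{(i)}_2 \geq 0$, and the function name is ours):

```python
def qubit_compatible(lams):
    """Check criterion (7.6) for an array of qubits (n_i = 2): the smaller eigenvalue
    of each single-site marginal must not exceed the sum of the smaller eigenvalues
    of all other marginals.

    lams : list of pairs (lambda_1^{(i)}, lambda_2^{(i)}) with lambda_1 >= lambda_2 >= 0."""
    small = [l2 for (_, l2) in lams]
    total = sum(small)
    return all(l2 <= total - l2 for l2 in small)
```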

For $d - 1 = 3$ and $n_1 = n_2 = n_3$, all constraints apart from the trace property are given by
\begin{align}
\lambda^{(a)}_2 + \lambda^{(a)}_3 &\leq \lambda^{(b)}_2 + \lambda^{(b)}_3 + \lambda^{(c)}_2 + \lambda^{(c)}_3, \tag{7.7}\\
\lambda^{(a)}_1 + \lambda^{(a)}_3 &\leq \lambda^{(b)}_2 + \lambda^{(b)}_3 + \lambda^{(c)}_1 + \lambda^{(c)}_3, \notag\\
\lambda^{(a)}_1 + \lambda^{(a)}_2 &\leq \lambda^{(b)}_2 + \lambda^{(b)}_3 + \lambda^{(c)}_1 + \lambda^{(c)}_2, \notag\\
2\lambda^{(a)}_2 + \lambda^{(a)}_3 &\leq 2\lambda^{(b)}_2 + \lambda^{(b)}_3 + 2\lambda^{(c)}_2 + \lambda^{(c)}_3, \notag\\
2\lambda^{(a)}_3 + \lambda^{(a)}_2 &\leq 2\lambda^{(b)}_2 + \lambda^{(b)}_3 + 2\lambda^{(c)}_3 + \lambda^{(c)}_2, \notag\\
2\lambda^{(a)}_2 + \lambda^{(a)}_1 &\leq 2\lambda^{(b)}_2 + \lambda^{(b)}_3 + 2\lambda^{(c)}_2 + \lambda^{(c)}_1, \notag\\
2\lambda^{(a)}_2 + \lambda^{(a)}_1 &\leq 2\lambda^{(b)}_3 + \lambda^{(b)}_2 + 2\lambda^{(c)}_1 + \lambda^{(c)}_2, \notag
\end{align}
for each possible $\{a, b, c\} = \{1, 2, 3\}$, as derived in [33, 53]$^3$. A general solution for $d - 1 = 3$ but arbitrary $n$ has been presented in [63, Theorem 3.4.1] and is largely based on algebraic geometry. The author also states that a generalization to larger $d$ is straightforward, but, due to the elaborate theoretical background, details remain unclear.

QMP for $K = \{\{1\}, \{1, 2\}\}$ (TT-feasibility for $d - 1 = 2$):
Similarly, [18] provides a result by which the H-description of the corresponding cone can be derived for each specific instance of $d$ and $n$. For example, for $n_1 = 3$ and $n_2 = 2$, all inequalities apart from the trace property are given by
\begin{align}
\lambda^{(\{1\})}_1 &\leq \lambda^{(\{1,2\})}_1 + \lambda^{(\{1,2\})}_2, \notag\\
\lambda^{(\{1\})}_1 + \lambda^{(\{1\})}_2 &\leq \lambda^{(\{1,2\})}_1 + \lambda^{(\{1,2\})}_2 + \lambda^{(\{1,2\})}_3 + \lambda^{(\{1,2\})}_4, \notag\\
\lambda^{(\{1\})}_3 &\leq \lambda^{(\{1,2\})}_2 + \lambda^{(\{1,2\})}_3, \notag\\
\lambda^{(\{1\})}_2 + \lambda^{(\{1\})}_3 &\leq \lambda^{(\{1,2\})}_1 + \lambda^{(\{1,2\})}_2 + \lambda^{(\{1,2\})}_3 + \lambda^{(\{1,2\})}_6. \tag{7.8}
\end{align}

Although the solution is in a certain sense complete, there have been and there are still open questions. For example, [18] conjectured that in the special case $n_1 \leq n_2$, compatibility of $(\lambda^{(\{1\})}, \lambda^{(\{1,2\})})$ is equivalent to just
\[
\sum_{i=1}^{k} \lambda^{(\{1\})}_i \;\leq\; \sum_{i=1}^{n_2 k} \lambda^{(\{1,2\})}_i, \qquad k = 1, \ldots, n_1, \tag{7.9}
\]
where equality must hold for $k = n_1$ (which relates to the trace property for feasibility). This instance was later proven and thus confirmed by [72] (in again different notation). Furthermore, the results in [18] do not (seem to) yield any other universal inequalities, such as for example those in Section 8.4.1.

$^3$Note that here, $\lambda^{(a)}_1 \geq \lambda^{(a)}_2 \geq \lambda^{(a)}_3$.


7.3 Independent Results for the Tensor Feasibility Problem

The general equivalence between the tensor feasibility problem and the quantum marginal problem has only relatively recently been established in [68]. There is a series of earlier, independent works and results for the Tucker (also known as HOSVD, [21, 95]) feasibility problem, starting with [49], who first introduced the alternating projection method (cf. Section 7.4.2). Shortly afterwards, further steps have been taken in [47], in which a Newton method for the assignment of prescribed singular values, including a convergence analysis, is presented.

Through matrix analysis and eigenvalue relations, [23] later introduced necessary and sufficient linear inequalities regarding feasibility, mostly restricted to the largest Tucker singular values of tensors with one common mode size (cf. Section 7.4.3). For $d = 3$, the necessary and sufficient conditions for the existence of a tensor with largest Tucker singular values $\sigma^{(a)}_1$, $a \in \{1, 2, 3\}$, are
\begin{align*}
(\sigma^{(1)}_1)^2 + (\sigma^{(2)}_1)^2 &\leq \|T\|^2 + (\sigma^{(3)}_1)^2,\\
(\sigma^{(1)}_1)^2 + (\sigma^{(3)}_1)^2 &\leq \|T\|^2 + (\sigma^{(2)}_1)^2,\\
(\sigma^{(2)}_1)^2 + (\sigma^{(3)}_1)^2 &\leq \|T\|^2 + (\sigma^{(1)}_1)^2.
\end{align*}
The necessity of these inequalities can, given the equivalence of the TFP and QMP (Theorem 7.5), directly be derived from the QMP result Eq. (7.7) as well as the fact that $\|T\|^2 = (\sigma^{(a)}_1)^2 + (\sigma^{(a)}_2)^2 + (\sigma^{(a)}_3)^2$ for all $a \in \{1, 2, 3\}$.

Independently, [90] proved the same result (cf. Eq. (7.6)) for the Tucker format provided $n_1 = \ldots = n_d = 2$, using yet other approaches within algebraic geometry. The possibility of decoupling, which we will consider in Section 7.4.1, and then pursue in Sections 7.4.3 and 8.1 for the Tucker and tensor train format, has first been mentioned in [68], and we will here present its generalization in detail.

7.4 Feasibility in Tree Tensor Networks

As we have introduced in Section 3.3, hierarchical families $K$ are related to tensor network decompositions. Further, Section 2.8.2 explains how these results are easily generalized to $\mathbb{K} = \mathbb{C}$. We can thereby decouple certain feasibility problems into much easier, smaller pieces, as discussed in the following.

7.4.1 Decoupling through the Tree SVD

Throughout this section, we assume that the family $K$ fulfills the hierarchy condition (as introduced in Definition 3.14)
\[
J \subset S \;\vee\; S \subset J \;\vee\; J \cap S = \emptyset, \qquad \forall J, S \in K. \tag{7.10}
\]
As provided by Theorem 3.16 and Section 3.3.1, there is a corresponding graph $G = (V, E, L)$, such that for each $J \in K$, there is an edge $e = e_J = \{v, w\} \in E$, $J = J_e$, which yields the partition $\{J, D \setminus J\}$ of modes, i.e.
\[
\alpha_J := \{\alpha_j\}_{j \in J} = b_v(w) := \{m(h) \mid h \in \mathrm{branch}_v(w)\}.
\]
For a tensor tree network $N = \{N_v\}_{v \in V}$ corresponding to this graph, given
\[
A = A(\alpha_1, \ldots, \alpha_d) = \bigodot_{v \in V} N_v \in \mathbb{K}^{n_1 \times \ldots \times n_d},
\]


these edges are also in direct relation to the singular values $\sigma^{(J)} = \mathrm{sv}(A^{(J)}) = \mathrm{sv}(A^{(\alpha_J)})$. We hence may also reference the singular values indirectly via edges $e = e_J$ instead of $J = J_e$, as in Theorem 3.16,
\[
\sigma_e := \sigma^{(J_e)}, \qquad e \in E.
\]

In particular, given the graph $G$, the set $K$ does not necessarily need to be specified, as in the following theorem. It is a generalization of the result for the tensor train format (cf. [68]).

Theorem 7.8 (Decoupling). Let $\{\sigma^{(J)}\}_{J \in K} \in (\mathcal{D}^\infty_{\geq 0})^K$ be a family of singular values, for a hierarchical family $K$ corresponding to a tree $G = (V, E)$. Further, let $\Sigma_e = \mathrm{diag}((\sigma_e)_+)$, $e \in E$. Then the family is feasible for $n \in \mathbb{N}^d$ if and only if for each vertex $v \in V$, there exists a node $N_v = N_v(m(v))$ with property ii) of the tree SVD:

For each $w \in \mathrm{neighbor}(v) \subset V$ (hence $\{v, w\} \in E$), the node
\[
N_w \odot \bigodot_{h \in (\mathrm{neighbor}(w) \cap V) \setminus \{v\}} \Sigma_{w,h} \tag{7.11}
\]
is orthogonal with respect to $m(v, w) = m(\sigma_{v,w})$.

Proof. By definition, the family is feasible iff there is a tensor $A$ with according singular values. The assertion then follows directly through property ii) of the tree SVD, Theorem 3.16.

[Figure 7.1 shows the two decoupled nodes of the network from Fig. 3.5: $N_3$ with attached singular values $\sigma_{1,3}^+$, $\sigma_{2,3}^+$, $\sigma_{3,4}^+$, and $N_4$ with attached $\sigma_{3,4}^+$, $\sigma_{4,5}^+$.]

Figure 7.1: Example for the decoupling Theorem 7.8 for the network as in Fig. 3.5. The nodes $v \in \{1, 2, 5\}$ introduce only minor conditions (cf. Remark 7.12). Note that in this chapter, singular values are considered infinite sequences (therefore the $+$).

For $S_w := N_w \odot \bigodot_{h \in \mathrm{neighbor}(w) \cap V} \Sigma_{w,h}$, an alternative characterization of the orthogonality constraint Eq. (7.11) is
\[
S_w^H \odot_{\setminus m(v,w)} S_w = \Sigma^2_{w,v}, \tag{7.12}
\]
for all $v \in \mathrm{neighbor}(w) \cap V$. This provides the following corollary, which, for the Tucker format, has been introduced in [10].

Corollary 7.9. For a hierarchical family $K$, feasibility of singular values depends on the solvability of a system of sparse, second order polynomials.

Constructive algorithms can build on this assertion as indicated in Section 7.4.2. Note however that through the QMP results, we already know that feasibility in such a case is determined through linear inequalities in the squared singular values. The theorem takes simpler forms for specific formats, such as the Tucker (cf. Theorem 7.18 and Corollary 7.15) and tensor train format (cf. Corollaries 8.3 and 8.4). Each single node $N_v$ for fixed $v \in V$


may itself be embedded into a small network if we attach an artificial, orthogonal node to each leg of $N_v$ which is not a leg in $G$:

Lemma 7.10 (Supplementation to tree SVD). Let $v \in V$ be fixed. For each vertex $w \in \mathrm{neighbor}(v) \subset V$, let further $\beta_{i_w} = m(v, w)$ (for some $i_w \in \mathbb{N}$) and let $Q_w = Q_w(\beta'_{i_w}, \beta_{i_w}) \in \mathbb{K}^{d(\beta_{i_w}) \times d(\beta_{i_w})}$ be a $\beta_{i_w}$-orthogonal node. Additionally, let $N_v$ fulfill the above property ii), Eq. (7.11). Then the network
\[
W := \{N_v\} \cup \{Q_w\}_{w \in \mathrm{neighbor}(v)} \cup \{\sigma_{v,w}^+\}_{w \in \mathrm{neighbor}(v)}
\]
is the tree SVD of the tensor $T_v := \bigodot_{H \in W} H$, which thereby has singular values
\[
\sigma_{v,w} = \mathrm{sv}\bigl(T_v^{(\beta'_{i_w})}\bigr)
\]
for $w \in \mathrm{neighbor}(v)$.

Proof. The tree SVD of $T_v$ directly yields the orthogonality conditions for $Q_w$ as well as Eq. (7.11) for $N_v$. Hence, the singular values of $T_v$ are as stated.

[Figure 7.2 shows the two supplemented networks: $N_3$ with attached orthogonal nodes $Q_1, Q_2, Q_4$ and singular values $\sigma_{1,3}^+$, $\sigma_{2,3}^+$, $\sigma_{3,4}^+$, and $N_4$ with attached $Q_3, Q_5$ and $\sigma_{3,4}^+$, $\sigma_{4,5}^+$.]

Figure 7.2: Example for the attachment of nodes as in Lemma 7.10 for the decoupled network as in Fig. 7.1. The left tensor corresponds to Tucker feasibility and the right one to TT-feasibility in $d = 3$. Note that in this chapter, singular values are considered infinite sequences (therefore the $+$).

Combining the previous results, we obtain the following corollary.

Corollary 7.11 (Formulation as feasibility problem). Let $\{\sigma^{(J)}\}_{J \in K} \in (\mathcal{D}^\infty_{\geq 0})^K$ be a family of singular values, for a hierarchical family $K$ corresponding to the tree graph $G = (V, E)$. Then this family is feasible for $n \in \mathbb{N}^d$ if and only if for each single vertex $v \in V$, the following holds true:

For $\{w_1, \ldots, w_k\} := \mathrm{neighbor}(v)$, $k \in \mathbb{N}$, and $\{j_1, \ldots, j_\ell\} = \{i \in \{1, \ldots, d\} \mid \alpha_i \in m(v)\}$, $\ell \in \mathbb{N}_0$, the subfamily $\{\hat\sigma^{(\{i\})}\}_{i=1}^k \in (\mathcal{D}^\infty_{\geq 0})^{\hat K}$ defined by
\[
\hat\sigma^{(\{i\})} := \sigma_{v,w_i}, \qquad i = 1, \ldots, k,
\]
is feasible with respect to the family $\hat K := \{\{i\} \mid i = 1, \ldots, k\}$ and the mode sizes
\[
\hat n = (r_{v,w_1}, \ldots, r_{v,w_k}, n_{j_1}, \ldots, n_{j_\ell}) \in \mathbb{N}^{k+\ell},
\]
where $r_{v,w_i} = \deg(\sigma_{v,w_i})$ (cf. Eq. (3.8)).

For each $v \in V$, the mode sizes $n_{j_1}, \ldots, n_{j_\ell}$ may also be folded into a single mode size $m = n_{j_1} \cdot \ldots \cdot n_{j_\ell} = d(m(v))$.

Proof. For each $v \in V$, the mode sizes contained in $\hat n$ are exactly those of $T_v$ as in Lemma 7.10. The assertion hence follows directly from Theorem 7.8 and Lemma 7.10.


Remark 7.12 (Conditions towards leaves). The conditions towards nodes $N_v = N_v(\alpha_i, \beta_j)$ (for some $i, j$), i.e. those which have exactly one leg and one edge in $G$, are simple, namely that $N_v$ must be $\beta_j$-orthogonal. Such a node exists if and only if $n_i = d(\alpha_i) \geq d(\beta_j)$. Hence, the only condition imposed through these nodes is that the number of nonzero singular values in $\sigma_{e_J} = \sigma^{(J)}$, $J = \{i\}$, must be less than or equal to the mode size $n_i = d(\alpha_i)$, i.e.
\[
\deg(\sigma^{(\{i\})}) \leq n_i.
\]
For the simple matrix case $A = U \Sigma V^H \in \mathbb{K}^{n_1 \times n_2}$, this, as is known, constitutes the only restriction to feasibility, i.e. $\mathrm{rank}(A) \leq \min\{n_1, n_2\}$.

Considering the (pure) QMP results for the three-dimensional Tucker format, the above assertions reveal that squared feasible singular values associated to hierarchical families form closed, convex, polyhedral cones, as formalized in the following theorem. Its proof follows a similar approach as the one of Theorem 7.18.

Theorem 7.13. Let $K$ be a hierarchical family, $n \in \mathbb{N}^d$ and $r \in \mathbb{N}^K$. Further, let $\mathcal{F}_{K,r,n}$ be the set of families $\{\sigma^{(J)}\}_{J \in K} \in \bigtimes_{J \in K} \mathcal{D}^{r^{(J)}}_{\geq 0}$ for which $\{\tilde\sigma^{(J)}\}_{J \in K}$, $\tilde\sigma^{(J)} := (\sigma^{(J)}, 0, \ldots) \in \mathcal{D}^\infty_{\geq 0}$, $J \in K$, is feasible for $K$ and the mode sizes $n$. Then
\[
\mathcal{F}^2_{K,r,n} := \Bigl\{ \bigl((\sigma^{(J_1)}_1)^2, \ldots, (\sigma^{(J_1)}_{r^{(J_1)}})^2, (\sigma^{(J_2)}_1)^2, \ldots, (\sigma^{(J_k)}_{r^{(J_k)}})^2\bigr) \;\Big|\; \{\sigma^{(J)}\}_{J \in K} \in \mathcal{F}_{K,r,n} \Bigr\} \subset \mathcal{D}^{\sum_{i=1}^k r^{(J_i)}}_{\geq 0},
\]
for $K =: \{J_1, \ldots, J_k\}$, is a closed, convex, polyhedral cone.

Proof. We extend the set $K$ to a hierarchical family $K \subset \tilde K \subset \{J \subset I \mid J \neq \emptyset\}$ such that each vertex in the graph $\tilde G = (\tilde V, \tilde E, \tilde L)$ corresponding to $\tilde K$ has either three neighbors and no leg (as in the left diagram in Fig. 7.1), or is a leaf as in Remark 7.12. The ranks $r \in \mathbb{N}^K$ are further supplemented to $\tilde r \in \mathbb{N}^{\tilde K}$ using largest possible values with respect to the given mode sizes $n$ (cf. Remark 3.17). Thereby, in the situation of Corollary 7.11, for each $v \in \tilde V$ that is not a leaf, we have $\hat n \in \mathbb{N}^3$ and the subfamily $\hat K = \{\{1\}, \{2\}, \{3\}\}$. This case corresponds to the three-dimensional (i.e. $d - 1 = 3$) Tucker format. Thus, by the (pure) QMP results in [63] (cf. Section 7.2.2), as well as Corollary 7.11 and Remark 7.12, we know that $\mathcal{F}^2_{\tilde K, \tilde r, n}$ is a closed, convex, polyhedral cone. Through elimination of all entries corresponding to the additional sets $J \in \tilde K \setminus K$ by according projections of $\mathcal{F}^2_{\tilde K, \tilde r, n}$, we obtain $\mathcal{F}^2_{K,r,n}$. This set is hence again such a cone.

7.4.2 Iterative Algorithms to Construct Tensors with Prescribed Singular Values

Even without distinct knowledge about feasibility constraints, one can (attempt to) construct tensors with prescribed singular values. The alternating projection method for the Tucker format is presented in [49]. The modification in order to meet the requirement in Corollary 7.11 is straightforward and shown in Algorithm 14. The fixed points of this iteration are tensors $T \in \mathbb{R}^{r_1 \times \ldots \times r_k \times m}$ for which
\[
\mathrm{sv}(T^{(j)}) = \sigma^{(j)}.
\]
If the algorithm converges, the tensor $T$ proves the feasibility of $\{\sigma^{(j)}\}_{j=1}^k$ for the mode sizes $n = (r_1, \ldots, r_k, m)$ and the family $K := \{\{j\} \mid j = 1, \ldots, k\}$. This is the tensor required in Corollary 7.11 (if one assigns according mode names and sets $m = n_{j_1} \cdot \ldots \cdot n_{j_\ell}$).


Algorithm 14 Alternating projection method

Input: potential singular values $\sigma^{(j)} \in \mathcal{D}^{r_j}_{>0}$ for $r_j \in \mathbb{N}$, $j = 1, \ldots, k$, and the mode size $m \in \mathbb{N}$ (as well as $\mathrm{tol} > 0$, $\mathrm{itermax} > 0$)
Output: if successful, tensor $T$ with approximate singular values $\mathrm{sv}_j(T) = \sigma^{(j)}$, $j = 1, \ldots, k$

1: procedure altproj($\sigma$, $m$)
2:   initialize $N_0^{(1)} \in \mathbb{K}^{r_1 \times \ldots \times r_k \times m}$ randomly   ▷ and assign labels $\beta_1, \ldots, \beta_k, \alpha$ to $N^{(\cdot)}_{\cdot}$
3:   abbreviate $\sigma_j := \sigma^{(j)}$, $j = 1, \ldots, k$   ▷ and assign $\beta_j$ to $\sigma_j$
4:   set $\sigma_j^{(0)} \equiv 0$, $\mathrm{relres} = 1$ and $i = 0$
5:   while $\mathrm{relres} > \mathrm{tol}$ and $i \leq \mathrm{itermax}$ do
6:     $i = i + 1$
7:     for $j = 1, \ldots, k$ do
8:       calculate the SVD and set $U_j^{(i)} \Sigma_j^{(i)} V_j^{(i)} = (N_{j-1}^{(i)})^{(j)}$, for $\Sigma_j^{(i)} = \mathrm{diag}((\sigma_j^{(i)})_+)$   ▷ then $(N_j^{(i)})^{(j)} = (N_j^{(i)})^{(\beta_j)}$
9:       set $N_j^{(i)}$ via $(N_j^{(i)})^{(j)} = U_j^{(i)} \Sigma_j V_j^{(i)}$   ▷ i.e. $N_j^{(i)} = (U_j^{(i)}, \sigma_j, V_j^{(i)})$
10:      end for
11:      set $N_0^{(i+1)} = N_k^{(i)}$
12:      $\mathrm{relres} = \max_{j=1,\ldots,k} \Bigl( \max_{\ell=1,\ldots,r_j} \Bigl| \frac{(\sigma_j^{(i)})_\ell}{(\sigma_j)_\ell} - 1 \Bigr| \Bigr)$   ▷ entrywise relative residual
13:    end while
14:    if $\mathrm{relres} \leq \mathrm{tol}$ then
15:      return $T = N_0^{(i+1)}$
16:    else
17:      return that the procedure failed, $\sigma$ might not be feasible
18:    end if
19: end procedure

The article [49] shows that each replacement of singular values is equivalent to a projection in the Frobenius norm, i.e.
\[
N^{(i)}_j = \operatorname*{argmin}_{W \in \mathcal{M}^{(j)}_{\sigma^{(j)}}} \|W - N^{(i)}_{j-1}\|_F, \qquad \mathcal{M}^{(j)}_{\sigma^{(j)}} := \{W \mid \mathrm{sv}(W^{(j)}) = \sigma^{(j)}\}.
\]
Unfortunately, there are no convergence proofs for this method. Only for $k = 2$, one can prove that it at least cannot diverge [68]. A further numerical approach is for example a suitable Newton method [47]. On the other hand, a more algebraically centered attempt is the transformation of the second order polynomial system Eq. (7.12) into a so-called LS-CPD [10], as it is therein done for the Tucker format (cf. Eq. (3.26)). This approach is easily generalized, for example to the tensor train format (cf. Eq. (8.5)).
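A minimal NumPy sketch of Algorithm 14 for the real case follows; it assumes $m \cdot \prod_{i \neq j} r_i \geq r_j$ for every $j$, strictly positive prescribed values, and uses our own default tolerances.

```python
import numpy as np

def altproj(sigmas, m, tol=1e-10, itermax=5000, seed=0):
    """Alternating projection (Algorithm 14), real case.

    sigmas : list of k prescribed tuples, sigmas[j] of length r_j (positive, decreasing);
    m      : size of the additional last mode.
    Returns T of shape (r_1, ..., r_k, m) with sv(T^{(j)}) approx sigmas[j], or None."""
    rng = np.random.default_rng(seed)
    sigmas = [np.asarray(s, dtype=float) for s in sigmas]
    ranks = [len(s) for s in sigmas]
    k = len(sigmas)
    T = rng.standard_normal(tuple(ranks) + (m,))
    for _ in range(itermax):
        relres = 0.0
        for j in range(k):
            rest = [ranks[i] for i in range(k) if i != j] + [m]
            M = np.moveaxis(T, j, 0).reshape(ranks[j], -1)        # mode-j matricization
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            relres = max(relres, np.max(np.abs(s / sigmas[j] - 1.0)))
            M = U @ np.diag(sigmas[j]) @ Vt                        # replace singular values by the prescribed ones
            T = np.moveaxis(M.reshape([ranks[j]] + rest), 0, j)
        if relres <= tol:
            return T
    return None                                                    # sigma might not be feasible
```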

7.4.3 Feasibility of Largest Tucker Singular Values

We consider the feasibility problem (Definition 7.2) for the family
\[
K_{\mathrm{Tucker}} = \{\{1\}, \{2\}, \ldots, \{d-1\}, \{1, \ldots, d-1\}\},
\]


which yields the matricizations for the Tucker format (or the HOSVD). Instead of the last set $\{1, \ldots, d-1\}$, we use the equivalent set $\{d\}$ as in Eq. (3.25). The corresponding family of singular values $\{\sigma^{(J)}\}_{J \in K}$ is abbreviated $\sigma^{(\mu)}_{\mathrm{Tucker}} := \sigma^{(\{\mu\})}$, as in Section 3.4.2, and we recall
\[
\sigma_{\mathrm{Tucker}} := (\sigma^{(1)}_{\mathrm{Tucker}}, \ldots, \sigma^{(d)}_{\mathrm{Tucker}}) := \mathrm{sv}_{\mathrm{Tucker}}(A), \qquad \sigma^{(\mu)}_{\mathrm{Tucker}} := \mathrm{sv}(A^{(\mu)}),
\]
while we omit the index Tucker in this subsection. Feasibility for the Tucker format then states as follows.

Definition 7.14 (Tucker feasibility). Let $\sigma = (\sigma^{(1)}, \ldots, \sigma^{(d)}) \in (\mathcal{D}^\infty_{\geq 0})^d$ (potential singular values). Then $\sigma$ is called Tucker feasible for $n = (n_1, \ldots, n_d) \in \mathbb{N}^d$ if there exists a tensor $A \in \mathbb{K}^{n_1 \times \ldots \times n_d}$ such that $\sigma = \mathrm{sv}_{\mathrm{Tucker}}(A)$.

The conditions for feasibility can be separated into constraints towards the size $n$ and a single node.

Corollary 7.15 (Reduction to core $C$). The feasibility of $\sigma$ is equivalent to the existence of an (all-orthogonal) core $S \in \mathbb{K}^{r_1 \times \ldots \times r_d}$ for which Eq. (3.26) holds true, for $r_\mu \leq n_\mu$, $\mu = 1, \ldots, d$.

Proof. This is Theorem 7.8 for the Tucker format.

The article [23] establishes that for any tensor $A \in \mathbb{C}^{n_1 \times \ldots \times n_d}$ with largest Tucker singular values $(\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1)$, it holds that
\[
\sum_{s \in D \setminus \{\mu\}} (\sigma^{(s)}_1)^2 \;\leq\; (d-2)\|A\|^2 + (\sigma^{(\mu)}_1)^2, \tag{7.13}
\]
\[
\|A\| \;\geq\; \sigma^{(\mu)}_1 \;\geq\; \frac{\|A\|}{\sqrt{n_\mu}}, \qquad \mu = 1, \ldots, d. \tag{7.14}
\]
The two latter inequalities just state that each $\sigma^{(\mu)}_1$ can indeed be the largest singular value.

So for a tensor of dimension $d = 3$, the nontrivial inequalities are
\begin{align}
(\sigma^{(1)}_1)^2 + (\sigma^{(2)}_1)^2 &\leq \|A\|^2 + (\sigma^{(3)}_1)^2, \tag{7.15}\\
(\sigma^{(1)}_1)^2 + (\sigma^{(3)}_1)^2 &\leq \|A\|^2 + (\sigma^{(2)}_1)^2, \notag\\
(\sigma^{(2)}_1)^2 + (\sigma^{(3)}_1)^2 &\leq \|A\|^2 + (\sigma^{(1)}_1)^2. \notag
\end{align}

For the case $n_\mu = n$, $\mu = 1, \ldots, d$, [23] also shows that these inequalities are sufficient for the existence of a tensor $A \in \mathbb{C}^{n \times \ldots \times n}$ with largest singular values $(\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1)$. This is first shown for the case $d = 3$, and then generalized to arbitrary dimension.
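A small helper that evaluates Eqs. (7.13) and (7.14) for prescribed largest Tucker singular values could look as follows (a sketch with our own tolerance handling; by the cited result, for a common mode size $n$ the check is also sufficient):

```python
import numpy as np

def largest_tucker_feasible(s1, normA, n, eps=1e-12):
    """Check Eqs. (7.13) and (7.14) for prescribed largest Tucker singular values
    s1 = (sigma_1^{(1)}, ..., sigma_1^{(d)}) of a tensor with norm normA and common mode size n."""
    s1 = np.asarray(s1, dtype=float)
    d = len(s1)
    sq, w = s1**2, normA**2
    ok_bounds = np.all((s1 <= normA + eps) & (s1 >= normA / np.sqrt(n) - eps))            # Eq. (7.14)
    ok_cuts = all(np.sum(sq) - sq[mu] <= (d - 2) * w + sq[mu] + eps for mu in range(d))   # Eq. (7.13)
    return bool(ok_bounds and ok_cuts)
```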

It is tempting to transfer these conditions to tensors with non-uniform mode sizes. The article [23] however already provides examples for nonfeasible constellations in that case which yet fulfill the above inequalities. We supplement their results with a necessary inequality based on feasibility for TT-formats in Chapter 8. For example, in three dimensions, a Tucker tree SVD $(C, U_1, \ldots, U_3, \Sigma^{(1)}, \ldots, \Sigma^{(3)})$ of a tensor $A = A(\alpha)$ (cf. Section 3.4.2) can be partially contracted to a TT tree SVD $(U_1, G_2, U_3, \Sigma^{(1)}, \Sigma^{(3)})$, such that
\[
A = U_1 \odot \Sigma^{(1)} \odot G_2 \odot \Sigma^{(3)} \odot U_3, \qquad G_2 = C \odot \Sigma^{(2)} \odot U_2.
\]
By the later (independent) Corollary 8.32 (therein for $r = 1$ and $m = n_2$), it follows that
\[
(\sigma^{(1)}_1)^2 \leq n_2 (\sigma^{(3)}_1)^2 \quad\Leftrightarrow\quad \sigma^{(3)}_1 \geq \frac{1}{\sqrt{n_2}}\, \sigma^{(1)}_1. \tag{7.16}
\]


For $\sigma^{(\mu)}_1 := \frac{1}{\sqrt{n_\mu}}$, $\|A\| = 1$, the three inequalities in Eq. (7.15) are fulfilled if only $n_1, n_2, n_3 \geq 2$. However, Eq. (7.16) implies that in order for these values to be candidates for feasible, largest Tucker singular values, it must hold that
\[
\frac{1}{\sqrt{n_3}} \;\geq\; \frac{1}{\sqrt{n_2}} \frac{1}{\sqrt{n_1}} \quad\Leftrightarrow\quad n_3 \leq n_1 n_2.
\]

Due to symmetry, this also holds for all permutations of modes. We summarize the above discussion in the following remark (cf. [23]).

Remark 7.16. The conditions $\sigma^{(\mu)}_1 \geq \frac{1}{\sqrt{n}}$, $\mu = 1, \ldots, d$, cannot simply be relaxed to $\sigma^{(\mu)}_1 \geq \frac{1}{\sqrt{n_\mu}}$, $\mu = 1, \ldots, d$, regarding a tensor $A \in \mathbb{C}^{n_1 \times \ldots \times n_d}$, without losing sufficiency.

Although already shown in the final publication of [23], we will provide a different kind of, as well as constructive, proof of the following Theorem 7.18. To this end, we use the decoupling Theorem 7.8 (as also used for the tensor train case) as well as the results for three-dimensional tensors by [23]. The technical part is provided by the following lemma, for which we define
\[
\vec 1_n := (1, \ldots, 1)^T \in \mathbb{R}^n, \qquad E_n := \vec 1_n \vec 1_n^T - 2 I_n \in \mathbb{R}^{n \times n},
\]
where $I_n$ is the identity matrix. Further, the index $\neq k$ denotes the restriction of all entries, or columns, or rows, to all but the $k$-th one.

Lemma 7.17. Let $2 \leq k, \ell \in \mathbb{N}$ as well as $m = k + \ell - 2$. Then, for $w > v > 0$, the polyhedron
\[
P := \{x \in \mathbb{R}^{m+1} \mid (E_k, 0)\,x \leq (k-2)w \cdot \vec 1_k,\; (0, E_\ell)\,x \leq (\ell-2)w \cdot \vec 1_\ell,\; x \leq w \cdot \vec 1_{m+1},\; -x \leq -v \cdot \vec 1_{m+1}\}
\]
projected onto $H = \{x \in \mathbb{R}^{m+1} \mid x_k = 0\}$ is given by
\[
\mathrm{Pr}_H(P) = \{(y_1, \ldots, y_{k-1}, 0, y_k, \ldots, y_m) \mid E_m\, y \leq w(m-2) \cdot \vec 1_m,\; y \leq w \cdot \vec 1_m,\; -y \leq -v \cdot \vec 1_m\}.
\]

For example, for $k = 3$ and $\ell = 4$, we have
\[
P = \Biggl\{ x \in \mathbb{R}^6 \;\Bigg|\; \begin{pmatrix} -1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \leq \begin{pmatrix} w \\ w \\ w \end{pmatrix},\;
\begin{pmatrix} -1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} x_3 \\ x_4 \\ x_5 \\ x_6 \end{pmatrix} \leq \begin{pmatrix} 2w \\ 2w \\ 2w \\ 2w \end{pmatrix},\;
v \leq x_i \leq w,\; i = 1, \ldots, 6 \Biggr\}
\]
and
\[
\mathrm{Pr}_H(P) = \Biggl\{ (y_1, y_2, 0, y_3, y_4, y_5) \;\Bigg|\; \begin{pmatrix} -1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & 1 & 1 \\ 1 & 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & -1 & 1 \\ 1 & 1 & 1 & 1 & -1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix} \leq \begin{pmatrix} 3w \\ 3w \\ 3w \\ 3w \\ 3w \end{pmatrix},\;
v \leq y_i \leq w,\; i = 1, \ldots, 5 \Biggr\}.
\]

The result $\mathrm{Pr}_H(P)$ (ignoring the zero component) for
\[
y = \bigl((\sigma^{(1)}_1)^2, \ldots, (\sigma^{(d)}_1)^2\bigr), \qquad w = \|A\|^2, \qquad v = \frac{1}{\sqrt{n}},
\]
is exactly the polyhedron defined by the inequalities Eq. (7.13).


Proof. The above polyhedron is given by $P = \{x \in \mathbb{R}^{m+1} \mid B \cdot (x; -1) \leq 0\}$ for
\[
B = \begin{pmatrix} (E_k, 0) & (k-2)w \cdot \vec 1_k \\ (0, E_\ell) & (\ell-2)w \cdot \vec 1_\ell \\ I_{m+1} & w \cdot \vec 1_{m+1} \\ -I_{m+1} & -v \cdot \vec 1_{m+1} \end{pmatrix}.
\]
Let $J_{+/0/-} = \{i \mid B_{i,k} >/=/< 0\}$, respectively. Fourier–Motzkin elimination states that
\[
\mathrm{Pr}_H(P) = \{(y_1, \ldots, y_{k-1}, 0, y_k, \ldots, y_m) \mid B_{J_0, \neq k} \cdot (y; -1) \leq 0,\; C \cdot (y; -1) \leq 0\},
\]
where
\[
C = B_{J_+,k} \otimes B_{J_-,\neq k} - B_{J_+,\neq k} \otimes B_{J_-,k}.
\]
First of all,
\[
B_{J_0,\neq k} = \begin{pmatrix} I_m & w \cdot \vec 1_m \\ -I_m & -v \cdot \vec 1_m \end{pmatrix},
\]
as $x_k = 0$ by assumption. Furthermore, since $B_{J_+,k} \equiv 1$ and $B_{J_-,k} \equiv -1$, the rows of $C$ consist of all possible sums of one row of $B_{J_-,\neq k}$ and one of $B_{J_+,\neq k}$ each. These two matrices are given by
\[
B_{J_-,\neq k} = \begin{pmatrix} \vec 1_{k-1}^T & 0 & (k-2)w \\ 0 & \vec 1_{\ell-1}^T & (\ell-2)w \\ 0 & 0 & -v \end{pmatrix}, \qquad
B_{J_+,\neq k} = \begin{pmatrix} E_{k-1} & 0 & (k-2)w \cdot \vec 1_{k-1} \\ 0 & E_{\ell-1} & (\ell-2)w \cdot \vec 1_{\ell-1} \\ 0 & 0 & w \end{pmatrix}.
\]
Each row of $C$ hence equals a row in either
\[
C_1 := \begin{pmatrix} E_{k-1} + 1 & 0 & 2(k-2)w \cdot \vec 1_{k-1} \\ 0 & E_{\ell-1} + 1 & 2(\ell-2)w \cdot \vec 1_{\ell-1} \\ 0 & 0 & w - v \end{pmatrix}, \qquad
C_2 := \begin{pmatrix} E_m & (m-2)w \cdot \vec 1_m \end{pmatrix},
\]
\[
C_3 := \begin{pmatrix} \vec 1_{k-1}^T & 0 & (k-1)w \\ 0 & \vec 1_{\ell-1}^T & (\ell-1)w \end{pmatrix}, \qquad
C_4 := \begin{pmatrix} E_{k-1} & 0 & ((k-2)w - v) \cdot \vec 1_{k-1} \\ 0 & E_{\ell-1} & ((\ell-2)w - v) \cdot \vec 1_{\ell-1} \end{pmatrix}.
\]
Now, the third row in $C_1$ is redundant since $0 < v < w$. Each other row in $C_1$, $C_3$ as well as $C_4$ is redundant to the rows of $B_{J_0,\neq k}$ (with respect to the represented polyhedron). The remaining, nonredundant rows are those defining $\mathrm{Pr}_H(P)$, which finishes the proof.

The similarity of the polyhedron $P$ to its projection $\mathrm{Pr}_H(P)$ makes it possible to proceed inductively with such projections, leading to the alternative, but constructive (cf. Section 7.4.4) proof of the following theorem.

Theorem 7.18 (Sufficiency [23]). Equations (7.13) and (7.14) as well as $\deg(\sigma^{(\mu)}) \leq n$, $\mu = 1, \ldots, d$, are sufficient for the existence of a tensor $A \in \mathbb{C}^{n \times \ldots \times n}$ of dimension $d$ with largest singular values $\sigma = \sigma_{\mathrm{Tucker}} = (\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1)$.

Proof. Let $A = A(\alpha_1, \ldots, \alpha_d)$ and $\sigma^{(\mu)} = \sigma^{(\mu)}(\beta_\mu)$. The existence of a tensor $A$ with singular values $\sigma$ is equivalent to the existence of a network
\begin{align*}
&N_1 = N_1(\alpha_1, \beta_1), \quad N_\mu = N_\mu(\alpha_\mu, \beta_\mu),\; \mu = 2, \ldots, d-1, \quad N_d = N_d(\alpha_d, \beta_d),\\
&M_1 = M_1(\beta_1, \beta_2, \gamma_1), \quad M_\mu = M_\mu(\gamma_{\mu-1}, \beta_{\mu+1}, \gamma_\mu),\; \mu = 2, \ldots, d-3,\\
&M_{d-2} = M_{d-2}(\gamma_{d-3}, \beta_{d-1}, \beta_d), \quad \theta^{(\mu)} = \theta^{(\mu)}(\gamma_\mu),\; \mu = 1, \ldots, d-2,
\end{align*}
such that $A = \bigodot(N_1, \ldots, N_d, M_1, \ldots, M_{d-2}, \sigma^{(1)}, \ldots, \sigma^{(d)}, \theta^{(1)}, \ldots, \theta^{(d-2)})$ and
\[
R_{\sigma \cup \theta} := (N \cup M, \sigma \cup \theta) \tag{7.17}
\]
is the tree SVD of $A$ (depicted in Fig. 7.3 for $d = 6$).

[Figure 7.3 shows the chain network: leaves $N_1, \ldots, N_6$ with singular value nodes $\sigma^{(1)}, \ldots, \sigma^{(6)}$, and inner nodes $M_1, \ldots, M_4$ connected through $\theta^{(1)}, \theta^{(2)}, \theta^{(3)}$; the inner part is marked as the core $C$.]

Figure 7.3: The tree SVD $R_{\sigma \cup \theta}$ of the tensor $A$. The additional nodes $M$ and $\theta$ form the core $C$ in the corresponding Tucker SVD.

The network $R_{\sigma \cup \theta}$ is a tree SVD iff the corresponding orthogonality constraints Eq. (7.11) towards the inner nodes $M_1, \ldots, M_{d-2}$ and the leaves $N_1, \ldots, N_d$ are fulfilled. Such leaves can be constructed since $\deg(\sigma^{(\mu)}) \leq n$, $\mu = 1, \ldots, d$ (cf. Remark 7.12), by assumption. The existence of the inner nodes on the other hand is, due to Lemma 7.10, equivalent to the existence of three-dimensional tensors $S_1, \ldots, S_{d-2}$ with Tucker singular values
\[
\mathrm{sv}_{\mathrm{Tucker}}(S_1) = (\sigma^{(1)}, \sigma^{(2)}, \theta^{(1)}), \quad \mathrm{sv}_{\mathrm{Tucker}}(S_2) = (\theta^{(1)}, \sigma^{(3)}, \theta^{(2)}), \quad \ldots, \quad \mathrm{sv}_{\mathrm{Tucker}}(S_{d-2}) = (\theta^{(d-3)}, \sigma^{(d-1)}, \sigma^{(d)}),
\]
where each triplet of singular value nodes is the one neighboring $M_\mu$, $\mu = 1, \ldots, d-2$, within the network $R_{\sigma \cup \theta}$ (cf. Fig. 7.3). In order to prove the existence of a tensor $A$ as above, it hence remains to show that there are mode sizes $d(\gamma_\mu) \in \mathbb{N}$ and singular values $\theta^{(\mu)}$, $\mu = 1, \ldots, d-2$, such that $S_1, \ldots, S_{d-2}$ (of respective size) can be constructed.

As it turns out, we can choose $d(\gamma_\mu) = n$, allowing us to apply earlier results. By the Tucker feasibility result for $d = 3$ [23], the existence of a tensor $S_\mu \in \mathbb{C}^{n \times n \times n}$, for each single $\mu = 1, \ldots, d-2$, with largest singular values $(\sigma^{(1)}_1, \sigma^{(2)}_1, \theta^{(1)}_1), \ldots, (\theta^{(d-3)}_1, \sigma^{(d-1)}_1, \sigma^{(d)}_1)$, respectively, depends on the inequalities Eqs. (7.13) and (7.14) (for $d = 3$). Due to the symmetry of the explicit construction provided in [23], all singular values but the largest one can be assumed to equal each other in that case. In particular, we can choose $(\theta^{(\mu)}_2)^2 = \ldots = (\theta^{(\mu)}_n)^2 = \|A\|^2 - (\theta^{(\mu)}_1)^2$, for each $\mu = 1, \ldots, d-2$ (if the corresponding triplet is feasible), such that each neighboring pair $S_\mu, S_{\mu+1}$ indeed shares the same singular values $\theta^{(\mu)}$.

Let $P := P_1 \cap \ldots \cap P_{d-2}$ be the intersection of the Tucker feasibility cones defined through shifted versions of Eqs. (7.13) and (7.14) (for $d = 3$), where each $P_\mu$ corresponds to the inequalities regarding $S_\mu$, $\mu = 1, \ldots, d-2$. The tensor $A$ with prescribed largest singular values $\sigma^{(\mu)}_1$, $\mu = 1, \ldots, d$, hence exists if there are singular values $\theta^{(\mu)}_1$, $\mu = 1, \ldots, d-2$, such that
\[
(y, z) = \bigl((\sigma^{(1)}_1)^2, \ldots, (\sigma^{(d)}_1)^2, (\theta^{(1)}_1)^2, \ldots, (\theta^{(d-2)}_1)^2\bigr) \in P.
\]
Differently phrased, this is equivalent to
\[
y \in P_* = \{y \in \mathbb{R}^d \mid \exists z \in \mathbb{R}^{d-2} : (y, z) \in P\}.
\]


Now, each intersection of two such Tucker feasibility cones such as $P_i$ resembles the cone $P$ as in Lemma 7.17. The projection, which eliminates the coordinate $x_k$ associated to some $\theta^{(\mu)}_1$, further yields $\mathrm{Pr}_H(P)$. As each $\mathrm{Pr}_H(P)$ is again a Tucker feasibility cone such as $P_i$, we can inductively apply Lemma 7.17 until only the coordinates $y = ((\sigma^{(1)}_1)^2, \ldots, (\sigma^{(d)}_1)^2)$ remain. As indicated earlier, the result $P_*$ is the cone given by Eqs. (7.13) and (7.14) for dimension $d$. The resulting inequalities defining $P_*$ are thereby sufficient for the existence of $A$ as above with given, prescribed largest Tucker singular values. This was to be shown.

7.4.4 Direct Construction of Tensors Realizing Prescribed, Largest Tucker Singular Values

The article [23] provides a direct procedure to construct three-dimensional tensors with prescribed, largest Tucker singular values, provided such are feasible. For arbitrary dimensions however, the same method involves the $2^d - d$ vertices of the polyhedron defined by Eq. (7.13). The decomposition $R_{\sigma \cup \theta}$ as shown in Fig. 7.3 on the other hand allows for a fast construction based on only its (at most) three-dimensional components. The feasibility of each of the additional singular values $\theta^{(\mu)}$, $\mu = 1, \ldots, d-2$, is determined by simple constraints. We therefore denote
\[
(\lambda^{(1)}, \ldots, \lambda^{(2d-3)}) := (\sigma^{(1)}, \sigma^{(2)}, \theta^{(1)}, \sigma^{(3)}, \theta^{(2)}, \sigma^{(4)}, \ldots, \theta^{(d-2)}, \sigma^{(d-1)}, \sigma^{(d)}), \tag{7.18}
\]
\[
\lambda^{(i)}_2 = \ldots = \lambda^{(i)}_n := \bigl(\|A\|^2 - (\lambda^{(i)}_1)^2\bigr)^{\frac{1}{2}}, \qquad i = 1, \ldots, 2d-3.
\]

Then for each $\mu = 1, \ldots, d-2$, the triplet of largest singular values $(\lambda^{(2\mu-1)}_1, \lambda^{(2\mu)}_1, \lambda^{(2\mu+1)}_1)$ must fulfill the three inequalities Eq. (7.15) as well as the trivial bounds Eq. (7.14). Valid choices for $\theta^{(1)}_1, \ldots, \theta^{(d-2)}_1$ can hence be quickly determined via a linear programming algorithm, as the number of these sparse inequalities grows only linearly in $d$. Then, based on [23], the tensor $S_\mu$ can be constructed, from which we obtain $M_\mu$. The result is an HT tree SVD of a tensor $A$ with largest Tucker singular values $(\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1)$. The procedure is summarized in Algorithm 15; a small sketch of the linear programming step is given after the algorithm. With the choice of $N$ as in the algorithm, the represented tensor $A$ is even all-orthogonal.

Algorithm 15 Construction of a tensor with prescribed, largest Tucker singular values

Input: feasible, largest Tucker singular values $\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1$ for a common mode size $n$
Output: HT tree SVD $R_{\sigma \cup \theta}$ of a tensor with largest Tucker singular values $\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1$

1: procedure coltsv($\sigma^{(1)}_1, \ldots, \sigma^{(d)}_1$, $n$)
2:   solve for feasible $\theta^{(1)}_1, \ldots, \theta^{(d-2)}_1$ via a linear programming algorithm   ▷ cf. Eq. (7.15)
3:   denote $(\lambda^{(1)}, \ldots, \lambda^{(2d-3)}) := (\sigma^{(1)}, \sigma^{(2)}, \theta^{(1)}, \sigma^{(3)}, \theta^{(2)}, \ldots, \sigma^{(d-1)}, \sigma^{(d)})$   ▷ as in Eq. (7.18)
4:   for $\mu = 1, \ldots, d-2$ do   ▷ can be performed in parallel
5:     construct an all-orthogonal $S_\mu \in \mathbb{C}^{n \times n \times n}$ with Tucker s.v. $(\lambda^{(2\mu-1)}, \lambda^{(2\mu)}, \lambda^{(2\mu+1)})$
6:     set $M_\mu \leftarrow S_\mu \odot \bigl(\mathrm{diag}(\lambda^{(2\mu-1)}_+)^{-1}, \mathrm{diag}(\lambda^{(2\mu)}_+)^{-1}, \mathrm{diag}(\lambda^{(2\mu+1)}_+)^{-1}\bigr)$   ▷ cf. Section 3.4.1
7:   end for
8:   for $\mu = 1, \ldots, d$ do
9:     set $N_\mu$ via $N_\mu^{(\alpha_\mu),(\beta_\mu)} \leftarrow I_n$   ▷ the identity matrix of size $n \times n$
10:  end for
11:  return $R_{\sigma \cup \theta} := (N \cup M, \sigma \cup \theta)$   ▷ as in Eq. (7.17)
12: end procedure
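Line 2 of Algorithm 15 can, for example, be realized as a linear programming feasibility problem in the squared values, as sketched below with `scipy.optimize.linprog`. The chain bookkeeping (one coupling value between consecutive triplets, bounds $\|A\|^2/n \leq (\theta^{(\mu)}_1)^2 \leq \|A\|^2$ from Eq. (7.14)) is our own; the given squared values $y$ are assumed to already satisfy Eq. (7.14).

```python
import numpy as np
from scipy.optimize import linprog

def coupling_values_lp(y, w, n):
    """Search squared coupling values z for the chain of Eq. (7.18) via an LP feasibility problem.

    y : squared prescribed largest Tucker singular values (length d);
    w : squared norm of the target tensor; n : common mode size.
    Returns the squared theta values (possibly empty), or None if the LP is infeasible."""
    d = len(y)
    if d <= 3:
        return np.array([])                 # d = 3: no coupling values needed
    # chain of squared values: ('y', i) is known, ('z', j) is unknown
    chain = [('y', 0), ('y', 1)]
    for i in range(2, d - 1):
        chain += [('z', i - 2), ('y', i)]
    chain += [('y', d - 1)]
    K = sum(1 for kind, _ in chain if kind == 'z')
    A_ub, b_ub = [], []
    for mu in range(d - 2):                 # consecutive triplets along the chain
        trip = chain[2 * mu:2 * mu + 3]
        for c_pos in range(3):              # inequality: sum of the other two <= w + value at c_pos
            row, rhs = np.zeros(K), w
            for pos, (kind, idx) in enumerate(trip):
                sgn = -1.0 if pos == c_pos else 1.0
                if kind == 'z':
                    row[idx] += sgn
                else:
                    rhs -= sgn * y[idx]
            A_ub.append(row)
            b_ub.append(rhs)
    res = linprog(c=np.zeros(K), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(w / n, w)] * K, method="highs")
    return res.x if res.success else None
```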


Chapter 8

Honeycombs and Feasibility of Singular Values in the TT Format

We consider the problem to determine feasibility, Definition 7.2, for the family
\[
K_{\mathrm{TT}} = \{\{1\}, \{1, 2\}, \ldots, \{1, \ldots, d-1\}\}
\]
of the tensor train (TT) format as in Section 3.4.1 (cf. [84, 100]). The corresponding family of singular values $\{\sigma^{(J)}\}_{J \in K}$ is abbreviated using $\sigma^{(\mu)}_{\mathrm{TT}} := \sigma^{(\{1,\ldots,\mu\})}$, and we recall
\[
\sigma_{\mathrm{TT}} := (\sigma^{(1)}_{\mathrm{TT}}, \ldots, \sigma^{(d-1)}_{\mathrm{TT}}) := \mathrm{sv}_{\mathrm{TT}}(A), \qquad \sigma^{(\mu)}_{\mathrm{TT}} := \mathrm{sv}(A^{(J)}), \quad J = \{1, \ldots, \mu\}.
\]
Each $\sigma^{(\mu)}_{\mathrm{TT}} \in \mathcal{D}^\infty_{\geq 0}$ is considered an infinite (weakly decreasing, nonnegative) sequence with finitely many nonzero entries (cf. Definition 7.1). We however omit the index TT in this chapter, since we are only concerned with the TT-format. Here, we also consider the field $\mathbb{K} = \mathbb{C}$ (cf. Section 2.8.2). The feasibility problem, Definition 7.2, for the tensor train format is then as follows.

Definition 8.1 (TT-feasibility [68]). Let $\sigma = (\sigma^{(1)}, \ldots, \sigma^{(d-1)}) \in (\mathcal{D}^\infty_{\geq 0})^{d-1}$ (potential singular values). Then $\sigma$ is called TT-feasible for $n = (n_1, \ldots, n_d) \in \mathbb{N}^d$ if there exists a tensor $A \in \mathbb{K}^{n_1 \times \ldots \times n_d}$ such that $\sigma = \mathrm{sv}_{\mathrm{TT}}(A)$.

As in the general case, the trace property (cf. Eq. (7.2))
\[
\|\sigma^{(\mu)}\|_2 = \|\sigma^{(\nu)}\|_2, \qquad \mu, \nu = 1, \ldots, d-1, \tag{8.1}
\]
must hold true in order for $\sigma$ to be feasible. Since $\sigma$ is formally independent of the tensor size, we have that if it is feasible for $n$, then it is also feasible for any increased values $\tilde n_\mu \geq n_\mu$, $\mu = 1, \ldots, d$. In that sense, we are interested to find the smallest such mode sizes. In the remainder of this chapter, additional to Definition 7.1, we use the following notation:

Definition 8.2 (Negation). For $n \neq \infty$, the negation $-v \in \mathcal{D}^n_{\leq 0}$ of $v \in \mathcal{D}^n_{\geq 0}$ is defined as $-v := (-v_n, \ldots, -v_1)$ (as in [65]).

So for $\gamma = (4, 2, 2, 0, 0, \ldots) \in \mathcal{D}^\infty_{\geq 0}$, we have $-\gamma_+ = (-2, -2, -4)$. With a tilde, we emphasize that a tuple may contain zeros, that is $\tilde\gamma \in \{v \in \mathcal{D}^n_{\geq 0} \mid v_+ = \gamma_+\}$, $n \geq \deg(\gamma)$. For example, we may have $\tilde\gamma = (4, 2, 2, 0) \in \mathcal{D}^4_{\geq 0}$.


8.1 Decoupling for the Tensor Train Format

The family $K_{\mathrm{TT}}$ fulfills the hierarchy condition Eq. (7.10), and the according tensor train format is a tree tensor network. Chapter 7 hence shows that the conditions for TT-feasibility can be decoupled. These conclusions have been derived in [68] separately for the tensor train format. The conditions in Theorem 7.8, translated to the TT-format, state that $\sigma$ is TT-feasible for $n$ if and only if there exist nodes $G_\mu$, $\mu = 1, \ldots, d$, as in Section 3.4.1 (recalling $\Sigma^{(\mu)} = \mathrm{diag}_{\beta_\mu}(\sigma^{(\mu)}_+)$, $\sigma^{(\mu)} = \sigma^{(\mu)}(\beta_\mu)$), such that
\[
\Sigma^{(\mu-1)} \odot G_\mu \;\text{is $\beta_\mu$-unitary} \quad\text{and}\quad G_\mu \odot \Sigma^{(\mu)} \;\text{is $\beta_{\mu-1}$-unitary} \tag{8.2}
\]
for all $\mu = 2, \ldots, d-1$, and further $\deg(\sigma^{(1)}) \leq n_1$ as well as $\deg(\sigma^{(d-1)}) \leq n_d$. These conditions can also be phrased indirectly as in Corollary 7.11:

Corollary 8.3 (Decoupling [68]). $\sigma \in (\mathcal{D}^\infty_{\geq 0})^{d-1}$ is TT-feasible for $n \in \mathbb{N}^d$ if and only if $\deg(\sigma^{(1)}) \leq n_1$, $\deg(\sigma^{(d-1)}) \leq n_d$ and, for each $\mu = 2, \ldots, d-1$, the pair $(\sigma^{(\mu-1)}, \sigma^{(\mu)})$ is TT-feasible for $(\deg(\sigma^{(\mu-1)}), n_\mu, \deg(\sigma^{(\mu)}))$.

For each fixed $\mu = 2, \ldots, d-1$, the conditions Eq. (8.2) towards $G_\mu$ can, for
\[
N(\cdot) := G_\mu(\alpha = \cdot)^{(\beta_{\mu-1}),(\beta_\mu)} \in (\mathbb{K}^{r_{\mu-1} \times r_\mu})^{n_\mu}, \tag{8.3}
\]
\[
\Gamma := (\Sigma^{(\mu-1)})^{(\beta_{\mu-1}),(\beta_{\mu-1})}, \qquad \Theta := (\Sigma^{(\mu)})^{(\beta_\mu),(\beta_\mu)},
\]
equivalently be phrased as the ordinary matrix equations
\[
\sum_{i=1}^{m} N(i)^H\, \Gamma^2\, N(i) = I_{r_\mu}, \qquad \sum_{i=1}^{m} N(i)\, \Theta^2\, N(i)^H = I_{r_{\mu-1}}. \tag{8.4}
\]
The above notation is more common for example in the MPS [100] literature, and $N$ is sometimes referred to as core (cf. Section 2.5.1). Since, through the decoupling, only neighboring pairs of singular values have to be considered, we state the case $d = 3$ separately:

Corollary 8.4 (Node constraints [68]). For a natural number $m \in \mathbb{N}$, a pair $(\gamma, \theta) \in \mathcal{D}^\infty_{\geq 0} \times \mathcal{D}^\infty_{\geq 0}$ is TT-feasible for the triplet $(\deg(\gamma), m, \deg(\theta))$ if and only if there exists $N \in (\mathbb{K}^{\deg(\gamma) \times \deg(\theta)})^m$ for which Eq. (8.4) holds true.

By quite simple matrix manipulation, the previous conditions can be transformed intoan eigenvalue problem.

Theorem 8.5 (Equivalence to an eigenvalue problem [68]Kr19). Let m ∈ N. A pair (γ, θ) ∈D∞≥0 × D∞≥0 is feasible for (deg(γ),m,deg(θ)) if and only if the following holds: there ex-

ist m pairs of Hermitian1, positive semidefinite matrices (A(i), B(i)) ∈ Kdeg(θ)×deg(θ) ×Kdeg(γ)×deg(γ), each with identical (multiplicities of) eigenvalues up to zeros, such thatA :=

∑mi=1A

(i) has eigenvalues θ2+ and B :=

∑mi=1B

(i) has eigenvalues γ2+ .

Proof. ([68]Kr19) (constructive) We show both directions separately.“⇒”: Let (γ, θ) be feasible for (deg(σ(µ−1)), nµ,deg(σ(µ))). Then by Corollary 8.4, forΓ = diag(γ+), Θ = diag(θ+) and a single element N ∈ (Krµ−1×rµ)nµ (cf. Eq. (8.3)), we haveboth

m∑

i=1

N(i)H Γ2 N(i) = Irµ ,

m∑

i=1

N(i) Θ2 N(i)H = Irµ−1 .

1For K = R, Hermitian is just symmetric and the conjugate transpose ·H is just the transpose ·T .

Page 195: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 179

By substitution of N = Γ−1 N Θ−1, this is equivalent to

m∑

i=1

N(i)H N(i) = Θ2,

m∑

i=1

N(i) N(i)H = Γ2. (8.5)

Now, for A(i) := N(i)H N(i) and B(i) := N(i) N(i)H , we have found matrices as desired,since the eigenvalues of A(i) and B(i) are each the same (up to zeros).“⇐”: Let A(i) and B(i) be matrices as required. Then, by eigenvalue decompositions,A = QA Θ2 QHA andB = QB Γ2 QHB for unitaryQA, QB and thereby

∑mi=1Q

HA A(i) QA = Θ2

and∑mi=1Q

HB B(i) QB = Γ2. Then again, by truncated eigenvalue decompositions of these

summands, we obtain

QHA A(i) QA = Vi Si VHi , QHB B(i) QB = Ui Si U

Hi , Si ∈ Rr×r

for r = min(deg(γ),deg(θ)), unitary (eigenvectors) Vi, Ui and shared (positive eigenvalues)

Si. With the choice N(i) := Ui S1/2i V Hi , we arrive at Eq. (8.5), which is equivalent to the

desired statement.

Remark 8.6 (Diagonalization [68]Kr19). In the situation of Theorem 8.5, we can use aneigenvalue decomposition of A to obtain

A = U Θ2 UH =

m∑

i=1

A(i) ⇔ Θ2 =

m∑

i=1

UH A(i) U,

for unitary (eigenvectors) U . This works analogously for B. Since conjugation does notchange eigenvalues, we may hence assume without loss of generality that A = Θ2 andB = Γ2 in Theorem 8.5.

8.2 Feasibility of Pairs

From the previous section, i.e. Corollary 8.3, we see that we only have to consider the TT-feasibility of pairs (γ, θ) for mode sizes (deg(γ),m,deg(θ)). In order to avoid the redundantentries deg(γ) and deg(θ), we from now on abbreviate as follows:

Definition 8.7 (Feasibility of pairs [68]Kr19). For m ∈ N, we say a pair (γ, θ) is feasible form if and only if it is TT-feasible for (deg(γ),m,deg(θ)) (cf. Definition 8.1).

As outlined in Section 7.2, the property is equivalent to the compatibility of (γ2, θ2)for (deg(γ),m) given K = 1, 1, 2. In fact, there exist several results on this topicas discussed in Section 7.2.2, e.g. that compatible pairs form a cone. In the following, weanalyze the problem from the different perspective provided by Theorem 8.5.

8.2.1 Constructive, Diagonal Feasibility

The feasibility of pairs is a reflexive and symmetric relation, but it is not transitive. In somecases, verification can be easier:

Lemma 8.8 (Diagonally feasible pairs [68]Kr19). Let (γ, θ) ∈ D∞≥0×D∞≥0 and a(1), . . . , a(m) ∈Rr≥0, r = max(deg(γ),deg(θ)), and permutations π1, . . . , πm ∈ Sr such that

a(1)i + . . .+ a

(m)i = γ2

i , a(1)π1(i) + . . .+ a

(m)πm(i) = θ2

i , i = 1, . . . , r.

Then (γ, θ) is feasible for m (we write diagonally feasible in that case). For m, r1, r2 ∈ N,γ2

+ = (1, . . . , 1) of length r1 and θ2+ = (k1, . . . , kr2) ∈ Dr2≥0 ∩ 1, . . . ,mr2 , with ‖k‖1 = r1,

the pair (γ, θ) is diagonally feasible for m.

Page 196: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

180 8.2. Feasibility of Pairs

Proof. ([68]Kr19) The given criterion is just the restriction to diagonal matrices, A(j) =

diag(a(j)πj ), B(j) = diag(a(j)), j = 1, . . . ,m, in Theorem 8.5. All sums of zero-eigenvalues,

a(j)i = 0 or a

(j)πj(i)

= 0, j = 1, . . . ,m for fixed i, can be ignored, i.e. we also find diagonal

matrices of actual sizes deg(γ) × deg(γ) and deg(θ) × deg(θ). The subsequent explicit set

of feasible pairs follows immediately by restricting a(`)i ∈ 0, 1 and by using appropriate

permutations.

For example, to show that (γ, θ), γ2+ = (1, 1, 1, 1), θ2

+ = (2, 2), is feasible for m = 2,

we can set a(1) = (1, 1, 0, 0), a(2) = (0, 0, 1, 1) and π1 = Id, π2 = (1, 3, 2, 4). The resultingmatrices in Theorem 8.5 then are B(1) = diag((1, 1, 0, 0)), B(2) = diag((0, 0, 1, 1)) as well asA(1) = A(2) = diag((1, 1)). Following the procedure in Theorem 8.5, we obtain the singleelement N for which Eq. (8.4) holds true:

N(1) =

1√2

0

0 1√2

0 00 0

, N(2) =

0 00 01√2

0

0 1√2

.

Although for m = 2, r ≤ 3, each feasible pair happens to be diagonally feasible, this doesnot hold in general. For example, the pair (γ, θ),

γ2+ = (7.5, 5) and θ2

+ = (6, 3.5, 2, 1), (8.6)

is feasible (cf. Eq. (7.9) or Fig. 8.5) for m = 2, but it is not diagonally feasible. Assume

therefore that there exist a(1)1 + a

(2)1 = γ2

1 , a(1)2 + a

(2)2 = γ2

2 . Then these four diagonal entriesmust already equal the four entries contained in θ2

+. This is however not possible.

Definition 8.9 (Set of feasible pairs [68]Kr19). We define Fm,(r1,r2) as the set of pairs (γ, θ) ∈Dr1≥0 × Dr2≥0, for which (γ, θ) = ((γ, 0, . . .), (θ, 0, . . .)) is feasible for m (cf. Definition 8.7),and

F2m,(r1,r2) := (γ2

1 , . . . , γ2r1 , θ

21, . . . , θ

2r2) | (γ, θ) ∈ Fm,(r1,r2).

The following theorem is a special case of Eq. (7.9) and features a constructive proof asoutlined below.

Theorem 8.10 (Special case [68]Kr19). Let m ∈ N. If r1, r2 ≤ m, then

Fm,(r1,r2) = Dr1≥0 ×Dr2≥0 ∩ (γ, θ) | ‖γ‖2 = ‖θ‖2,

that is, any pair (γ, θ) ∈ D∞≥0 × D∞≥0 with deg(γ),deg(θ) ≤ m, for which the trace propertyholds true, is (diagonally) feasible for m.

Proof. ([68]Kr19) We give a proof by contradiction. Set γ = (γ+, 0, . . . , 0) as well as θ =(θ+, 0, . . . , 0) such that both have length m. Let the permutation π be given by the cycle(1, . . . ,m) and π` := π`−1. For each k, let Rk := (i, `) | π`(k) = i. Now, let the

nonnegative eigenvalues a(`)i , `, i = 1, . . . ,m, form a minimizer of

w := ‖A(1, . . . , 1)T − γ2‖1 (8.7)

subject to

(i,`)∈Rka

(`)i = a

(1)π1(k) + . . .+ a

(m)πm(k) = θ2

k, k = 1, . . . ,m, (8.8)

Page 197: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 181

where A = a(`)i (i,`) (the minimizer exists since the allowed values form a compact set).

For m = 3, for example, we aim at the following, where R3 has been highlighted.

a(1)π1(1) a

(2)π2(3) a

(3)π3(2)

a(1)π1(2) a

(2)π2(1) a

(3)π3(3)

a(1)π1(3) a

(2)π2(2) a

(3)π3(1)

·

111

γ2

1

γ22

γ23

Let further

#≷ := i | a(1)i + . . .+ a

(m)i ≷ γ2

i , i = 1, . . . ,m.

As ‖γ‖2 = ‖θ‖2 by assumption, either #> and #< are both empty or both not empty. Inthe first case, we are finished. Assume therefore there is an (i, j) ∈ #> ×#<. Then there

is an index `1 such that a(`1)i > 0 as well as indices k and `2 such that (i, `1), (j, `2) ∈ Rk.

This is however a contradiction, since replacing a(`1)i ← a

(`1)i − ε and a

(`2)j ← a

(`2)j + ε for

some small enough ε > 0 is valid, but yields a lower minimum w. Hence it already holds

a(1)i + . . .+ a

(m)i = γ2

i , i = 1, . . . ,m. Due to Lemma 8.8, the pair (γ, θ) is feasible.

The entries a(`)i can be found via a linear programming algorithm, since they are given

through the linear constraints in Eq. (8.8) and, as the proof shows, Eq. (8.7) for w = 0.A corresponding core can easily be calculated subsequently, as the proof of Theorem 8.5 isconstructive.In the following section, we address theory that was subject to nearly a century of develop-ment. Fortunately, many results in that area can be transferred — last but not least becauseof the work of A. Knutson and T. Tao and their illustrative theory of honeycombs [65].

8.2.2 Weyl’s Problem and the Horn Conjecture

In 1912, H. Weyl posed a problem [103] that asks for an analysis of the following relation.

Definition 8.11 (Eigenvalues of a sum of two Hermitian matrices [65]). Let λ, µ, ν ∈ Dn.Then the relation

λ µ ∼c ν (8.9)

is defined to hold true iff there exist Hermitian matrices A,B ∈ Cn×n and C := A+B witheigenvalues λ, µ and ν, respectively. This definition is straight forwardly extended to morethan two summands, i.e.

λ(1) . . . λ(m) ∼c ν (8.10)

holds true if there exist Hermitian matrices A(1), . . . , A(m) ∈ Cn×n and C = A(1)+. . .+A(m)

with eigenvalues λ(1), . . . , λ(m) and ν, respectively2.

The relation Eq. (8.9) may equivalently be written as λ µ (−ν) ∼c 0 (cf. [65],Definition 8.2). A result which was discovered much later by Fulton [36], which we want topull forward, states that there is no difference when restricting oneself to real matrices.

Theorem 8.12 (Independency of field [36, Theorem 3]). A triplet (λ, µ, ν) occurs as eigen-

values for an associated triplet of real symmetric matrices A, B, C ∈ Rn×n if and only if itappears as one for Hermitian matrices A,B,C ∈ Cn×n.

2The symbol used in [65] only appears within such relations and hints at the addition of A and B. Itis not related to the product .

Page 198: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

182 8.2. Feasibility of Pairs

Assuming without loss of generality deg(γ) ≤ deg(θ), the condition (cf. Theorem 8.5) for

the feasibility of a pair (γ, θ) for m can now be restated as: there exist a1, . . . , am ∈ Ddeg(γ)≥0

with

a1 . . . am ∼c γ2+

and

(a1, 0, . . .) . . . (am, 0, . . .) ∼c θ2+. (8.11)

The later Theorem 8.23 uses Theorem 8.12 to confirm that the initial choice K ∈ R,C isalso irrelevant regarding the conditions for feasibility.

Weyl and Ky Fan [31] were among the first ones to give necessary, linear inequalities tothe relation Eq. (8.9). We refer to the (survey) article Honeycombs and Sums of Hermi-tian Matrices3 [65] by Knutson and Tao, which has been the main point of reference forthe remaining part and serves as historical survey as well (see also [8]). We use parts oftheir notation as long as we remain within this topic. Therefore, m remains the number ofmatrices (m = 2 in Definition 8.11), but n denotes the size of the Hermitian matrices and ris used as index. A. Horn introduced the famous Horn conjecture in 1962:

Theorem 8.13 ((Verified) Horn conjecture [58]). There is a specific set Tr,n (defined forexample in [8]) of triplets of monotonically increasing r-tuples such that: The relation λ µ ∼c ν is satisfied if and only if for each (i, j, k) ∈ Tr,n, r = 1, . . . , n− 1, the inequality

νk1 + . . .+ νkr ≤ λi1 + . . .+ λir + µj1 + . . .+ µjr (8.12)

holds true, as well as the trace property∑ni=1 λi +

∑ni=1 µi =

∑ni=1 νi.

As already indicated, the conjecture is correct, as proven through the contributions ofKnutson and Tao (cf. Section 8.3) and Klyachko [62]. Fascinatingly, the quite inaccessible,recursively defined set Tr,n can in turn be described by eigenvalue relations themselves, asstated by W. Fulton [36].

Theorem 8.14 (Description of Tr,n [36, 58, 65]). Let 4` := (`r − r, `r−1 − (r − 1), . . . , `2 −2, `1 − 1) ∈ Dr≥0 for any set or tuple ` of r increasing natural numbers. The triplet (i, j, k)of such is in Tr,n if and only if for the corresponding triplet it holds 4i4j ∼c 4k.

Even with just diagonal matrices, one can thereby derive various (possibly all) tripletsin Tr,n. For example, Ky Fan’s inequality [31],

k∑

i=1

νi ≤k∑

i=1

λi +

k∑

i=1

µi,

relates to the simple 0 0 ∼c 0 ∈ Rk, k = 1, . . . , n. Further, Weyl’s inequality [103], forarbitrary i, j ∈ 1, . . . , n,

νi+j−1 ≤ λi + µj (8.13)

is proven through i− 1 j − 1 ∼c i+ j − 2 ∈ R. A further interesting property, as alreadyshown by Horn, is given if Eq. (8.12) holds as equality:

Lemma 8.15 (Splitting [58,65]). Let (i, j, k) ∈ Tr,n and λµ ∼c ν. Further, let ic, jc, kc betheir complementary indices with respect to 1, . . . , n. Then the following statements areequivalent:

3To the best of our knowledge, in Conjecture 1 (Horn conjecture) on page 176 of the AMS publication,the relation ≥ needs to be replaced by ≤. This is a mere typo without any consequences and the authorsare most likely aware of it by now.

Page 199: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 183

• νi1 + . . .+ νir = λi1 + . . .+ λir + µj1 + . . .+ µjr

• Any associated triplet of Hermitian matrices (A,B,C) is block diagonalizable into twoparts, which contain eigenvalues indexed by (i, j, k) and (ic, jc, kc), respectively.

• λ|i µ|j ∼c ν|k• λ|ic µ|jc ∼c ν|kcThe relation is in that sense split into two with respect to the triplet (i, j, k). For example,

given ν2 = λ1 + µ2 (cf. Eq. (8.13) for i = 1, j = 2) then (ν1, ν2, ν3) ∼c (λ1, λ2, λ3) (µ1, µ2, µ3) if and only if (ν1, ν3) ∼c (λ2, λ3) (µ1, µ3).

8.3 Honeycombs and Hives

The following result by Knutson and Tao poses a complete resolution to Weyl’s problem andis based on preceding breakthroughs [52, 62, 64, 66]. This problem has since then also beengeneralized, for example [34,37].

8.3.1 Honeycombs and Eigenvalues of Sums of Hermitian Matrices

Honeycombs, for which the article [65] provides a good understanding, are a central tool inthe verification of the Horn conjecture. They allow graph theory as well as linear program-ming to be applied to Weyl’s problem. A honeycomb h (cf. Fig. 8.1) is a two-dimensionalobject, embedded into

h ⊂ R3∑=0 := x ∈ R3 | x1 + x2 + x3 = 0. (8.14)

It consists of line segments (edges or rays), each parallel to one of the cardinal directions(0, 1,−1) (northwest), (−1, 0, 1) (northeast) or (1,−1, 0) (south), as well as vertices, wherethose join (the point of view from which we plot honeycombs is in the orthogonal direction(1, 1, 1)).

Each segment has exactly one constant coordinate, the collection of which we formallydenote with edge(h) ∈ RN , N = 3

2n(n + 1) (including the boundary rays). Nondegeneraten-honeycombs follow one identical topological structure and are identifiable through linearconstraints: the constant coordinates of three edges meeting at a vertex add up to zero, andevery edge has strictly positive length. This leads to one archetype, as displayed in Fig. 8.1(for n = 3). The involved eigenvalues appear as boundary values (west, east and south)

δ(h) := (w(h), e(h), s(h)) := (λ, µ,−ν) ∈ (Dn)3, (8.15)

i.e. the constant coordinates of the outer rays, as depicted in the plot. The set HONEYn

of all n-honeycombs is identified as the closure of the set of nondegenerate ones, allowingedges of length zero as well. Thereby,

C = edge(h) | h ∈ HONEYn ⊂ RN (8.16)

is a closed, convex, polyhedral cone.

Theorem 8.16 (Relation to honeycombs [65]). The relation λ µ ∼c ν is satisfied if andonly if there exists a honeycomb h with boundary values δ(h) = (λ, µ,−ν).

The set of triplets (λ, µ,−ν) ∈ (Dn)3 | λ µ ∼c ν thus equals

BDRYn := δ(h) | h ∈ HONEYn, (8.17)

which is at the same time the orthogonal projection of the cone C (Eq. (8.16)) to the co-ordinates associated with the boundary (the rays) — and, as shown in its verification, the

Page 200: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

184 8.3. Honeycombs and Hives

λ3= 2 µ1= 6

ν3= -10

λ2= 4 µ2= 4

ν2= -8

λ1= 6 µ3= 2

ν1= -6

λ3= 3.5 µ1= 6.5

ν3= -10

λ2= 4.5 µ2= 5.5

ν2= -9.5

λ1= 6 µ3= 2.5

ν1= -5

Figure 8.1: ([68]Kr19) Left: The archetype of nondegenerate (n = 3)-honeycombs as described in Section 8.3.The rays pointing in directions northwest, northeast and south have constant coordinates w(h)i = λi,e(h)i = µi and s(h)i = −νi, respectively. The remaining line segments contribute to the total edge lengthof the honeycomb. Right: A degenerate honeycomb, where the line segment at the top has been completelycontracted. Here, only eight line segments remain to contribute to the total edge length.

very same cone described by the (in)equalities in Theorem 8.13.

As mentioned, we can assign an underlying graph G = (V,E) to each honeycomb (cf. [65]).For non degenerate ones, the vertices V are given through the meeting points of each threeline segments, i.e. the Y -crossings (possibly upside down). The edges E are simply the con-nections between these vertices. So the graph is identical to the honeycomb without rays,whereas the latter ones may be interpreted as legs4. In case of a degenerate line segment (oflength zero), the two corresponding vertices are removed from the graph. Hence, the linesmeeting at X-crossings do not form a vertex, but are thought to miss each other.

In the right honeycomb in Fig. 8.1, there are hence two vertices and three edges less thanin the left, nondegenerate one. In the special case E = ∅, the relation λ µ ∼c ν is thenrealized by diagonal matrices A,B,C, with permuted entries λ, µ,−ν, respectively. Thesepermutations depend on which rays meet each others. In Fig. 8.2 for example, we have

A = diag(λ2, λ1, λ3), B = diag(µ1, µ3, µ2), C = diag(ν3, ν2, ν1) = A+B

λ = (5, 4, 2.5), µ = (5, 4.5, 3), − ν = (−7,−8,−9)

There is also a related statement implicated by those in Lemma 8.15. If a triplet (i, j, k) ∈

λ3= 2.5 µ1= 5

ν3= -9

λ2= 4

µ2= 4.5

ν2= -8

λ1= 5 µ3= 3

ν1= -7

Figure 8.2: A degenerate honeycomb for which the underlying graph has no edges and only n vertices.

Tr,n yields an equality as in Eq. (8.12), then for the associated honeycomb h, δ(h) =(λ, µ,−ν), it holds

h = h1 ⊗ h2, δ(h1) = (λ|i, µ|j ,−ν|k), δ(h2) = (λ|ic , µ|jc ,−ν|kc), (8.18)

4We later connect such legs to new edges in so called hives, but despite the resemblance, there is noformal relation to those graphs corresponding to tensor node networks.

Page 201: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 185

which means that h is a literal overlay of two smaller honeycombs. Vice versa, if a honeycombis an overlay of two smaller ones, then it yields two separate eigenvalue relations. Howeverthe splitting does not necessarily correspond to a triplet in Tr,n [65]. An example for this isFig. 8.2, where the honeycombs consists of three smaller, overlayed ones, and the eigenvalueequation is diagonalized.

8.3.2 Hives and Feasibility of Pairs

Definition 8.17 (Positive semidefinite honeycomb [68]Kr19). We define a positive semidefi-nite honeycomb h as a honeycomb with boundary values w(h), e(h) ≥ 0 and s(h) ≤ 0.

A honeycomb can connect three matrices. In order to connect m matrices, chains orsystems of honeycombs are put in relation to each other through their boundary values.Although the phrase hive has appeared before as similar object to honeycombs, to whichwe do not intend to refer here, we use it to emphasize that a collection of honeycombs isgiven5. Considerations for simple chains of honeycombs (cf. Lemma 8.21) have also beenmade in [64,66], but we need to rephrase these ideas for our own purposes.

Definition 8.18 (Hives [68]Kr19). Let n,M ∈ N. We define a (pos. semidefinite) (n,M)-hiveH as a collection of M (pos. semidefinite) n-honeycombs h(1), . . . , h(M).

Definition 8.19 (Structure of hives [68]Kr19). Let H be an (n,M)-hive and

B := (i, b) | i = 1, . . . ,M, b ∈ w, e, s.

Further, let ∼S ∈ B × B be an equivalence relation. We say H has structure ∼S if thefollowing holds:Provided (i, b) ∼S (j, p), then if both b and p or neither of them equal s, it holds b(h(i)) =p(h(j)), or otherwise b(h(i)) = −p(h(j)).

We define the hive set HIVEn,M (∼S) as set of all (n,M)-hives H with structure ∼S.

In order to specify a structure ∼S , we will only list generating sets of equivalences (withrespect to reflexivity, symmetry and transitivity).

Definition 8.20 (Boundary map of structured hives [68]Kr19). Let H be an (n,M)-hive withstructure ∼S. Further, let

P := (i, b) | |[(i, b)]∼S | = 1

be the set of singletons. We define the boundary map δP : HIVEn,M (∼S) → (Dn)P to mapany hive H ∈ HIVEn,M (∼S) to the function fP : P → Dn defined via:For all (i, b) ∈ P , if b equals s, it holds fP (i, b) = −b(h(i)), or otherwise fP (i, b) = b(h(i)).

A single n-honeycomb h with boundary values (λ, µ,−ν) can hence be identified as(n, 1)-hive H with trivial structure ∼S generated by the empty set, singleton set P =(1,w), (1, e), (1, s) and boundary δP (H) = (1,w) 7→ λ, (1, e) 7→ µ, (1, s) 7→ ν6. Inthis sense, it holds HONEYn

∼= HIVEn,1(∅) and we regard honeycombs as hives as well.Another example is illustrated in Fig. 8.3, where ∼S is generated by (1, s) ∼S (2,w) and(2, s) ∼S (3,w), such that the singletons are P = (1,w), (1, e), (2, e), (3, e), (3, s).Lemma 8.21 (Eigenvalues of a sums of matrices [68]Kr19). The relationa(1) . . . a(m) ∼c c is satisfied if and only if there exists a hive H of size M = m − 1(cf. Fig. 8.3) with structure ∼S, generated by (i, s) ∼S (i + 1,w), i = 1, . . . ,M − 1, andδP (H) = (1,w) 7→ a(1), (1, e) 7→ a(2), (2, e) 7→ a(3), . . . , (M, e) 7→ a(m), (M, s) 7→ c.

5in absence of further bee related vocabulary6This denotes fP (1,w) = λ, fP (1, e) = µ, fP (1, s) = ν for fP = δP (H).

Page 202: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

186 8.3. Honeycombs and Hives

Proof. “⇒”: ([68]Kr19) The relation a(1) . . . a(m) ∼c c is equivalent to the existence ofHermitian (or real symmetric, cf. Theorem 8.12) matrices A(1), . . . , A(m), C = A(1) + . . .+A(m) with eigenvalues a(1), . . . , a(m), c, respectively. For

A(1,...,k+1) := A(1,...,k) +A(k+1), k = 1, . . . ,m− 1,

with accordant eigenvalues a(1,...,k), the relation can equivalently be restated as

a(1,...,k) a(k+1) ∼c a(1,...,k+1), k = 1, . . . ,m− 1.

This in turn is equivalent to the existence of honeycombs h(1), . . . , h(m−1) with bound-ary values δ(h(1)) = (a(1), a(2),−a(1,2)), δ(h(2)) = (a(1,2), a(3),−a(1,2,3)), . . ., δ(h(m−1)) =(a(1,...,m−1), a(m),−c). This depicts the structure ∼S and boundary function δP (H).

“⇐”: If in reverse the hive H is assumed to exist, then we know, via the single hon-eycombs, that there exist matrices A(1,...,k+1) = A(1,...,k) + A(k+1), k = 1, . . . ,m − 1with corresponding eigenvalues. We however only know that A(1,...,k+1) and A(1,...,k+1)

share eigenvalues. We therefore show, by induction over k, that we can find matricesB(1,...,k+1) = B(1|k) + . . . + B(k+1|k) with eigenvalues a(1,...,k+1), a(1), . . . , a(k+1), respec-tively. The statement is given for k = 1, so let it also be true for some k > 1. Throughdiagonalization we obtain

T−11 A(1,...,k+1)T1 = diag(a(1,...,k)) + T−1

1 A(k+1)T1

T−12 B(1,...,k)T2 = diag(a(1,...,k)) = T−1

2 B(1|k−1)T2 + . . .+ T−12 B(k|k−1)T2

Combining these yields

T−11 A(1,...,k+1)T1 = T−1

2 B(1|k−1)T2 + . . .+ T−12 B(k|k−1)T2 + T−1

1 A(k+1)T1.

We hence define B(1,...,k) = T−11 A(1,...,k+1)T1, as well as B(1|k) = T−1

2 B(1|k−1)T2, . . .,B(k|k) = T−1

2 B(k|k−1)T2 as well as B(k+1|k) = T−11 A(k+1)T1. Since c = a(1,...,m), the last

step k = m− 1 shows a(1) . . . a(m) ∼c c.

h(1) h(2) h(3)

a(1) a(2)

−a(1,2)

a(1,2) a(3)

−a(1,2,3)

a(1,2,3) a(4)

−c

Figure 8.3: ([68]Kr19) The schematic display of an (n, 3)-hive H with structure ∼S as in Lemma 8.21.Northwest, northeast and south rays correspond to the boundary values w(hi), e(hi) and s(hi), respectively.Coupled boundaries are in gray and connected by dashed lines.

If we identify the vertices of the single graphs G(i) of each honeycomb h(i) in a hive H =(h(1), . . . , h(m)) in the same way as we identify their boundary rays with each other, then thisyields the graph GH = (VH , EH) underlying the hive. If in fact EH = ∅, then the matricesA(k), k = 1, . . . ,m, in Lemma 8.21 are again diagonal. The idea behind honeycomb overlays(cf. Eq. (8.18)) can be transferred as well. A special case is as follows.

Lemma 8.22 (Zero eigenvalues [68]Kr19). If the relation a(1) . . .a(m) ∼c c is satisfied for

a(i) ∈ Dn≥0, i = 1, . . . ,m, and cn = 0, then a(1)n = . . . = a

(m)n = 0 and already a(1)|1,...,n−1

. . . a(m)|1,...,n−1 ∼c c|1,...,n−1.

Page 203: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 187

Proof. ([68]Kr19) The first statement follows by basic linear algebra, since a(1), . . . , a(m) arenonnegative. For the second part, Lemma 8.21 and Eq. (8.18) are used. Inductively, in eachhoneycomb of the corresponding hive H, a separate 1-honeycomb with boundary values(0, 0, 0) can be found. Hence, each honeycomb is an overlay of such a 1-honeycomb and an(n− 1)-honeycomb. All remaining (n− 1)-honeycombs then form a new hive with identicalstructure ∼S .

We arrive at an extended version of Theorem 8.5.

Theorem 8.23 (Equivalence to existence of a hive [68]Kr19). Let (γ, θ) ∈ D∞≥0 × D∞≥0

and n ≥ deg(γ),deg(θ). Further, let θ = (θ+, 0, . . . , 0), γ = (γ+, 0, . . . , 0) be n-tuples.The following statements are equivalent, independent of the choice K ∈ R,C:• The pair (γ, θ) is feasible for m ∈ N• There are m pairs of Hermitian, positive semidefinite matrices (A(i), B(i)) ∈Cn×n × Cn×n, each with identical (multiplicities of) eigenvalues, such that A :=∑mi=1A

(i) has eigenvalues θ2 and B :=∑mi=1B

(i) has eigenvalues γ2, respectively.

• There exist a(1), . . . , a(m) ∈ Dn≥0 such that a(1) . . . a(m) ∼c γ2 as well as

a(1) . . . a(m) ∼c θ2.

• There exists a positive semidefinite (n,M)-hive H of size M = 2(m − 1) (cf.Fig. 8.4) with structure ∼S, where (i+u, s) ∼S (i+ 1 +u,w), i = 1, . . . ,M/2−1,u ∈ 0,M/2, as well as (1,w) ∼S (1 + M/2,w) and (i, e) ∼S (i + M/2, e),

i = 1, . . . ,M . Further, δP (H) = (M/2, s) 7→ γ2, (M, s) 7→ θ2.

Proof. ([68]Kr19) The existence of matrices with actual size deg(γ), deg(θ), respectively,follows by repeated application of Lemma 8.22. The hive essentially consists of two rowsof honeycombs as in Lemma 8.21. Therefore, the same argumentation holds, but instead ofprescribed boundary values a(i), these values are coupled between the two hive parts. Dueto Theorem 8.12, there is no difference whether we consider real or complex matrices andtensors.

h(1) h(2) h(3)

a(1) a(2)

−a(1,2)up

a(1,2)up a(3)

−a(1,2,3)up

a(1,2,3)up a(4)

−γ2

h(4) h(5) h(6)

a(1) a(2)

−a(1,2)low

a(1,2)low a(3)

−a(1,2,3)low

a(1,2,3)low a(4)

−θ2

Figure 8.4: ([68]Kr19) The schematic display of an (n, 6)-hive H (upper part in blue, lower part in magenta)with structure ∼S as in Lemma 8.21. Northwest, northeast and south rays correspond to the boundary valuesw(hi), e(hi) and s(hi), respectively. Coupled boundaries are in gray and connected by dashed lines.

The feasibility of (γ, θ) as in Eq. (8.6) is provided by the hive in Fig. 8.5. Even though notdiagonally feasible, the pair can be disassembled, as later shown in Section 8.4.3, into mul-tiple, diagonally feasible pairs, which then as well prove its feasibility. As another example

Page 204: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

188 8.3. Honeycombs and Hives

a4 b1

-γ1

a3

b2

-γ2

a2

b3

-γ3

a1 b4

-γ4

-θ1

-θ2

-θ3

-θ4

Figure 8.5: ([68]Kr19) A (4, 2)-hive consisting of two coupled honeycombs (blue for γ, magenta for θ), whichare slightly shifted for better visibility, generated by Algorithm 16. Note that some lines have multiplicity 2.The coupled boundary values are given by a = (4, 1.5, 0, 0) and b = (3.5, 3.5, 0, 0). It proves the feasibility of

the pair (γ, θ), γ2 = (7.5, 5, 0, 0), θ2 = (6, 3.5, 2, 1) for m = 2, since γ2, θ2 ∼c a b (the exponent 2 has beenskipped for better readability). Only due to the short, vertical line segment in the middle, the hive does notprovide diagonal feasibility.

serves γ2+ = (10, 2, 1, 0.25, 0.25) and θ2

+ = (4, 3, 2.5, 2, 2). According to Eq. (7.9), the pair(γ, θ) is not feasible for m = 2, 3, but may be feasible for m = 4. The hive in Figs. 8.6 and 8.7(having been constructed with Algorithm 16) provides that this is indeed the case. We fur-ther know that the pair is diagonally feasible for m = 5 (due the constructive Theorem 8.10).

a5 b1

a4

b2

a3

b3

a2

b4a1b5

c5

c4

c3

c2

c1

d1

-γ1

d2

-γ2

d3

-γ3

d4

-γ4

d5

-γ5

-θ1

-θ2-θ3-θ4-θ5

Figure 8.6: ([68]Kr19) A (5, 4)-hive consisting of six coupled honeycombs (blue for γ, magenta for θ), whichare slightly shifted for better visibility, generated by Algorithm 16. Note that some lines have multiplicitylarger than 1. Also, in each second pair of honeycombs, the roles of boundaries λ and µ have been switched(which we can do due to the symmetry regarding ), such that the honeycombs can be combined to a singlediagram as in Fig. 8.7. This means that the south rays of an odd-numbered pair are always connectedto the north-east (instead of north-west) rays of the consecutive pair. The boundary values are given bya = (2, 0.25, 0.25, 0, 0), b = (1, 1, 0.25, 0, 0), c = (4, 0, 0, 0, 0) and d = (3, 1, 0.75, 0, 0). It proves the feasibility

of the pair (γ, θ), γ2 = (10, 2, 1, 0.25, 0.25), θ2 = (4, 3, 2.5, 2, 2) for m = 4, since both γ2, θ2 ∼c a b c d(the exponent 2 has been skipped for better readability).

8.3.3 Hives are Polyhedral Cones

As previously done for honeycombs, we also associate hives with certain vector spaces.

Definition 8.24 (Hive sets and edge image [68]Kr19). Let H be an (n,M)-hive consisting ofhoneycombs h(1), . . . , h(M). We define

edge(H) = (edge(h(1)), . . . , edge(h(M))) ∈ RN×M

as the collection of constant coordinates of all edges appearing in the honeycombs within thehive H. Although defined via the abstract set B (in Definition 8.18), we let ∼S act on therelated edge coordinates as well. For H ∈ HIVEn,M (∼S), we then define the edge image

as edgeS(H) ∈ RN×M/∼S ∼= RN∗ , in which coupled boundaries are assigned the same

coordinate.

Page 205: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 189

a5

b1

a4

b2

a3

b3

a2

b4

a1

b5

c5c4c3c2

c1

d1

-γ1

d2

-γ2

d3

-γ3

d4

-γ4

d5

-γ5

-θ1

-θ2-θ3-θ4-θ5

Figure 8.7: ([68]Kr19) The three overlayed honeycomb pairs in Fig. 8.6 put together with respect to theircoupling (the exponent 2 for γ, θ has been skipped for better readability).

Theorem 8.25 (Hive sets are described by polyhedral cones [68]Kr19).

• The hive set HIVEn,M (∼S), is a closed, convex, polyhedral cone, i.e. there exist ma-trices L1, L2 s.t. edgeS(HIVEn,M (∼S)) = x | L1x ≤ 0, L2x = 0.• Each fiber of δP (i.e. a set of hives with structure ∼S and boundary fP ), forms

a closed, convex polyhedron, i.e. there exist matrices L1, L2, L3 and a vector b s.t.edgeS(δ−1

P (fP )) = x | L1x ≤ 0, L2x = 0, L3x = b.

Proof. Each honeycomb of a hive follows its linear constraints. The hive structure andidentification of coordinates as one and the same by ∼S only imposes additional linearconstraints. The rest is elementary geometry.

Corollary 8.26 (Boundary of hives [68]Kr19). The boundary set

BDRYn,M (∼S) := image(fP ) ∈ (Dn)P | fP = δP (H), H ∈ HIVEn,M (∼S)

forms a closed, convex, polyhedral cone. This hence also holds for any intersection with, orprojection to a lower-dimensional subspace.

Proof. The boundary set is given by the projection of edgeS(HIVEn,M (∼S)) to the subsetof coordinates associated to those in P . The proof is finished, since projections to fewercoordinates of closed, convex, polyhedral cones are again such cones. The same holds forintersections with subspaces.

8.4 Cones of Squared Feasible Values

The following fact has already been established in [18] and is partially a special case ofTheorem 7.13, but now also follows from the previous Corollary 8.26.

Corollary 8.27 (Squared feasible pairs form cones [68]Kr19). Let m, r1, r2 ∈ N. The setof squared feasible pairs F2

m,(r1,r2) (cf. Definition 8.9) is a closed, convex, polyhedral cone,

embedded into Rr1+r2 . If r1 ≤ mr2 and r2 ≤ mr1, then its dimension is r1 + r2 − 1.Otherwise, F2

n,(r1,r2) ∩ Dr1>0 ×Dr2>0 is empty.

Proof. ([68]Kr19) By Corollary 8.26 and Theorem 8.23 it directly follows that F2m,(r1,r2) is a

closed, convex, polyhedral cone. For the first case, it only remains to show that the cone has

Page 206: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

190 8.4. Cones of Squared Feasible Values

dimension r1+r2−1, or equivalently, it contains as many linearly independent vectors. Theseare however already given by the examples carried out in Lemma 8.8. From Corollary 8.4,it directly follows that if (γ, θ) is feasible for m, then it must hold deg(γ) ≤ m deg(θ) anddeg(θ) ≤ m deg(γ), which provides the second case.

The implication for the original TT-feasibility then is:

Corollary 8.28 (Cone property for higher-order tensors [68]Kr19). For d ∈ N, let both σ, τ ∈(D∞≥0)d−1 be TT-feasible for n ∈ Nd (in the sense of Definition 8.1). Then υ, (υ(µ))2 :=

(σ(µ))2 + (τ (µ))2, µ = 1, . . . , d− 1, is TT-feasible for n as well.

More general, squared feasible TT-singular values form a closed, convex, polyhedral cone.Its H-description is the collection of linear constraints for the pairs (σ(µ−1), σ(µ)).

Proof. Due to Corollary 8.3, it only remains to show that each pair (υ(µ−1), υ(µ)) is feasiblefor nµ, µ = 1, . . . , d. For each single µ, this follows directly from Corollary 8.27.

8.4.1 Necessary Inequalities

While for each specific m and r1, the results in [18] allow to calculate the H-description of thecone F2

m,(r1,mr1) (i.e. a set of necessary and sufficient inequalities), we will concern ourselveswith possibly weaker, but generalized statements for arbitrary m ∈ N in this section. In thesubsequent Section 8.4.3, we derive a V -description of F2

m,(m,m2) (i.e. a set of generating

vertices).

Lemma 8.29 ([68]Kr19). For n,m ∈ N, let T (j), I(j) ⊂ 1, . . . , n be sets of equal cardinality,j = 1, . . . ,m, with T (1) = I(1) and ∆T (j) ∼c ∆T (j−1) ∆I(j) (cf. Theorem 8.14) forj = 2, . . . ,m. Then, provided ζ ∼c a(1) . . . a(m), the inequality

i∈T (m)

ζi ≤m∑

j=1

i∈I(j)a

(j)i (8.19)

holds true, for every a(j), ζ ∈ Dn, j = 1, . . . ,m. If Eq. (8.19) holds as equality, then alreadyζ|T (m) ∼c a(1)|I(1) . . . a(m)|I(m) and ζ|(T (m))c ∼c a(1)|(I(1))c . . . a(m)|(I(m))c . (cf.Lemma 8.15).

Proof. ([68]Kr19) The statement Eq. (8.19) follows inductively, if for each j = 2, . . . ,m,∑

i∈T (j)

νi ≤∑

i∈T (j−1)

λi +∑

i∈I(j)µi (8.20)

is true whenever ν ∼c λ µ. By Theorem 8.14, this holds since by assumption ∆T (j) ∼c∆T (j−1)∆I(j) for j = 2, . . . ,m. If Eq. (8.19) holds as equality, then all single inequalitiesEq. (8.20) must hold as equality, and hence Lemma 8.15 can be applied inductively aswell.

Theorem 8.30 ([68]Kr19). In the situation of Lemma 8.29, let T and I fulfill the same

assumptions as T and I. Further, let I(j) ∩ I(j) = ∅, j = 1, . . . ,m. If the pair (γ, θ) ∈D∞≥0 ×D∞≥0 is feasible for m, then

i∈T (m)

γ2i ≤

i∈1,...,deg(θ)\T (m)

θ2i (8.21)

must hold true. If Eq. (8.21) holds as equality, then ((γ|T (m) , 0, . . .), (θ|(T (m))c , 0, . . .))

Page 207: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 191

and ((γ|(T (m))c , 0, . . .), (θ|T (m) , 0, . . .)) are already feasible.

Together with Eq. (8.18) this also implies that the corresponding hive is an overlay of twosmaller hives modulo zero boundaries.

Proof. ([68]Kr19) Let n ≥ max(T (m)),deg(γ),deg(θ). As (γ, θ) is feasible, due to Lemma 8.29,the inequality Eq. (8.19) holds for some joint eigenvalues a(1), . . . , a(m) ∈ Dn≥0 for both

ζ := γ2 = (γ21 , . . . , γ

2n), T , I and ζ = θ2 := (θ2

1, . . . , θ2n), T , I. Furthermore, we have∑n

i=1 θ2i =

∑ni=1 a

(1)i + . . . +

∑ni=1 a

(m)i . Subtracting Eq. (8.19) for ζ from this equality

yields

i/∈T (m)

θ2i

n≥deg(θ)=

i∈1,...,n\T (m)

θ2i

Eq. (8.19) for ζ≥

m∑

j=1

i∈1,...,n\I(j)a

(j)i (8.22)

a(j)i ≥0

≥m∑

j=1

i∈I(j)a

(j)i

Eq. (8.19) for ζ≥

i∈T (m)

γ2i . (8.23)

This finishes the first part. In case of an equality, since the second “≥” must hold as equality,we have a(j)|1,...,n\I(j) = (a(j)|I(j) , 0, . . .) and a(j)|1,...,n\I(j) = (a(j)|I(j) , 0, . . .) for each

j = 1, . . . ,m. Furthermore, the first and third “≥” in Eq. (8.23) must hold as equality aswell. Hence, the latter statement in Lemma 8.29 can be applied to the inequalities Eq. (8.19)

for both ζ and ζ, such that we can conclude the latter statement in this corollary.

Corollary 8.31 (A set of inequalities for feasible pairs [68]Kr19). Let p(1) ∪ p(2) = N be twodisjoint sets, with p(1) finite of size r. If (γ, θ) ∈ D∞≥0 × D∞≥0 is feasible for m ∈ N, then it

holds that (p(u)i being the i-th smallest element)

i∈P (1)m

γ2i ≤

i/∈P (2)m

θ2i , P (u)

m := m(p(u)i − i) + i | i = 1, 2, . . ., u = 1, 2.

Proof. ([68]Kr19) Let n ≥ max(P(1)m ),deg(γ),deg(θ). Further, let P

(2)j contain the k smallest

elements of P(2)j , where k is the number of elements in P

(2)m ∩1, . . . , n, and let P

(1)j = P

(1)j ,

j = 1, . . . ,m. Thereby P(1)1 = p(1) = P

(1)1 and P

(2)1 ⊂ p(2) = P

(2)1 . We have the following

(diagonal) matrix identities

diag(P(u)j )− diag(1, . . . , `) = diag(P

(u)j−1) + diag(P

(u)1 )− 2 diag(1, . . . , `)

⇔ j(p(u)i − i) = (j − 1)(p

(u)i − i) + (p

(u)i − i), i = 1, . . . , `, ` = |P (u)

j |

where the diagonal elements are placed in ascending order. Hence,4P (u)j ∼c 4P (u)

j−14P(u)1

for j = 2, . . . ,m, u ∈ 1, 2. For T (j) := P(1)j , I(j) := P

(1)1 and T (j) := P

(2)j , I(j) := P

(2)1 ,

we can apply Theorem 8.30 to obtain the desired statement.

Among the various inequalities contained in Corollary 8.31, the following two correspondto early mentioned inequalities for Weyl’s problem. The first case is Eq. (7.9) and is alsoreferred to as the basic inequalities in [18].

Corollary 8.32 (Ky Fan analogue for feasible pairs [68]Kr19). The choice a(1) = 1, . . . , rin Corollary 8.31 yields the inequality

r∑

i=1

γ2i ≤

mr∑

i=1

θ2i . (8.24)

Page 208: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

192 8.4. Cones of Squared Feasible Values

Corollary 8.33 (Weyl analogue for feasible pairs [68]Kr19). The choice a(1) = r + 1 inCorollary 8.31 yields the inequality

γ2rm+1 ≤

r+m∑

i=r+1

θ2i . (8.25)

The QMP article [18] explicitly provides the derivation for the case deg(γ) ≤ 3 andm = 2. Thereby, the necessary (and sufficient) inequalities for the feasibility of (γ, θ), apartfrom the trace property, are as follows: Corollary 8.32 for r = 1, 2; Corollary 8.33 for r = 1and

γ22 + γ2

3 ≤ θ21 + θ2

2 + θ23 + θ2

6, (8.26)

which corresponds to Eq. (7.8). The last inequality is not included in Corollary 8.31, butcan be derived from Theorem 8.30 and be generalized in different ways. For example, forI(1) = I(2) = 1, 3, T (2) = 2, 3, I(1) = I(2) = 2, 4, 5, 6, . . ., T (2) = 4, 5, 7, 8, . . . and

I(j) = 1, 2, T (j) = 2, 3, I(j) = 3, 4, 5, 6, . . ., T (j) = 2j, 2j + 1, 2j + 3, 2j + 4, . . .,j = 3, . . . ,m, (where we add the same amount of arbitrarily many consecutive numbers in

I(j) and T (j)) one can conclude that

γ22 + γ2

3 ≤2m−1∑

i=1

θ2i + θ2

2m+2 (8.27)

whenever (γ, θ) is feasible for m. Theorem 8.30 does however not provide when this gener-alized inequality is redundant to other necessary ones.

The right sum in Corollary 8.31 has always m times as many summands as the left sum.

For these inequalities, it further holds∑i/∈P (2)

mi−∑

i∈P (1)mi =

∑mki=k+1 i = k(m−1)((m+1)k+1)

2 ,

where k = |P (1)m |. While this difference in summed indices also holds for Eq. (8.27), we can

however only conjecture that this holds in general for every inequality in the H-descriptionof F2

m,(r1,mr1).

8.4.2 Rates of Exponential Decay

In practice, of particular interest is the relation between different rates of exponential decayof the different tuples of singular values that appear within one tensor format. For eachneighboring pair (γ, θ), each inequality of Corollary 8.31 yields a further bound on theserates. We here content ourselves with the most restrictive one:

Lemma 8.34 (Exponential decay). Let v, w > 0. Further, let (γ, θ) be given by

γi = ae−iv, θi = be−iw, i ∈ N,where a, b are such that ‖γ‖ = ‖θ‖. Then if Eq. (8.24) holds true for any one r ∈ N, then

v ∈ [w

m,wm], w ∈ [

v

m, vm].

The rate of decay is hence at most perturbed by the factor m.

Proof. The Ky Fan inequality Eq. (8.24) translates into

a

r∑

i=1

e−2iv ≤ bmr∑

i=1

e−2iw.

A short calculation shows that this inequality is equivalent to v ≤ mw. The other boundsfollow due to reflexivity of feasibility.

Page 209: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 193

We have not called (γ, θ) feasible only for formal reasons, since both have infinitely manynonzero entries, but the relation approximately holds true for any finite version. Alreadyfor a moderate mode size m, it shows how weak the bounds on exponential decay are whichcan be derived from the Ky Fan inequalities, considering that these become even weaker thefurther away two singular value tuples are within the tensor train format.

8.4.3 Vertex Description of F2m,(m,m2)

We revisit the special case Eq. (7.9) and derive the vertex description of the correspondingcone F2

m,(m,m2) (cf. Definition 8.9). In this section, for a, b ∈ N ∪ 0, let therefor

(a#b) := (a, . . . , a) ∈ Db≥0 (length b).

Lemma 8.35 ([68]Kr19). Let α, β,m ∈ N, β ≤ m, α ≤ βm, γ2+ = (α#β) and θ2

+ = (β#α).Then (γ, θ) is feasible for m.

Proof. ([68]Kr19) We prove by induction over m. Without loss of generality, we may assumeα > β by which α = kβ + t for unique natural numbers k < m, t < β. ConsideringRemark 8.6, it suffices to show that for

γ2 := γ2+ − (t, β#β−1) = (kβ, (α− β)#β−1),

θ2 := θ2+ − (0, . . . , 0, t, β#β−1) = (β#α−β , β − t, 0#β−1),

the pair ((γ, 0, . . .), (θ, 0, . . .)) is feasible for m − 1. In order to show this, we split γ =

(γ(1), γ(2)), θ = (θ(1), θ(2)) into two pairs

γ2(1) := (kβ), γ2

(2) := (β#k),

θ2(1) := ((α− β)#β−1, 0, . . . , 0), θ2

(2) := (β#v, β − t, 0, . . . , 0),

where v = α−β−k = (k−1)(β−1)+(t−1). We can then, considering overlays of honeycombs,

treat both pairs independently. While ((γ(1), 0, . . .), (θ(1), 0, . . .)) is feasible for k ≤ m − 1,

in the second case, (γ2(2), θ

2(2)) is a convex combination of ((v + 1)#β−1), ((β − 1)#v+1) and

(v#β−1), ((β − 1)#v). Since β − 1 ≤ m − 1 and v ≤ v + 1 ≤ (m − 1)(β − 1), the proof isfinished by induction.

The following theorem has previously been conjectured by [18] and proven by [72]. Weprove it in a way which allows to identify all vertices as in Corollary 8.37.

Theorem 8.36 ([68]Kr19, [72]). Let (γ, θ) ∈ D∞≥0 × D∞≥0 and m ∈ N. If deg(γ) ≤ m and ifall Ky Fan inequalities (Corollary 8.32) as well as the trace property ‖γ‖2 = ‖θ‖2 hold, thenthe pair is feasible for m.

Proof. ([68]Kr19) Here, we denote the Ky Fan inequality (Corollary 8.32) for r with Kr, and incase of an equality we say Er holds. Due to Km, deg(γ) ≤ m and the trace property, Em anddeg(θ) ≤ m deg(γ) must be true. For fixed m, we prove by induction over deg(γ) + deg(θ).Let 0 ≤ k < m be the largest number for which Ek is fulfilled and let α = deg(θ) −mk aswell as β = deg(γ)− k. We define

(γ2 | θ2) := (γ2+ | θ2

+)− f · (mβ#k, α#β | β#mk, β#α), f > 0.

Then Ek and Kj , j < k, are true for (γ | θ) for all f > 0. Further, as long as Kk+1 holds for

(γ | θ) (which it does for any f > 0 if k = m− 1), then due to Kk−1 and Ek it follows that

γk+1 ≤ γk. Hence, f can be chosen such that Ki, i = 1, . . . ,m− 1, and (γ | θ) ∈ Dβ≥0×Dα≥0

Page 210: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

194 8.4. Cones of Squared Feasible Values

as well as either (i) Ej for at least one j, k < j < m, or (ii) γβ = 0 ∨ θα = 0. In case of (i),we can repeat the above construction for increased k until k = m−1 and hence (ii) remainsthe sole option. In that case, we are finished by induction.

Corollary 8.37 (Vertex description, [68]Kr19). A complete vertex description of Fm,(m,m2)

is given by

V = (γ, θ) ∈ Fm,(m,m2) | γ2+ = (mβ#k, α#β), θ2

+ = (β#mk, β#α),

k ∈ 0, . . . ,m− β, α, β ∈ N, β ≤ m, α ≤ βm; and k = 0 if α = βm

A short calculation shows that the number of vertices |V| is given by a polynomial withleading monomial m4/6.

Proof. ([68]Kr19) The proof of Theorem 8.36 is constructive and decomposes a squared fea-sible pair into a convex combination of squared feasible pairs in V. It hence remains toshow that the elements of V are vertices. Given any two elements v = v(k1, α1, β1),

w = w(k2, α2, β2), v2, w2 ∈ V, let y2f = v2 − f · w2, f > 0. For yf ∈ Dm≥0 × Dm

2

≥0 to betrue, we must have mk1 + α1 = mk2 + α2 as well as either (i) k1 = k2 and β1 = β2 or (ii)k2 = 0 and β1 + k1 = β2. In the second case, yf would violate Kk1 if k1 6= 0. If y2

f is again aconvex combination of elements in V, yf must be feasible. Due to the above, it then howeverfollows that v = w, y2 = (1 − f)v2. In other words, v2 cannot be a convex combination ofother elements in V.

For example, all 7 vertices v21 , . . . , v

27 of F2

2,(2,4) (m = 2) are given through

[1 10 0

00

],

[2 10 1

00

],

[1 21 0

00

],

[1 11 1

00

],

[2 11 1

10

],

[3 23 2

20

]and

[2 12 1

11

].

For m = 3, we already have 27 vertices. Although all these vertices happen to be diagonallyfeasible, this is not the case in general. For example, (5#3 | 3#5, 0#4) ∈ F2

3,(3,9) is a vertex,

but it is not diagonally feasible. Assume therefore, similarly as for Eq. (8.6), that there

are a(1)i + a

(2)i = θ2

i = 5, i = 1, . . . , 3. Due to the additional conditions regarding γ2i = 3,

i = 1, . . . , 5, we have 2 ≤ a(j)i ≤ 3 for all i = 1, 2, 3, j = 1, 2. Now since there cannot be two

values a(1)i1

+ a(2)i2

= 3, the pair cannot be diagonally feasible for m = 2.

For (γ, θ) as in Eq. (8.6), γ2+ = (7.5, 5), θ2

+ = (6, 3.5, 2, 1), we have (γ+, θ+) = 1.5v31 +

0.5v32 + 1.5v2

4 + v25 + v2

7 .

8.4.4 A Conjecture about F2m,(m2+m−2,m2+m−2)

Considering the way Lemma 8.35 is proven, there is reason for the following conjecture.

Conjecture 8.38. Let r1, r2,m ∈ N be values such that for all α ∈ 1, . . . , r1,β ∈ 1, . . . , r2, the pair (γ, θ), for γ2

+ = α#β, θ2+ = β#α, is feasible for m. Then

the complete H-description of F2m,(r1,r2) is given by the inequalities in Corollary 8.31

(including those with interchanged roles of γ and θ), as well as the trace property‖γ‖2 = ‖θ‖2.

The concept for a possible proof is similar to the one of Lemma 8.35, where the assertionsK are generalized to all inequalities in Corollary 8.31. Each time one of these holds asequality, we know that the pair can be split in two. What hence remains to be shown is thatthese parts inherit all necessary conditions, yet this part appears very technical and for now

Page 211: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 195

remains a conjecture.

For r1 = m and r2 = m2, the above conjecture holds true as it then equals Theorem 8.36,while all inequalities in Corollary 8.31 but the Ky Fan inequalities are redundant due tothe trace property. The assumption in Conjecture 8.38 is not fulfilled for all values r1, r2

and m. Using Algorithm 16, we can calculate the matrix M ∈ Nc×c for each (in practicemoderately large) c ∈ N, defined through

Mα,β = argminm∈N

(γ, θ) is feasible for m, γ2+ = α#β , θ2

+ = β#α.

As M is symmetric, we can restrict ourselves to α ≤ β. Furthermore, for every such pair(α, β) it must hold that Mα,β ≥ dβαe. So we only list the anomalous entries in which thisdoes not hold as equality.

Mr1,r2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

123 34 3 45 3 3 3 4 56 3 3 47 3 3 3 3 3 4 48 3 3 3 39 3 3 3 3 3 310 3 3 3 3 3 311 3 3 3 3 3 3 3 312 3 3 313 3 3 3 3 3 314 3 3 3 315 3 316 317 3 3181920

Table 8.1: Entries for which Mα,β > d βα e. The three shades of blue indicate d βαe ∈ 2, 3, 4, respectively.

From Table 8.1 we can observe that Mα,β ≤ 1 + dβαe, at least for α, β ≤ c = 20.Based on these numerical tests, it is further tempting to replace the assumption on r1, r2

in Conjecture 8.38 with just r1 = r2 = m2 + m − 2. We have seen that the nonredundantinequality Eq. (8.26) for r1 = 3, r2 = 6, m = 2, is not contained in Corollary 8.31, andin fact, for α = 3, β = 5, the entry Mα,β is larger than 2. So (r1, r2) = (3, 6) does notfulfill the assumption of Conjecture 8.38 for m = 2. We can also for example read of that(r1, r2) = (10, 20) requires m ≥ 5.

8.5 Practical Considerations

Several methods exist in order to prove feasibility, or at least rule it out, some of whichare constructive, and some are not. Based on the previous discussion, Algorithm 16 inSection 8.5.1 summarizes how hives can be used in order to determine feasibility. Thesubsequent Section 8.5.2 concludes with an overview over this and other methods. Matlabimplementations of algorithms mentioned in this chapter can be found under the nameTT-feasibility-toolbox or directly at

https://git.rwth-aachen.de/sebastian.kraemer1/TT-feasibility-toolbox.

Page 212: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

196 8.5. Practical Considerations

8.5.1 Linear Programming Algorithm for Feasibility Based on Hives

The description in Theorem 8.25 yields the straightforward Algorithm 16 to determine theminimal value m for which some pair (γ, θ) ∈ D∞≥0 ×D∞≥0 is feasible.

The objective function Fx, which the linear programming algorithm minimizes, is thesummed up length of all edges. However, while for exactly that reason edges and ver-tices in the underlying graph will disappear (cf. Section 8.3.1), the linear objective functiondoes not take the change of that graph into account. In many cases the algorithm nonethe-less returns a hive from which diagonal feasibility can be read off (cf. Lemma 8.8), but notnecessarily in any case in which it may be possible. Algorithm 16 always terminates for at

Algorithm 16 Linear programming check for feasibility [68]Kr19

Input: (γ, θ) ∈ Dr≥0 ×Dr≥0 with ‖γ‖2 = ‖θ‖2 for some r ∈ NOutput: minimal number m ∈ N for which (γ, θ) is feasible and a corresponding (r, 2(m−

1))-hive H with minimal total edge length

1: procedure feaslpc(γ, θ)2: for m = 2 . . . do3: as in Theorem 8.25, set L such that

edgeS(δ−1P (fP )) = x | L1x ≤ 0, L2x = 0, L3x = b

for the hive H as in Theorem 8.23

4: use a linear programming algorithm to minimize Fx subject to x ∈edgeS(δ−1

P (fP )), where F is the vector for which Fx ∈ R≥0 is the summed uplength of all (inner) edges in H

5: if no solution exists then6: continue with m+ 17: else8: return number m and hive H9: end if

10: end for11: end procedure

most m = max(deg(γ),deg(θ)) due to Lemma 8.8. In practice, a slightly different couplingof boundaries is used (cf. Fig. 8.6), since then the entire hive can then be visualized in R2.Therefor, it is required to rotate and mirror some of the honeycombs (cf. Fig. 8.7). Depend-ing on the linear programming algorithm, the input may however be too badly conditionedto allow a verification with satisfying residual.

8.5.2 Recapitulation of Feasibility Methods

In the following, we summarize which TT-related methods mentioned in Chapter 7, as wellas which main results of this chapter, allow to prove or at least rule out feasibility, and whichare (additionally) constructive. We therefore consider the feasibility of a pair (γ, θ) for anintermediate mode size m ∈ N (cf. Definition 8.7), as well as the construction of a core Nas in Corollary 8.4. Underlined checkmarks denote a favorable choice within each category.

Page 213: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

8. Honeycombs and Feasibility of Singular Values in the TT Format 197

prove feas. rule out feas. construct core

derive H-description (if manageable) [18] X X

classes of necessary ineq. (Cor. 8.31) X

linear progr. hive (if well conditioned, Sec. 8.5.1) X X if diagonal: X

Eq. (8.24) and (deg(γ) ≤ m or deg(θ) ≤ m) [72] X X

deg(γ) ≤ m and deg(θ) ≤ m (Lem. 8.8) X X

iteratively enforce sv (if convergent, Sec. 7.4.2) X X

First of all, if deg(γ) ≤ m and deg(θ) ≤ m (Lemma 8.8) holds true, then the requiredcore N can be constructed easily. If (only) one of either conditions is fulfilled, then [72] pro-vides that feasibility is equivalent to Eq. (8.24), but the proof (to the best of our knowledge)is not constructive. Further, the classes of necessary inequalities in Corollary 8.31 can beused in order to check whether feasibility can be ruled out.

If it is possible to derive the H-description of the specific cone F2m,(deg(γ),deg(θ)), then the

result can naturally be used to prove or, otherwise, rule out feasibility in an instance. Thelinear programming method utilizing hives as in Section 8.5.1 in theory gives a definite an-swer. However, the conditioning of the problem, or its size, may not allow to apply it. If,on the other hand, one is lucky, and the algorithm outputs a hive that is diagonal, then acorresponding core can be read off (cf. Section 8.3.1). In any case, the approach to somehowiteratively enforce the singular values (cf. Section 7.4.2) may yield that the pair is feasible atleast up to a certain numerical tolerance, since N has explicitly been constructed. If it doesnot converge, no judgment can be made about feasibility, other than that it might be unlikely.

Consider that certain TT-singular values σ = σTT = (σ(1), . . . , σ(d−1)) and mode sizesn = (n1, . . . , nd) ∈ Nd are given. Independent of the preferred method, each neighboringpair (γ, θ) = (σ(µ−1), σ(µ)), m = nµ for µ = 2, . . . , d− 1, may be handled in parallel in orderto construct a TT-tree SVD (cf. Eq. (3.24))

A = A(α) = G1 Σ(1) G2 . . . Σ(d−1) Gd ∈ Kn1×...×nd

whereby svTT(A) = σ (similar to Algorithm 15). While G1 and Gd can be chosen as anyunitary nodes (cf. Remark 7.12), any other Gµ, µ = 2, . . . , d− 1, can either be constructeddirectly (cf. Section 8.1) or extracted as part of a three-dimensional TT-tree SVD

Tµ = Tµ(β′µ−1, αµ, β′µ) = U1 Σ(µ−1) Gµ Σ(µ) V2 ∈ Kdeg(σ(µ−1))×nµ×deg(σ(µ))

where svTT(Tµ) = (σ(µ−1), σ(µ)) (cf. Corollary 8.4 and Lemma 7.10). Here, the nodeU1 = U1(β′µ−1, βµ−1) is βµ−1-unitary and V2 = V2(β′µ, βµ) is βµ-unitary. As we have seen,the computations can also be restricted to K = R (cf. Theorem 8.12) as the feasibilityconstraints are independent of the chosen field.

Page 214: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.
Page 215: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

Conclusions

Tree tensor networks are a key element of high-dimensional methods. In Part I, we derived an arithmetic on tensors, arising from two elementary concepts, that naturally realizes the simplicity behind their graphical interpretation and that can be traced back to a commutative and associative product. Further, through the introduced labeling, the extension of univariate linear functions, as well as of other operations, to tensor product Hilbert spaces becomes easier to retrace. Not least because of the consistent and formally established visualization, this allows one to easily identify relations between maps such as the partial trace and partial diagonal operations, even within intricate products of tensors.

The introduced calculus allows well-known statements to be reinterpreted in more general, yet comprehensible settings. Most notable in this sense is the subsequently presented tree SVD, as it summarizes normal forms originating from different tensor formats; it has served as one of the main tools within the subsequent theoretical and practical considerations. The publicly available tensor node toolbox realizes the introduced arithmetic: it automatically derives and memorizes the inherent network structures within collections of labeled tensors, whereby the pseudocodes of the corresponding algorithms are nearly identical to their actual implementations.
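
To convey the flavor of this arithmetic, the following is a minimal sketch only (it is not the actual tensor node toolbox, and all names are hypothetical): tensors carry one label per mode, and the product contracts exactly those modes that share a label, so that the network structure is derived from the labeling alone.

    import numpy as np

    class Node:
        """A tensor together with one label per mode (a crude stand-in for a
        labeled tensor node)."""
        def __init__(self, data, labels):
            self.data, self.labels = np.asarray(data), list(labels)

    def product(a, b):
        """Contract all modes of a and b that share a label; the remaining
        modes keep their labels, so chains of products realize a network."""
        shared = [l for l in a.labels if l in b.labels]
        axes_a = [a.labels.index(l) for l in shared]
        axes_b = [b.labels.index(l) for l in shared]
        data = np.tensordot(a.data, b.data, axes=(axes_a, axes_b))
        labels = [l for l in a.labels if l not in shared] \
               + [l for l in b.labels if l not in shared]
        return Node(data, labels)

    # Toy network G1(alpha1, beta1) * G2(beta1, alpha2, beta2) * G3(beta2, alpha3):
    rng = np.random.default_rng(0)
    G1 = Node(rng.standard_normal((2, 3)), ["alpha1", "beta1"])
    G2 = Node(rng.standard_normal((3, 4, 5)), ["beta1", "alpha2", "beta2"])
    G3 = Node(rng.standard_normal((5, 6)), ["beta2", "alpha3"])
    A = product(product(G1, G2), G3)   # remaining labels: alpha1, alpha2, alpha3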

The tree decomposition and SVD served as the theoretical foundation for the application of tensor theory to high-dimensional approximation. In Part II, we discussed ALS tensor algorithms, how to utilize branch-wise evaluations, and which simplifications appear for discrete completion problems. Due to our elaborate theoretical framework, such considerations can conveniently be phrased for arbitrary (tree) tensor networks. We have shown that the application of a preconditioned, coarse CG method, instead of an exact solution within each micro-step, is generally faster, by up to orders of magnitude. Further, the convergence rate of this method is related to the TRIP, as well as to the relaxed iTRIP, which is also fulfilled for the sampling operator. On the practical side, we have observed that, even for tensor completion, usually only a few iterations per instance are sufficient.

Concerning the adaptation of the rank in iterative methods, we have demonstrated that stability (with respect to the calibration of complexity) is an essential issue which explains the failure of heuristics in many, if not most, situations. The subsequent analysis has shown that, interestingly, both the improvement of convergence through a reweighting in nuclear norm minimization and an averaging process that allows for properly rank-adaptive ALS methods lead to the closely related algorithms Rwals and Salsa. Both are, in terms of our definition, stable methods and thus allow continuous transitions between manifolds of different tensor ranks as well as the semi-implicit adaption of the model complexity. In numerical tests, we observed that the additional scaling in our stable ALS approximation is beneficial in the given framework. In fact, the number of required CG steps is much lower for Salsa than it is for Rwals, but a more sophisticated understanding remains subject to future work. For now, the theoretical reasoning for the scalings in Salsa is based on an idealized fixed point analysis which emphasizes the difference to Rwals.
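
Returning to the coarse CG micro-steps mentioned above, the following is a minimal, generic sketch (not the thesis implementation): the local operator A_loc, the right-hand side b_loc and the toy sizes are assumptions; the point is merely that only a few CG iterations are run from the current core instead of solving the local system exactly.

    import numpy as np
    from scipy.sparse.linalg import cg

    def coarse_cg_micro_step(A_loc, b_loc, x0, n_cg=3, M=None):
        """One micro-step in the spirit described above: run only a few
        (optionally preconditioned) CG iterations on the local system
        A_loc x = b_loc, starting from the current core x0, instead of
        solving it exactly."""
        x, _ = cg(A_loc, b_loc, x0=x0, maxiter=n_cg, M=M)
        return x

    # Toy usage with an assumed symmetric positive definite local operator.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((50, 50))
    A_loc = B.T @ B + 50.0 * np.eye(50)
    b_loc = rng.standard_normal(50)
    x_new = coarse_cg_micro_step(A_loc, b_loc, x0=np.zeros(50))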


We have further generalized these algorithms to continuous versions through a combination of smoothness and low-rank constraints. This approach can be formulated in an infinite-dimensional setting, as opposed to the usual framework in which specific finite-dimensional subspaces are chosen. Here, the discretization is based on optimal Kolmogorov subspaces, for which we derived nested, orthonormal basis descriptions, and whose finite dimensions can be as large as computationally feasible. Moreover, the thin-plate regularizer can be formulated as a bilinear form with ranks uniformly equal to three and, owing to its explicitly derived tree decomposition, can thus be handled efficiently. While the numerical experiments on data stemming from rolling press simulations yield promising results, open secondary questions remain, in particular concerning a more suitable transfer of the stability-related rank adaption methods to this specific setting, as well as the further elaboration of the connections between smoothness and low-rank constraints.
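
As an aside, the rank-three structure of the regularizer can be made explicit in the two-dimensional case. The following sketch is only an illustration under the assumption that Gram matrices A[k] and B[k] of the k-th derivatives of the two univariate bases are given; it is not the d-dimensional tree decomposition used in the thesis.

    import numpy as np

    def thin_plate_form_2d(A, B):
        """Two-dimensional illustration: for f(x, y) = sum_ij c_ij phi_i(x) psi_j(y),
        the thin-plate energy  int f_xx^2 + 2 f_xy^2 + f_yy^2  equals c^T Q c with
        Q = kron(A2, B0) + 2 kron(A1, B1) + kron(A0, B2), a sum of three Kronecker
        products; A[k] and B[k] are the Gram matrices of the k-th derivatives of
        the two univariate bases, and c is vectorized with i as the outer index."""
        return (np.kron(A[2], B[0])
                + 2.0 * np.kron(A[1], B[1])
                + np.kron(A[0], B[2]))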

Part III introduced the tensor feasibility problem and proved its equivalence to a version of the quantum marginal problem, as well as its relation to (pure) versions that introduce additional rank constraints. Again serving as a central tool, the tree SVD allows conditions for feasibility in high dimensions to be decoupled into equivalent collections of lower-order subproblems that can to a large extent be handled by QMP results. Thereby, it is revealed that squared feasible singular values associated with hierarchical families of matricizations are closed, convex, polyhedral cones. We have further demonstrated how geometric methods such as cone projections can be combined with tensor network theory by providing an alternative proof of a sufficiency result for the feasibility of largest Tucker singular values. This approach moreover allows a corresponding tensor to be constructed near instantaneously.

Last but not least, we analyzed the specific setting for the tensor train format via an equivalent formulation similar to Weyl's problem. Based on a coupling of so-called honeycombs, we have discussed the (constructive) applicability of linear programming algorithms to the TT-TFP. The hives determined in this way can be interpreted and visualized in the form of interrelated graphs, while the unions of possible configurations exhibit the H-descriptions of the corresponding cones. We have derived the vertex descriptions for a certain class of involved sizes and, moreover, proven simple, universal necessary inequalities through comparatively elementary methods. These allow, as demonstrated, further generalizations to be derived based on other, more specific findings, such as some from the QMP literature. We then applied these more theoretical results to derive, for example, relations between exponential decays of different singular values in the TT format. It remains a conjecture whether the simple class of necessary inequalities that we introduced is also sufficient in the special cases we carried out. In conclusion, we discussed which approaches are best suited to confirm or rule out feasibility, as well as how to construct tensors with prescribed singular values in parallel, based on the initially introduced decoupling.

Page 217: Tree Tensor Networks, Associated Singular Values and High … · 2020. 5. 21. · their importance for high-dimensional approximation, in particular for model complexity adaption.

References

[1] M. Ali and A. Nouy, Singular value decomposition in Sobolev spaces, arXiv preprint arXiv:1809.11001, (2018), https://arxiv.org/abs/1809.11001.

[2] M. Bachmayr and W. Dahmen, Adaptive near-optimal rank tensor approximation for high-dimensional operator equations, Foundations of Computational Mathematics, 15 (2015), pp. 839–898, https://doi.org/10.1007/s10208-013-9187-3.

[3] M. Bachmayr and R. Schneider, Iterative methods based on soft thresholding of hierarchical tensors, Foundations of Computational Mathematics, (2016), pp. 1–47, https://doi.org/10.1007/s10208-016-9314-z.

[4] M. Bachmayr, R. Schneider, and A. Uschmajew, Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations, Foundations of Computational Mathematics, (2016), pp. 1–50, https://doi.org/10.1007/s10208-016-9317-9.

[5] J. Ballani and L. Grasedyck, A projection method to solve linear systems in tensor format, Numerical Linear Algebra with Applications, 20 (2013), pp. 27–43, https://doi.org/10.1002/nla.1818.

[6] J. Ballani, L. Grasedyck, and M. Kluge, Black box approximation of tensors in hierarchical Tucker format, Linear Algebra and its Applications, 438 (2013), pp. 639–657, https://doi.org/10.1016/j.laa.2011.08.010.

[7] G. Beylkin and M. J. Mohlenkamp, Numerical operator calculus in higher dimensions, PNAS, 99 (2002), pp. 10246–10251, https://doi.org/10.1073/pnas.112329799.

[8] R. Bhatia, Linear algebra to quantum cohomology: The story of Alfred Horn's inequalities, The American Mathematical Monthly, 108 (2001), pp. 289–318, https://doi.org/10.2307/2695237.

[9] J. D. Biamonte and V. Bergholm, Tensor networks in a nutshell, 2017, https://arxiv.org/abs/1708.00006.

[10] M. Boussé, N. Vervliet, I. Domanov, O. Debals, and L. De Lathauwer, Linear systems with a canonical polyadic decomposition constrained solution: Algorithms and applications, Numer. Linear Algebra Appl., 25 (2018), p. e2190, https://doi.org/10.1002/nla.2190.

[11] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), p. 717, https://doi.org/10.1007/s10208-009-9045-5.

[12] E. J. Candès, J. K. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59 (2006), pp. 1207–1223, https://doi.org/10.1002/cpa.20124.

[13] E. J. Candès and T. Tao, Decoding by linear programming, IEEE Transactions on Information Theory, 51 (2005), pp. 4203–4215, https://doi.org/10.1109/TIT.2005.858979.

[14] E. J. Candès and T. Tao, The power of convex relaxation: Near-optimal matrix completion, IEEE Trans. Inf. Theor., 56 (2010), pp. 2053–2080, https://doi.org/10.1109/TIT.2010.2044061.

[15] E. J. Candès and M. B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, 25 (2008), pp. 21–30, https://doi.org/10.1109/MSP.2007.914731.

[16] E. J. Candès, M. B. Wakin, and S. P. Boyd, Enhancing sparsity by reweighted ℓ1 minimization, Journal of Fourier Analysis and Applications, 14 (2008), pp. 877–905, https://doi.org/10.1007/s00041-008-9045-x.

[17] A. Cichocki, Era of big data processing: A new approach via tensor networks and tensor decompositions, arXiv preprint arXiv:1403.2048, (2014), https://arxiv.org/abs/1403.2048.

[18] S. Daftuar and P. Hayden, Quantum state transformations and the Schubert calculus, Annals of Physics, 315 (2005), pp. 80–122, https://doi.org/10.1016/j.aop.2004.09.012. Special Issue.

[19] I. Daubechies, R. DeVore, M. Fornasier, and S. Güntürk, Iteratively re-weighted least squares minimization: Proof of faster than linear rate for sparse recovery, (2008), pp. 26–29, https://doi.org/10.1109/CISS.2008.4558489.


[20] L. De Lathauwer, A survey of tensor methods, in 2009 IEEE International Symposium on Circuits and Systems (ISCAS), May 2009, pp. 2773–2776, https://doi.org/10.1109/ISCAS.2009.5118377.

[21] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM Journal on Matrix Analysis and Applications, 21 (2000), pp. 1253–1278, https://doi.org/10.1137/S0895479896305696.

[22] S. V. Dolgov and D. V. Savostyanov, Alternating minimal energy methods for linear systems in higher dimensions, SIAM Journal on Scientific Computing, 36 (2014), pp. A2248–A2271, https://doi.org/10.1137/140953289.

[23] I. Domanov, A. Stegeman, and L. De Lathauwer, On the largest multilinear singular values of higher-order tensors, SIAM Journal on Matrix Analysis and Applications, 38 (2017), pp. 1434–1453, https://doi.org/10.1137/16M110770X.

[24] F. M. Dopico, A note on sin θ theorems for singular subspace variations, BIT Numerical Mathematics, 40 (2000), pp. 395–403, https://doi.org/10.1023/A:1022303426500.

[25] J. Duchon, Splines minimizing rotation-invariant semi-norms in Sobolev spaces, in Constructive Theory of Functions of Several Variables, W. Schempp and K. Zeller, eds., Berlin, Heidelberg, 1977, Springer Berlin Heidelberg, pp. 85–100, https://doi.org/10.1007/BFb0086566.

[26] D. Eberly, Thin-plate splines, 2015, https://www.geometrictools.com/Documentation/ThinPlateSplines.pdf.

[27] J. Eisert, Entanglement and tensor network states, arXiv:1308.3318, (2013), https://arxiv.org/abs/1308.3318.

[28] M. Espig, W. Hackbusch, and A. Khachatryan, On the convergence of alternating least squares optimisation in tensor format representations, arXiv:1506.00062, (2015), https://arxiv.org/abs/1506.00062.

[29] M. Espig and A. Khachatryan, Convergence of alternating least squares optimisation for rank-one approximation to high order tensors, arXiv:1503.05431, (2015), https://arxiv.org/abs/1503.05431.

[30] S. Etter, Parallel ALS algorithm for solving linear systems in the hierarchical Tucker representation, SIAM Journal on Scientific Computing, 38 (2016), pp. A2585–A2609, https://doi.org/10.1137/15M1038852.

[31] K. Fan, On a theorem of Weyl concerning eigenvalues of linear transformations. I, Proceedings of the National Academy of Sciences of the United States of America, 35 (1949), pp. 652–655, https://doi.org/10.1073/pnas.35.11.652.

[32] M. Fornasier, H. Rauhut, and R. Ward, Low-rank matrix recovery via iteratively reweighted least squares minimization, SIAM Journal on Optimization, 21 (2011), pp. 1614–1640, https://doi.org/10.1137/100811404.

[33] M. Franz, Moment polytopes of projective G-varieties and tensor products of symmetric group representations, J. Lie Theory, 12 (2002), pp. 539–549, http://www.heldermann.de/JLT/JLT12/JLT122/jlt12033.htm.

[34] S. Friedland, Finite and infinite dimensional generalizations of Klyachko's theorem, Linear Algebra and its Applications, 319 (2000), pp. 3–22, https://doi.org/10.1016/S0024-3795(00)00217-2.

[35] S. Friedland and L.-H. Lim, Computational complexity of tensor nuclear norm, arXiv preprint arXiv:1410.6072, (2014), https://arxiv.org/abs/1410.6072.

[36] W. Fulton, Eigenvalues, invariant factors, highest weights, and Schubert calculus, Bull. Amer. Math. Soc. (N.S.), 37 (2000), pp. 209–249, https://doi.org/10.1090/S0273-0979-00-00865-X.

[37] W. Fulton, Eigenvalues of majorized Hermitian matrices and Littlewood-Richardson coefficients, Linear Algebra and its Applications, 319 (2000), pp. 23–36, https://doi.org/10.1016/S0024-3795(00)00218-4.

[38] S. Gandy, B. Recht, and I. Yamada, Tensor completion and low-n-rank tensor recovery via convex optimization, Inverse Problems, 27 (2011), p. 025010, https://doi.org/10.1088/0266-5611/27/2/025010.

[39] K. Glau, D. Kressner, and F. Statti, Low-rank tensor approximation for Chebyshev interpolation in parametric option pricing, arXiv preprint arXiv:1902.04367, (2019), https://arxiv.org/abs/1902.04367.

[40] L. Grasedyck, Hierarchical singular value decomposition of tensors, SIAM Journal on Matrix Analysis and Applications, 31 (2010), pp. 2029–2054, https://doi.org/10.1137/090764189.

[41] L. Grasedyck, M. Kluge, and S. Krämer, Variants of alternating least squares tensor completion in the tensor train format, SIAM Journal on Scientific Computing, 37 (2015), pp. A2424–A2450, https://doi.org/10.1137/130942401.


[42] L. Grasedyck and S. Krämer, Stable ALS approximation in the TT-format for rank-adaptive tensor completion, Numerische Mathematik, (2019), https://doi.org/10.1007/s00211-019-01072-4.

[43] L. Grasedyck, D. Kressner, and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitteilungen, 36 (2013), pp. 53–78, https://doi.org/10.1002/gamm.201310004.

[44] E. Grelier, A. Nouy, and M. Chevreuil, Learning with tree-based tensor formats, arXiv preprint arXiv:1811.04455, (2018), https://arxiv.org/abs/1811.04455.

[45] D. Gross, Recovering low-rank matrices from few coefficients in any basis, IEEE Transactions on Information Theory, 57 (2011), pp. 1548–1566, https://doi.org/10.1109/TIT.2011.2104999.

[46] W. Hackbusch, Tensor Spaces and Numerical Tensor Calculus, Springer Berlin Heidelberg, 2012, https://doi.org/10.1007/978-3-642-28027-6.

[47] W. Hackbusch, D. Kressner, and A. Uschmajew, Perturbation of higher-order singular values, SIAM Journal on Applied Algebra and Geometry, 1 (2017), pp. 374–387, https://doi.org/10.1137/16M1089873.

[48] W. Hackbusch and S. Kühn, A new scheme for the tensor representation, Journal of Fourier Analysis and Applications, 15 (2009), pp. 706–722, https://doi.org/10.1007/s00041-009-9094-9.

[49] W. Hackbusch and A. Uschmajew, On the interconnection between the higher-order singular values of real tensors, Numerische Mathematik, 135 (2017), pp. 875–894, https://doi.org/10.1007/s00211-016-0819-9.

[50] T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh, Matrix completion and low-rank SVD via fast alternating least squares, J. Mach. Learn. Res., 16 (2015), pp. 3367–3402, http://dl.acm.org/citation.cfm?id=2789272.2912106.

[51] T. L. Hayden, The extension of bilinear functionals, Pacific J. Math., 22 (1967), pp. 99–108, https://projecteuclid.org:443/euclid.pjm/1102992296.

[52] U. Helmke and J. Rosenthal, Eigenvalue inequalities and Schubert calculus, Mathematische Nachrichten, 171 (1995), pp. 207–225, https://doi.org/10.1002/mana.19951710113.

[53] A. Higuchi, On the one-particle reduced density matrices of a pure three-qutrit quantum state, (2003), https://arxiv.org/abs/quant-ph/0309186.

[54] A. Higuchi, A. Sudbery, and J. Szulc, One-qubit reduced states of a pure many-qubit state: Polygon inequalities, Phys. Rev. Lett., 90 (2003), p. 107902, https://doi.org/10.1103/PhysRevLett.90.107902.

[55] F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products, Journal of Mathematics and Physics, 6 (1927), pp. 164–189, https://doi.org/10.1002/sapm192761164.

[56] S. Holtz, T. Rohwedder, and R. Schneider, The alternating linear scheme for tensor optimization in the tensor train format, SIAM Journal on Scientific Computing, 34 (2012), pp. A683–A713, https://doi.org/10.1137/100818893.

[57] S. Holtz, T. Rohwedder, and R. Schneider, On manifolds of tensors of fixed TT-rank, Numerische Mathematik, 120 (2012), pp. 701–731, https://doi.org/10.1007/s00211-011-0419-7.

[58] A. Horn, Eigenvalues of sums of Hermitian matrices, Pacific J. Math., 12 (1962), pp. 225–241, http://projecteuclid.org/euclid.pjm/1103036720.

[59] D. (https://mathoverflow.net/users/9652/dirk), Orthogonal system of functions ordered by norm of second derivative, MathOverflow, https://mathoverflow.net/q/289921 (version: 2018-01-05).

[60] P. Jain, P. Netrapalli, and S. Sanghavi, Low-rank matrix completion using alternating minimization, in Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, New York, NY, USA, 2013, ACM, pp. 665–674, https://doi.org/10.1145/2488608.2488693.

[61] E. Jeckelmann, Dynamical density-matrix renormalization-group method, Phys. Rev. B, 66 (2002), p. 045114, https://doi.org/10.1103/PhysRevB.66.045114.

[62] A. A. Klyachko, Stable bundles, representation theory and Hermitian operators, Selecta Math. (N.S.), 4 (1998), pp. 419–445, https://doi.org/10.1007/s000290050037.

[63] A. A. Klyachko, Quantum marginal problem and N-representability, Journal of Physics: Conference Series, 36 (2006), pp. 72–86, https://doi.org/10.1088/1742-6596/36/1/014.

[64] A. Knutson and T. Tao, The honeycomb model of GLn(C) tensor products. I. Proof of the saturation conjecture, J. Amer. Math. Soc., 12 (1999), pp. 1055–1090, https://doi.org/10.1090/S0894-0347-99-00299-4.

[65] A. Knutson and T. Tao, Honeycombs and sums of Hermitian matrices, Notices Amer. Math. Soc., 48 (2001), pp. 175–186.


[66] A. Knutson, T. Tao, and C. Woodward, The honeycomb model of GLn(C) tensor products. II. Puzzles determine facets of the Littlewood-Richardson cone, J. Amer. Math. Soc., 17 (2004), pp. 19–48, https://doi.org/10.1090/S0894-0347-03-00441-7.

[67] A. Kolmogoroff, Über die beste Annäherung von Funktionen einer gegebenen Funktionenklasse, Annals of Mathematics, 37 (1936), pp. 107–110, http://www.jstor.org/stable/1968691.

[68] S. Krämer, A geometric description of feasible singular values in the tensor train format, SIAM Journal on Matrix Analysis and Applications, 40 (2019), pp. 1153–1178, https://doi.org/10.1137/18M1192408.

[69] D. Kressner, M. Steinlechner, and B. Vandereycken, Low-rank tensor completion by Riemannian optimization, BIT Numerical Mathematics, 54 (2014), pp. 447–468, https://doi.org/10.1007/s10543-013-0455-z.

[70] D. Kressner, M. Steinlechner, and B. Vandereycken, Preconditioned low-rank Riemannian optimization for linear systems with tensor product structure, SIAM Journal on Scientific Computing, 38 (2016), pp. A2018–A2044, https://doi.org/10.1137/15M1032909.

[71] J. M. Landsberg, Y. Qi, and K. Ye, On the geometry of tensor network states, Quantum Info. Comput., 12 (2012), pp. 346–354, http://dl.acm.org/citation.cfm?id=2230976.2230988.

[72] C.-K. Li, Y.-T. Poon, and X. Wang, Ranks and eigenvalues of states with prescribed reduced states, Electronic Journal of Linear Algebra, 27 (2014), https://doi.org/10.13001/1081-3810.2882.

[73] Y. Liu and F. Shang, An efficient matrix factorization method for tensor completion, IEEE Signal Processing Letters, 20 (2013), pp. 307–310, https://doi.org/10.1109/LSP.2013.2245416.

[74] J. Lohmar, S. Seuren, M. Bambach, and G. Hirt, Design and application of an advanced fast rolling model with through thickness resolution for heavy plate rolling, in Ingot casting, rolling & forging: ICRF; 2nd International Conference; Milan, Italy, AIM, Associazione Italiana di Metallurgia, 2014, http://publications.rwth-aachen.de/record/225848.

[75] H. G. Matthies and E. Zander, Solving stochastic systems with low-rank tensor compression, Linear Algebra and its Applications, 436 (2012), pp. 3819–3838, https://doi.org/10.1016/j.laa.2011.04.017.

[76] K. Mohan and M. Fazel, Iterative reweighted algorithms for matrix rank minimization, Journal of Machine Learning Research, 13 (2012), pp. 3441–3473, http://www.jmlr.org/papers/v13/mohan12a.html.

[77] C. Mu, B. Huang, J. Wright, and D. Goldfarb, Square deal: Lower bounds and improved relaxations for tensor recovery, in Proceedings of the 31st International Conference on Machine Learning (ICML-14), T. Jebara and E. P. Xing, eds., JMLR Workshop and Conference Proceedings, 2014, pp. 73–81, http://jmlr.org/proceedings/papers/v32/mu14.pdf.

[78] B. Natarajan, Sparse approximate solutions to linear systems, SIAM Journal on Computing, 24 (1995), pp. 227–234, https://doi.org/10.1137/S0097539792240406.

[79] A. Nouy, Low-Rank Tensor Methods for Model Order Reduction, Springer International Publishing, Cham, 2017, pp. 857–882, https://doi.org/10.1007/978-3-319-12385-1_21.

[80] R. Orús, A practical introduction to tensor networks: Matrix product states and projected entangled pair states, Annals of Physics, 349 (2014), pp. 117–158, https://doi.org/10.1016/j.aop.2014.06.013.

[81] I. Oseledets and S. Dolgov, Solution of linear systems and matrix inversion in the TT-format, SIAM Journal on Scientific Computing, 34 (2012), pp. A2718–A2739, https://doi.org/10.1137/110833142.

[82] I. Oseledets, M. Rakhuba, and A. Uschmajew, Alternating least squares as moving subspace correction, SIAM Journal on Numerical Analysis, 56 (2018), pp. 3459–3479, https://doi.org/10.1137/17M1148712.

[83] I. Oseledets and E. Tyrtyshnikov, TT-cross approximation for multidimensional arrays, Linear Algebra and its Applications, 432 (2010), pp. 70–88, https://doi.org/10.1016/j.laa.2009.07.024.

[84] I. V. Oseledets, Tensor-train decomposition, SIAM Journal on Scientific Computing, 33 (2011), pp. 2295–2317, https://doi.org/10.1137/090752286.

[85] H. Rauhut, R. Schneider, and Z. Stojanac, Tensor Completion in Hierarchical Tensor Representations, Springer International Publishing, Cham, 2015, pp. 419–450, https://doi.org/10.1007/978-3-319-16042-9_14.

[86] B. Recht, A simpler approach to matrix completion, Journal of Machine Learning Research, 12 (2011), pp. 3413–3430, http://www.jmlr.org/papers/v12/recht11a.html.

[87] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501, https://doi.org/10.1137/070697835.


[88] T. Rohwedder and A. Uschmajew, On local convergence of alternating schemes for optimization of convex problems in the tensor train format, SIAM Journal on Numerical Analysis, 51 (2013), pp. 1134–1162, https://doi.org/10.1137/110857520.

[89] C. Schilling, Quantum marginal problem and its physical relevance, PhD thesis, ETH Zürich, 2014, https://doi.org/10.3929/ethz-a-010139282. Diss., Eidgenössische Technische Hochschule ETH Zürich, Nr. 21748, 2014.

[90] A. Seigal, Gram determinants of real binary tensors, Linear Algebra and its Applications, 544 (2018), pp. 350–369, https://doi.org/10.1016/j.laa.2018.01.019.

[91] A. L. Seigal, Structured Tensors and the Geometry of Data, PhD thesis, UC Berkeley, 2019, https://escholarship.org/uc/item/9jv5j0f4.

[92] M. Signoretto, Q. Tran Dinh, L. De Lathauwer, and J. A. K. Suykens, Learning with tensors: a framework based on convex optimization and spectral regularization, Machine Learning, 94 (2014), pp. 303–351, https://doi.org/10.1007/s10994-013-5366-3.

[93] C. Da Silva and F. J. Herrmann, Optimization on the hierarchical Tucker manifold – applications to tensor completion, Linear Algebra and its Applications, 481 (2015), pp. 131–173, https://doi.org/10.1016/j.laa.2015.04.015.

[94] M. Steinlechner, Riemannian optimization for high-dimensional tensor completion, SIAM Journal on Scientific Computing, 38 (2016), pp. S461–S484, https://doi.org/10.1137/15M1010506.

[95] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, 31 (1966), pp. 279–311, https://doi.org/10.1007/BF02289464.

[96] A. Uschmajew, Regularity of tensor product approximations to square integrable functions, Constructive Approximation, 34 (2011), pp. 371–391, https://doi.org/10.1007/s00365-010-9125-4.

[97] A. Uschmajew, Local convergence of the alternating least squares algorithm for canonical tensor approximation, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 639–652, https://doi.org/10.1137/110843587.

[98] A. Uschmajew and B. Vandereycken, The geometry of algorithms using hierarchical tensors, Linear Algebra and its Applications, 439 (2013), pp. 133–166, https://doi.org/10.1016/j.laa.2013.03.016.

[99] A. Uschmajew and B. Vandereycken, On critical points of quadratic low-rank matrix optimization problems, Tech. report (submitted), July 2018.

[100] G. Vidal, Efficient classical simulation of slightly entangled quantum computations, Phys. Rev. Lett., 91 (2003), p. 147902, https://doi.org/10.1103/PhysRevLett.91.147902.

[101] P.-A. Wedin, Perturbation bounds in connection with singular value decomposition, BIT Numerical Mathematics, 12 (1972), pp. 99–111, https://doi.org/10.1007/BF01932678.

[102] Z. Wen, W. Yin, and Y. Zhang, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm, Mathematical Programming Computation, 4 (2012), pp. 333–361, https://doi.org/10.1007/s12532-012-0044-1.

[103] H. Weyl, Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung), Mathematische Annalen, 71 (1912), pp. 441–479, https://doi.org/10.1007/BF01456804.
