High Performance Computing in Deterministic Global Optimization
(Computación de Altas Prestaciones en Optimización Global Determinista)

Juan F. R. Herrera

Supervised by

Dr. Leocadio G. Casado and Dr. Eligius M. T. Hendrix


Juan F. R. Herrera

High Performance Computing in Deterministic Global Optimization
(Computación de Altas Prestaciones en Optimización Global Determinista)
133 pages

Thesis, University of Almeria, Almeria, ES (2015)
With summaries in English and Spanish


Acknowledgement

The elaboration of this Ph.D. thesis has been possible thanks to the collaboration and assistance of the people and institutions mentioned below.

In the first place, I would like to express my sincere gratitude to my thesis supervisors, Leo and Eligius. Without their trust and good guidance from the beginning, this work would not have been possible. Another person who also advised me during this period was Inma, to whom I am also grateful.

I would like to express my gratitude to every member of the research group I belong to, named “Supercomputación: Algoritmos”, for all the moments shared within and outside the University of Almeria during the last five years. Several members are working in Europe (Vienna and Paris, to name a few). New members came to the group to contribute to the success of the research team. With all of them I have shared breakfasts, conferences. . . not to mention lots of hours of work before deadlines.

During these years, I have had the opportunity to visit several research centres around Europe. I have good memories of my first research stay at the EPCC three years ago, granted by the HPC-Europa2 programme. During this two-month stay, I was able to be a user of HECToR, the UK's high-end computing resource at that time. I also met people from Tenerife and other parts of Europe like Austria, Germany, and Italy.

My second stay, the longest one, was at Wageningen University, the best university of the Netherlands. The Operations Research and Logistics team welcomed me with open arms. I learnt many positive things during my three-month stay. I thank Ron and Maaike for taking care of me during the whole stay. Dank u wel.

My last stay, the shortest one, took place at the PRiSM laboratory. I thank Bertrand, Tarek, and the rest of the team for their kindness during my one-month stay in Versailles. I thank Juan Álvaro for being my host during the first days. He suggested the Collège d'Espagne to me, a place where I had the opportunity to meet great people. Merci beaucoup.

I also express my gratitude to my family and friends who, directly or indirectly, have lent a hand in this project.

And, last but not least, I wish to thank and dedicate this book to the people who support me every day: my parents and my sister. Muchísimas gracias. Os quiero.

Thanks, everyone.

Juan F. R. Herrera
October 4, 2015



Preface

Every day, one faces decisions to be made within a framework of possibilities that may even have a dynamic character: today's decisions influence the possibilities of tomorrow. Models of decision making optimize an objective function within requirements expressed as constraints. In contrast to probabilistic methods, the use of exhaustive deterministic search methods guarantees that the solutions found are the best ones for the requested accuracy. The main drawback of deterministic methods is that the required computational burden is high, causing long execution times. Here, High Performance Computing (HPC) plays an important role in making the solution of such problems tractable. HPC is necessary not only to speed up the response time of algorithms; it is also essential for dealing with problems whose computational requirements exceed the resources offered by commodity computers.

This Ph.D. thesis aims to study and analyse mathematical and computational aspects related to deterministic methods for solving Global Optimization (GO) problems in parallel. In particular, problems having a direct application to industry and society are addressed. The main aim is to exploit the whole computational performance that supercomputers offer. To reach this aim, this work studies the improvement and subsequent parallelization of two deterministic algorithms that solve GO problems: Branch and Bound, and stochastic Dynamic Programming. The algorithms are studied from a computational point of view. The main goal of this study is to show how this kind of problem can be solved in a reasonable elapsed time through HPC techniques. Chapter 1 introduces the concept of GO as well as the above-mentioned methods. In addition, a brief review of the current state of parallel computing is provided.

In Part I, a Branch and Bound scheme to solve two GO problems is developed. The first problem is related to the mixture design of two products that share scarce raw materials. The sequential Branch and Bound algorithm is addressed in Chapter 2. Different approaches to parallelize the algorithms presented in Chapter 2 are described in Chapter 3. The second problem relates to the multidimensional GO problem using constants with respect to global information about the structure of the instances. Aspects like the division of the search space and the search strategy are addressed in Chapter 4. The evaluation of the search space can be performed in parallel. Chapter 5 discusses a hybrid approach, a combination of MPI and Pthreads, designed for a cluster of multi-core processors.

In Part II, the dynamic control of traffic lights is investigated, such that the average waiting time for vehicles is minimal. In Chapter 6, a Markov Decision Process is implemented in C, generating Traffic Control Tables for signalized intersections in isolation as well as for networks of intersections. Chapter 7 analyses the state space and its possible partitioning in order to solve instances where the required memory is greater than the available memory of a commodity computer.

Chapter 8 summarizes the findings obtained during the elaboration of this Ph.D. thesis.



Prefacio

Every day, one must choose among a range of possibilities that, on certain occasions, have a dynamic character. The decisions taken today influence the possibilities of tomorrow. Through decision-making models, an objective function is optimized within requirements expressed in the form of constraints. In contrast to probabilistic methods, the use of exhaustive deterministic search methods allows such models to be solved with a guaranteed accuracy. The main drawback of deterministic methods is that they require a high computational effort, which entails long execution times. Here, high performance computing plays a very important role in making the deterministic solution of problems tractable. Parallelism is necessary not only to speed up the response time of an algorithm, but also to solve problems whose computational requirements exceed the resources offered by commodity computers.

This doctoral thesis has as its objective the study and analysis of both the mathematical and the computational aspects related to deterministic methods for solving Global Optimization problems in parallel. In particular, problems with a direct relation to operations research are addressed. The main objective is to exploit all the potential that a supercomputer offers. To reach this goal, this work studies the improvement and subsequent parallelization of two deterministic methods that solve Global Optimization problems: Branch and Bound, and stochastic Dynamic Programming. The algorithms are studied from a computational point of view. The main objective of this study is to show how this kind of problem can be solved in a reasonable time using high performance computing techniques. Chapter 1 introduces the concept of Global Optimization, as well as the above-mentioned methods. In addition, it provides a brief review of the current state of parallel computing.

Part I deals with Branch and Bound algorithms for the solution of two problems of a different nature. The first problem is related to the design of mixtures for two products that share scarce raw materials. The sequential algorithm is explained in Chapter 2. Chapter 3 shows several parallel versions to accelerate the algorithms presented in Chapter 2. The second problem is related to multidimensional Global Optimization using constants with respect to the global information about the structure of the instances to be solved. Aspects such as the division of the search space and the search strategy are shown in Chapter 4. The evaluation of the search space can be carried out in parallel. Chapter 5 presents a hybrid solution, a combination of MPI and Pthreads, designed for a cluster of multi-core processors.

Part II investigates the dynamic control of traffic lights, such that the waiting time of the vehicles at a given intersection is as small as possible. In Chapter 6, a Markov decision process that generates traffic control tables for intersections regulated by traffic lights, and for two interconnected intersections, is implemented in C. Chapter 7 analyses the state space and its possible partitioning in order to solve cases where the required RAM is greater than that available in a commodity computer.

Chapter 8 gathers the conclusions obtained during the elaboration of this doctoral thesis.


Contents

Acknowledgement
Preface
Prefacio
Contents

1 Introduction
  1.1 Main concepts of Global Optimization
    1.1.1 Pareto optimality
    1.1.2 Solution approaches
  1.2 Description of Branch and Bound
    1.2.1 Branching rule
    1.2.2 Bounding rule
    1.2.3 Selection rule
    1.2.4 General B&B method
    1.2.5 Search anomalies in parallel B&B
  1.3 Description of Dynamic Programming
  1.4 High Performance Computing
    1.4.1 Shared-memory model
    1.4.2 Distributed-memory model
    1.4.3 Heterogeneous model
    1.4.4 Hybrid model
    1.4.5 Computing infrastructure used in this thesis
    1.4.6 Parallel performance measurement
  1.5 Research questions

I Branch and Bound

2 Branch and Bound applied to the bi-blending problem
  2.1 Introduction
    2.1.1 Blending problem
    2.1.2 Bi-blending problem
  2.2 Algorithm for finding a solution
    2.2.1 Branching rule
    2.2.2 Bounding rule
    2.2.3 Termination rule
    2.2.4 Selection rule
    2.2.5 Rejection rule
  2.3 New bi-blending rejection rules
    2.3.1 Capacity test
    2.3.2 Pareto test
  2.4 Final testing after the completion of the algorithm
  2.5 Iterative-descending B&B strategy
  2.6 Evaluation
    2.6.1 Three-dimensional cases
    2.6.2 Five-dimensional cases
    2.6.3 Seven-dimensional cases
    2.6.4 Experimental results for the iterative-descending algorithm
  2.7 Summary

3 Parallelization of the bi-blending algorithm
  3.1 Parallel strategy
  3.2 Experimental results
    3.2.1 B&B phase in parallel
    3.2.2 Combination phase in parallel
  3.3 Summary

4 Simplicial Branch and Bound applied to Global Optimization
  4.1 Simplicial B&B method for multidimensional GO
    4.1.1 Initial space
    4.1.2 Branching rule
    4.1.3 Bounding rule
    4.1.4 Selection rule
    4.1.5 Rejection rule
  4.2 Evaluation
    4.2.1 Comparison of selection strategies
    4.2.2 Comparison of the LEB strategies
  4.3 Summary

5 Parallelization of simplicial Branch and Bound
  5.1 Branch and Bound in parallel
  5.2 Shared-memory models
    5.2.1 Bobpp framework
    5.2.2 TBB
    5.2.3 Pthreads
  5.3 Message passing models
    5.3.1 Hybrid MPI-Pthreads
    5.3.2 Inter-node dynamic load balancing
  5.4 Experimental results
    5.4.1 Shared-memory approach
    5.4.2 Distributed-memory approach
  5.5 Summary

II Dynamic Programming

6 Dynamic Programming applied to traffic control
  6.1 Introduction
  6.2 Model description for a single intersection
    6.2.1 Goal
    6.2.2 Model assumptions
    6.2.3 Formulation as a Markov Decision Process
    6.2.4 Bellman's principle of optimality
  6.3 Studied cases of the TCT model
    6.3.1 State s
    6.3.2 Control action x
    6.3.3 State transition
    6.3.4 Objective function
  6.4 Value Iteration through backward induction
  6.5 Evaluation
  6.6 Summary

7 Determination of Traffic Control Tables in parallel
  7.1 Parallel models
    7.1.1 Shared-memory approach
    7.1.2 Distributed-memory model
  7.2 Experimental results
  7.3 Summary

8 Conclusion
  8.1 Discussion of the contributions
  8.2 Future lines of research

Appendices
  A Products
  B Function definitions
  C Publications arisen from this thesis
  D Other publications produced during the elaboration of this thesis

Bibliography
List of Figures
List of Tables
List of Algorithms


CHAPTER 1

Introduction

This chapter introduces the concept of Global Optimization, a concept that will be referred to in forthcoming chapters. Methods like Branch and Bound, and Dynamic Programming, are described. Moreover, an overview of the current state of parallel computing is presented.

1.1 Main concepts of Global Optimization

One of the most fundamental principles in our world is finding an optimal decision. Many recent advances in fields such as science, economics or engineering rely on numerical techniques to calculate the best global solutions in optimization problems. The aim of Global Optimization is to find the best global solution of a (possibly non-linear) model in the presence (or not) of multiple local optima, although such a solution may not exist. Non-linear models are present in many applications, such as advanced engineering design, biotechnology, data analysis, environmental management, financial planning, process control, risk management, scientific modelling, etc.

The formulation of the problem of Global Optimization has the general form

    minimize or maximize f(x)
    subject to x ∈ D,

where D ⊂ R^n is the feasible domain and f : A → R is the objective function, with A ⊂ R^n a set that includes D.

In optimization problems, a point x∗ ∈ D is said to be a local minimum if f(x∗) ≤ f(x) for all x ∈ D satisfying ‖x − x∗‖ ≤ ε, where ε > 0 and ‖ · ‖ is a distance norm. In the same way, a point x∗ ∈ D is a local maximum if f(x∗) ≥ f(x) for all such x. We say that x∗ is a global minimum if f(x∗) ≤ f(x), ∀x ∈ D. The global optimum value of f is denoted by f(x∗) or f∗.
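Restating these definitions compactly in LaTeX (a restatement of the above, with ε > 0 fixing the size of the neighbourhood):

    x^* \in D \text{ is a local minimum} \iff \exists\, \varepsilon > 0:\;
    f(x^*) \le f(x)\ \ \forall x \in D \text{ with } \|x - x^*\| \le \varepsilon;
    \qquad
    x^* \in D \text{ is a global minimum} \iff f(x^*) \le f(x)\ \ \forall x \in D.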

Figure 1.1 illustrates a function f defined in a two-dimensional space D. As shown in the figure, one can distinguish the local and global optima. A global optimum is optimal in the entire search space D, while a local optimum is optimal only for a subset of D.


Figure 1.1: Global and local optima in a two-dimensional function

One example is the search for the highest peak above sea level on Earth. A local optimum is Mount Elbrus in Europe, Kilimanjaro on the African continent, and Mount Everest in Asia. However, the global optimum is Mount Everest, because it is the highest peak of the Earth's surface. One can see that the local optimum for Asia is also the global optimum for the Earth. This illustrates that a global optimum belongs to the set of local optima.

From now on, the problem of Global Optimization will be considered in the form

    min_{x ∈ D} f(x),    (1.1)

where D = {x ∈ R^n : g_i(x) ≤ 0, i = 1, . . . , p} with constraints g_i : A → R, where set A is often A = R^n. Maximization problems are also covered by (1.1), because max{f(x) : x ∈ D} = −min{−f(x) : x ∈ D}. Furthermore, since g_i(x) ≥ 0 is equivalent to −g_i(x) ≤ 0, and g_i(x) = 0 is equivalent to g_i(x) ≤ 0 together with −g_i(x) ≤ 0, definition (1.1) covers many other types of constraints.

If all functions defined above are continuous and D ≠ ∅, the set of optimal solutions for problem (1.1) is not empty. To solve (1.1), most investigations have focused on the special case in which the structure of the mathematical model has a number of features such that there is a single minimum, which is both local and global. This is true, for example, if f is a convex function and D is a convex set. Methods developed for convex problems require only local information. With information from one or more points, an approximation of the solution of the original problem is constructed, which is used to calculate a new sample point in the next iteration. These methods ensure convergence to the global minimum if the convexity property holds. Otherwise, this convergence is not assured.

Although many types of problems belong to the above class, there exists a wide variety of problems where the existence of at least one minimum cannot be postulated or verified, complicating their solution.

If the feasible region is defined by linear constraints and the objective function is also linear, the problem can be solved using linear programming. Conversely, if a function involved in the problem is not linear, linear programming cannot be applied as a solution method.


All non-linear optimization techniques can locate at least local optima. However, there is no local criterion for deciding whether a solution is global. The fact that the optimization problem is non-linear, or non-convex, implies the possible existence of multiple local optima. Usually, the number of local optima is unknown and can be quite large. In addition, the objective function value can differ significantly between local and global optima. For this reason, local solutions are not valid in most cases. The use of a local optimum, especially in economic problems, can result in the loss of millions of euros, even when the difference between the local and the global optimum is small. Hence, Global Optimization can be extremely valuable.

Continuing with the analogy of finding the highest point on Earth: if one searches in Europe and determines that the highest point of the continent is Mount Elbrus, one might think that this is the highest point in the world, as there is no mountain higher than Mount Elbrus in its surroundings.

According to [15], Global Optimization problems can be characterized according to the following criteria:

• If all functions (the objective function and the constraints) are linear, the problem is linear. If there exists a non-linear function, the problem is non-linear.
• The search space can be constrained by equality and/or inequality constraints, or be unconstrained.
• The problem can be convex or non-convex.
• A differentiable problem has a differentiable objective function and constraints. The opposite is a non-differentiable problem.
• If the variable x ∈ R^n, the problem is continuous. The problem is discrete if variable x takes discrete values, for example x ∈ Z^n.

This classification is not exhaustive, because not all cases are considered and the criteria are not independent. For instance, if the function is differentiable, the problem has to be continuous (x ∈ R^n).

One way to obtain the global minimum is to determine all local minima and keep the best; this approach is impractical, though, because many problems are characterized by a large number of local minima. Notice that the number of local minimum points can even be infinite and not easy to characterize. Moreover, even the determination of a local minimum is not always easy. Most classical approaches cannot directly be applied to solve these problems, hampering their solution.

Naturally, under such circumstances, it is essential to use an appropriate global search strategy. Furthermore, instead of “exact” solutions, normally one has to accept various numerical approaches to the global solution set.

Such solution sets can result in hilly-landscape plots. For example, see Figure 1.2, which illustrates a relatively simple composition of trigonometric functions with polynomial arguments, f(x, y) = 0.2 (sin(x + 4y) − 2 cos(2x + 3y) − 3 sin(2x − y) + 4 cos(x − 2y)), on a two-dimensional search space. One can observe several local minima of the objective function, and can immediately visualize the potential difficulty of the general problem statement (1.1).

The first and sporadic works on Global Optimization emerged in the late 1950s. The evolution since then has been considerable. Hence, today's state of the art is characterized by dozens of monographs, an international journal (Journal of Global Optimization) and several thousand research articles dedicated exclusively to this topic.


Figure 1.2: Three-dimensional representation of a function with multiple local minima

1.1.1 Pareto optimality

Optimization problems can be divided into those designed to find the optimal solution of a single objective function and those designed to optimize a set of objective functions. The second class of problems is known as multi-objective optimization problems.

Global Optimization techniques are not only used to find the maximum or minimum of a function f. In many design or decision problems, such techniques are applied to a set F of p = |F| objective functions f_i, where each function represents a criterion to be optimized:

    F = {f_i : A → Y_i ; i = 1, . . . , p; Y_i ⊆ R}.    (1.2)

The mathematical foundations for multi-objective optimization, which fairly considers conflicting criteria, were established by Vilfredo Pareto in the late nineteenth century. Pareto optimality became an important concept in economics, game theory, engineering and social sciences.

Figure 1.3 illustrates the concept of dominating solutions. Feasible solutions p1 and p2 dominate a feasible solution p3 if each of them is better than p3 in at least one objective function and no worse in the other objective functions. A feasible solution is said to be Pareto-optimal if it is not dominated by any other, i.e., if there does not exist another element that improves one of the objective functions without worsening another objective function value. In general, the solution to a multi-objective optimization problem is not unique: it consists of the set of all non-dominated points, which forms the Pareto set in the space of the objective functions.
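As an illustration, a minimal C sketch of such a dominance test for minimization objectives could look as follows; the function name and the array representation of points are assumptions made for this example.

    #include <stddef.h>

    /* Returns 1 if point a dominates point b over p minimization
       objectives: a is no worse than b in every objective and
       strictly better in at least one. */
    int dominates(const double *a, const double *b, size_t p)
    {
        int strictly_better = 0;
        for (size_t i = 0; i < p; i++) {
            if (a[i] > b[i])
                return 0;            /* worse in some objective */
            if (a[i] < b[i])
                strictly_better = 1; /* strictly better in one */
        }
        return strictly_better;
    }

A point is then Pareto-optimal within a finite set if dominates() returns 0 for every other point tested against it.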

1.1.2 Solution approaches

The desired qualities in a Global Optimization algorithm are given by:

Correctness: Not to produce incorrect results.
Completeness: Find all possible solutions.
Convergence: Ensure that the algorithm goes towards the solution.
Certainty: Prove the existence or non-existence of solutions.

Figure 1.3: Pareto front

Only a few existing methods guarantee all the qualities enumerated in this list. There is no universal method for solving optimization problems. The method is chosen according to the characteristics of the problem, the quality requirements on the results, and the response time of the algorithm. A general classification of optimization algorithms makes a distinction between deterministic and probabilistic algorithms.

Deterministic algorithms do not take random decisions in the search, and their convergence proofs do not rely on chance, unlike probabilistic algorithms. Some of them offer finite completion for specific problems and others converge as the number of iterations tends towards infinity. A deterministic algorithm can be seen as a state machine where every state has exactly one transition for each possible input.

Solving the problem in a deterministic way becomes difficult if the problem is not analytically defined or the dimensionality of the search space is very large. Running this kind of problem in a deterministic way could result in an exhaustive enumeration of the search space, which would not be tractable, in terms of execution time, even for relatively small problems. Here is where the probabilistic algorithms come into play. Early work in this area, which has now become one of the most important fields of research in optimization, began in the mid-twentieth century.

All probabilistic methods use some random factor in their algorithms, and their demonstration of convergence depends on statistical arguments. Normally, probabilistic methods are applied to problems without restrictions and without knowledge of the objective function, giving results close to the optimal ones and sometimes even the optimal solution. These algorithms are based on obtaining values of the objective function at points of the search region chosen randomly but in a guided way. A disadvantage of these methods is that there is no guarantee that the solution is found in a finite number of steps, nor that the global minimum is found.

Heuristics are used in Global Optimization schemes to help decide which part of a set of possible solutions is evaluated next. Deterministic algorithms often use heuristics to define the order of processing solution candidates. On the other hand, probabilistic methods consider only those elements of the search space that have been heuristically selected.


According to [69], algorithms can be classified according to the rigour of their provided solutions:

• An incomplete method uses intuitive heuristics to search, but is susceptible to getting stuck in a local optimum.
• An asymptotically complete method reaches a global optimum if the algorithm runs for an infinitely long period, but the method cannot check whether a global optimum has been found.
• A complete method reaches a global optimum, assuming exact calculations and infinitely long execution time, knowing after a finite time whether an approximate global solution has been found.
• A rigorous method reaches a global optimum with certainty even in the presence of rounding errors.

Often, the latter two categories of algorithms are characterized as deterministic. Nevertheless, this characterization is slightly confusing, as many incomplete and asymptotically complete methods are also deterministic.

Complete (not to mention rigorous) methods guarantee (in exact arithmetic) to find the global optimum with a predictable amount of work depending on the nature of the problem; i.e., this type of method guarantees the absence of systematic deficiencies that prevent finding a global optimum. The limit on the amount of work is often very high, which may lead to a long execution time.

The simplest complete method for constrained problems is grid search [69], where the search space is covered by a grid and each point is analysed in search of a global optimum. Since the number of points in a grid grows exponentially with the dimension, grid search is effective only when the number of dimensions of the problem is small. More efficient methods generally combine branching techniques with one or more local optimization techniques: convex analysis, interval analysis, and constraint programming.
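To make the idea concrete, a minimal C sketch of grid search on the two-dimensional test function of Figure 1.2 could look as follows; the grid resolution (steps) and the search box [−5, 5]² are assumptions chosen for this illustration.

    #include <math.h>
    #include <stdio.h>

    /* The test function of Figure 1.2. */
    static double f(double x, double y)
    {
        return 0.2 * (sin(x + 4*y) - 2*cos(2*x + 3*y)
                      - 3*sin(2*x - y) + 4*cos(x - 2*y));
    }

    int main(void)
    {
        const int steps = 1000;           /* grid resolution (assumed) */
        const double lo = -5.0, hi = 5.0; /* search box of Figure 1.2 */
        const double h = (hi - lo) / steps;
        double best = INFINITY, bx = 0.0, by = 0.0;

        /* Evaluate f at every grid point and keep the best value.
           The work grows as (steps + 1)^n with the dimension n,
           which is why grid search only pays off for small n. */
        for (int i = 0; i <= steps; i++)
            for (int j = 0; j <= steps; j++) {
                double x = lo + i * h, y = lo + j * h;
                double v = f(x, y);
                if (v < best) { best = v; bx = x; by = y; }
            }

        printf("best grid point: f(%g, %g) = %g\n", bx, by, best);
        return 0;
    }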

Overall, complete methods (including approximate methods that reduce the problem to another one that can be solved in a reasonable time) are more reliable than incomplete methods. A good heuristic with probabilistic choices (similar to, but usually simpler than, those of incomplete methods) also plays an important role in complete methods, mainly to provide good feasible points quickly in support of the full search.

In the sequel, two deterministic methods to solve Global Optimization problems will be introduced: the Branch and Bound method, and Dynamic Programming.

1.2 Description of Branch and Bound

Branch and Bound (B&B) is by far the most widely used tool for solving large-scale hard combinatorial optimization problems [17]. A B&B algorithm searches the complete space of solutions of a given problem for the best solution. However, explicit enumeration is normally impossible due to the increasing number of potential solutions.

B&B algorithms appeared in the literature from the second half of the 20th century, where researchers described enumerative schemes for solving NP-complete problems. Due to the generality and effectiveness of the method, this type of scheme is widely used [51]: enumeration problems in combinatorial mathematics, problem solving in artificial intelligence, and optimization in mathematical programming and operational research. In fact, it is still one of the best methods to tackle difficult problems. The name Branch and Bound was given to this method by Little, Murty, Sweeney, and Karel in their innovative paper about the travelling salesman problem [58]. Lawler and Wood studied B&B algorithms [55] and obtained a description independent of the problem, theirs being the first paper to show a general model of a B&B algorithm.

B&B methods are algorithms based on search trees. The root node corresponds to the original problem to be solved, and each other node corresponds to a subproblem of the original problem. The term subproblem is used to denote a problem derived from the originally-given problem through branching. The basic idea of B&B algorithms consists of recursively decomposing the original problem into disjoint subproblems until the optimal solution is reached and its optimality proved. The search tree is developed dynamically during the search and initially consists of only the root node. The method avoids visiting those subproblems that are known not to contain any solution. The goal is to explore the whole space without unnecessary partitioning.

If an approximate solution is found in early stages of the search, it helps to reduce the search. For many problems, a feasible solution is produced in advance using a metaheuristic, and its value is used as the current best solution.

According to [50, 66], a B&B algorithm generally consists of four rules: branching, bounding, selection, and rejection. Depending on the problem type, a termination rule can be added. The selection rule determines how the search is performed. The other rules depend on the problem to be solved. A brief description of the B&B rules is the following:

Branching: Determines how a subspace of the search space is divided into two or more subspaces that cover the divided one.
Bounding: Calculates a lower bound of the optimum in a given subspace.
Selection: Defines the subspace to be processed next.
Rejection: Recognizes and rejects subspaces that do not contain an optimal solution of the original problem.
Termination: Given a required accuracy, determines whether a subspace belongs to the solution area.

A good understanding of the structure of the problem to be solved helps to choose the basic rules appropriately, reducing the computational burden and increasing the chance to solve large and complicated instances of a problem in a reasonable time.

The computational burden to solve a Global Optimization problem usually increases exponentially with the dimension of the search space due to the exhaustive search performed. Therefore, parallel computation can be applied to solve this type of problem, reducing the computational time.

In the sequel, we focus on the main rules, because these are critical for improving the computing performance and, consequently, the range of difficult problems that can be addressed.

1.2.1 Branching rule

Depending on the problem, the search region can be divided into general polygons or into other special sets, like triangles (see the Big Triangle Small Triangle method [24]) or rectangles (see the Big Square Small Square method [38]). Simplicial division will be studied in this thesis. An n-simplex is the convex hull of n + 1 affinely independent vertices. A simplex is a polyhedron in a multidimensional space with a minimal number of vertices. Therefore, simplicial partitions are preferred in Global Optimization when the values of the objective function at all vertices of a partition are used to evaluate subregions [75].


If the subspace in question is subdivided into two, the term bisection branching is used; otherwise, one talks about multisection branching.

1.2.2 Bounding rule

The bounding function is the key component of a B&B algorithm in the sense that a low-quality bounding function cannot be compensated for through good choices of branching and selection strategies [17]. Ideally, the value of a bounding function for a given subproblem should equal the value of the best feasible solution to the problem. However, since obtaining this value is usually in itself NP-hard, the aim is to come as close as possible using only a limited amount of computational effort. A bounding function is called strong if it in general gives values close to the optimal value for the bounded subproblem, and weak if the values produced are far from the optimum. One often experiences a trade-off between quality and time when dealing with bounding functions: the more time spent on calculating the bound, the better the bound value usually is. In sequential B&B, it is considered beneficial to use a bounding function as strong as possible in order to keep the size of the search tree as small as possible. The use of bounds for the function to be optimized, combined with the value of the current best solution, allows the algorithm to discard parts of the solution space.

1.2.3 Selection rule

The strategy for selecting the next subproblem to investigate usually reflects a trade-off between keeping the number of explored nodes in the search tree low and staying within the memory capacity of the computer used [17]. An extensive mathematical study of the selection strategies can be found in [51].

The selection rule is an important factor for the performance of the designed B&B algorithm. It affects the computing performance and memory requirement, but not the guaranteed convergence to the optimum. The generated paths in the search tree depend on the chosen selection rule. Depending on the instance of the problem, some selection strategies are more efficient than others. The most used selection criteria are breadth-first, depth-first and best-first search. Hybrid methods combine the basic criteria. Considering a minimization problem, the basic criteria are:

Breadth-first search Chooses the node with the best lower bound among those at the least depth of the search tree. To implement this procedure on a computer, a FIFO (First In, First Out) queue can be used conveniently. This strategy is not recommended from the perspective of computing time or memory requirement. The number of nodes at each level grows exponentially with the level, making it infeasible to do a breadth-first search for large problems.

Depth-first search Selects the subproblem with the best lower bound among those at the largest depth, in contrast with breadth-first search. To implement this procedure on a computer, a LIFO (Last In, First Out) stack can be used conveniently. Its main advantage is the small memory requirement. Moreover, good upper bounds of the solution can be obtained quickly. An advantage from the programming point of view is the use of recursion to search the tree. The search method that is most economical from the viewpoint of memory space is depth-first search. A disadvantage of depth-first search is that it tends to take time to exit once it strays into an area of the branching structure where no optimal solution of (1.1) is located. Therefore, the number of subproblems decomposed in the entire computation is usually larger than that realized by other search strategies such as best-first search. Depth-first search puts higher priority on those subproblems deeper in the tree. In this way, a preliminary solution (though it may not be optimal) is usually available even if the computation is interrupted prior to normal termination.

Best-first search The subproblem with the best (smallest) lower bound is selected. Using this method, a subproblem is rarely decomposed unnecessarily. As a disadvantage, good upper bounds of the solution may be obtained only at final steps of the algorithm. Even so, the choice of the subproblem with the current lowest lower bound makes good sense also regarding the possibility of producing a good feasible solution. Memory problems arise if the number of pending subproblems of a given problem becomes too large.

Nodes with the same selection criterion value can be stored using a FIFO policy to reduce the insertion time. Other selection strategies have a second criterion for these cases, for instance based on another value related to the subproblem.

As a hybrid method, one can combine depth-first and best-first searches, applying each of them alternately. First, a depth-first search is made until smaller subproblems in the branch cannot be generated. A subproblem is then selected from the working list following a best-first strategy. From this subproblem, a new depth-first search is initiated, and the process is repeated until the termination of the algorithm. This search criterion intends to obtain the advantages of the methods it is based on.
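As a sketch of how best-first selection with FIFO tie-breaking can be realized, the following C comparator orders subproblems by lower bound and, for equal bounds, by insertion order; the subproblem record is a hypothetical layout assumed for this example.

    /* Hypothetical subproblem record: the lower bound computed by the
       bounding rule plus a running insertion counter for FIFO ties. */
    typedef struct {
        double lower_bound;
        unsigned long seq;   /* increases monotonically on insertion */
    } subproblem;

    /* Best-first order: smaller lower bound first; on equal bounds,
       the subproblem inserted earlier is selected first (FIFO). */
    static int best_first_cmp(const subproblem *p, const subproblem *q)
    {
        if (p->lower_bound < q->lower_bound) return -1;
        if (p->lower_bound > q->lower_bound) return  1;
        return (p->seq < q->seq) ? -1 : (p->seq > q->seq);
    }

Such a comparator can drive a priority queue or a sorted working list; replacing it changes the selection strategy without touching the rest of the algorithm.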

1.2.4 General B&B method

B&B works with an ordered list of subproblems, the working list Λ. The algorithm consists of a sequence of iterations where the basic rules are applied to the data structure Λ and to the subproblems extracted from Λ. The rules select a subproblem from Λ (initially the complete search space), decompose it, and eventually insert the generated subproblems into Λ. A feasible solution is associated with each subproblem, either by solving the subproblem in case it is simple enough or by assigning a solution chosen from the feasible ones in the subproblem (not necessarily the best). The best feasible solution found during the run of the B&B algorithm is an upper bound of the final solution and can be used to reject those generated subproblems which cannot contain a better feasible solution.

The algorithm starts with the data structure in the initial state (Λ, fU), where fU ≥ f∗ represents the initial upper bound on the optimal solution f∗ (possibly infinity), and ends with the final state (∅, fU), where fU ≤ f∗ + ε is an approximate solution. The main objective of the rules is to reduce the search tree as quickly as possible to obtain the best solution. Although the rejection rule is carried out by the pruning of the tree, it is applied to a greater or lesser extent depending on the other rules. A bounding rule that obtains a good lower bound on the possible solution of a subproblem will better characterize the subproblems that can be removed. In order to eliminate subproblems, finding a good upper bound of the solution is also needed. This upper bound is obtained by inspecting the most promising nodes first, since they are the ones expected to provide better solutions.
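A minimal C sketch of this iteration scheme follows, under several simplifying assumptions: a fixed-capacity array stores Λ, selection is best-first by linear scan, branching bisects into at most two children, terminal subproblems are simply discarded rather than stored in a final set, and the problem-specific routines bound(), branch(), terminal() and feasible_value() are only declared, not implemented.

    #include <float.h>
    #include <stddef.h>

    #define MAX_SUBS 100000            /* capacity of the working list (assumed) */

    typedef struct {
        double lb;                     /* lower bound of the subproblem */
        /* problem-specific data (search region, etc.) would go here */
    } sub;

    /* Problem-specific rules; declarations only, for illustration. */
    double bound(const sub *s);              /* bounding rule        */
    int    branch(const sub *s, sub out[2]); /* branching rule       */
    int    terminal(const sub *s);           /* termination rule     */
    double feasible_value(const sub *s);     /* feasible point value */

    double bnb(sub root, double eps)
    {
        static sub L[MAX_SUBS];        /* working list Lambda */
        size_t n = 0;
        double fU = DBL_MAX;           /* best feasible value found so far */

        root.lb = bound(&root);
        L[n++] = root;

        while (n > 0) {
            /* Selection rule: best-first, smallest lower bound. */
            size_t best = 0;
            for (size_t i = 1; i < n; i++)
                if (L[i].lb < L[best].lb) best = i;
            sub s = L[best];
            L[best] = L[--n];

            if (s.lb >= fU - eps) continue;  /* rejection rule */
            if (terminal(&s))     continue;  /* termination rule (would go to Omega) */

            sub child[2];
            int k = branch(&s, child);       /* branching rule */
            for (int c = 0; c < k && n < MAX_SUBS; c++) {
                child[c].lb = bound(&child[c]);
                double v = feasible_value(&child[c]);
                if (v < fU) fU = v;          /* sharpen the upper bound */
                if (child[c].lb < fU - eps)  /* keep only promising children */
                    L[n++] = child[c];
            }
        }
        return fU;   /* on termination, fU <= f* + eps */
    }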

The order of evaluation is determined by the selection rule, which therefore affects the memory requirement of the algorithm. We must distinguish between the total number of inspected subproblems and the maximum number of subproblems stored in the data structure Λ. The difficulty in managing the data structure is directly related to the number of subproblems it contains. The branching rule plays an important role, since the number of branches generated from a search tree node depends on it. On the one hand, generating many branches provides more accurate information, because the generated subproblems are smaller in size. On the other hand, more nodes of the tree must then be inspected in each division. Therefore, the branching rule should aim at reducing the search space without generating an excessive number of nodes in each division.

Non-rejected subproblems reaching the termination rule are stored in the final set Ω. Both Λ and Ω can be filtered, looking for subproblems that can be rejected after the updating of the upper bound of the solution. This filtering can be time-consuming, depending on the size of Λ and Ω and on how the elements are sorted.

1.2.5 Search anomalies in parallel B&B

Regarding the useful work, a parallel version of a given algorithm should perform the same computations or evaluations as the sequential one. Parallel B&B algorithms (tackled in Part I) may suffer from anomalies, because the parallel version can visit a different number of subproblems than the sequential one. Two different types of anomalies can occur [54, 56]:

• Accelerating anomalies, when the number of evaluations done by the parallel version is less than that performed by the sequential version. This occurs when a sharper upper bound fU is found at earlier stages of the parallel algorithm than in the sequential version.
• Detrimental anomalies, when the number of evaluations done by the parallel version is greater than that performed by the sequential version. This occurs when the parallel version visits more branches of the search tree than the sequential one, due to unawareness of the update of the upper bound fU.

These anomalies can be detected by the Search Overhead Factor (SOF) value, defined as the ratio between the work done by the parallel version and that done by the sequential version.

In general, the use of best-first search and its variations leads to fewer anomalies than depth-first search, but requires more memory. A high memory requirement slows the execution down due to data structure management and the speed and size of the different levels of the memory hierarchy in the system. Detrimental anomalies also increase the execution time, because more subproblems are evaluated. Notice that if the global upper bound fU is initiated with the global minimum f∗, the number of evaluated subproblems will be the same independently of the selection rule. Chapters 3 and 5 will refer to these search anomalies.
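In formula form (a restatement of the definition above, with W denoting the number of evaluated subproblems):

    \mathrm{SOF} = \frac{W_{\text{parallel}}}{W_{\text{sequential}}},
    \qquad
    \mathrm{SOF} < 1 \text{ indicates an accelerating anomaly},
    \quad
    \mathrm{SOF} > 1 \text{ a detrimental one.}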

1.3 Description of Dynamic Programming

Dynamic Programming is a useful mathematical technique for making a sequence of interrelated decisions [47]. It provides a systematic procedure for determining the optimal combination of decisions. Dynamic Programming applications can be found in a variety of areas, including optimal control problems, distribution problems, Markovian decision processes, and calculus of variations.

The problem can be divided into stages, with a policy decision required at each stage. Each stage has a number of states associated with any of the situations that may occur in practice. The effect of the policy decision at each stage is to transform the current state into a state of the next stage (possibly according to a probability distribution). The solution procedure is designed to find an optimal policy for the overall problem, i.e., a prescription of the optimal policy decision at each stage for each of the possible states. Given the current state, an optimal policy for the remaining stages is independent of the policy decisions adopted in previous stages. Therefore, the optimal immediate decision depends only on the current state and not on how the system got there. This is the principle of optimality for dynamic programming, introduced by R. Bellman in 1953 to formulate dynamic programming, and stated as follows [8]:

An optimal policy has the property that whatever the initial state and decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

The solution procedure begins by finding the optimal policy for the last stage, which prescribes the optimal policy decision for each of the possible states at that stage. A recursive relationship is available that identifies the optimal policy for stage n, given the optimal policy for stage n + 1. When we use this recursive relationship, the solution procedure starts at the end and moves backward stage by stage, each time finding the optimal policy for that stage, until it finds the optimal policy starting at the initial stage. This procedure is called Backward Induction. The optimal policy immediately yields an optimal solution for the entire problem. For some problems, the solution procedure can move either backward or forward. However, for many problems (especially when the stages correspond to time periods), the solution procedure must move backward.

Dynamic Programming can be deterministic or stochastic. In deterministic Dynamic Programming problems, the state at the next stage is completely determined by the state and policy decision at the current stage. In the stochastic case, the state at the next stage is not completely determined; rather, there is a probability distribution for what the next state will be. However, this probability distribution still is completely determined by the state and policy decision at the current stage. The resulting decision tree for stochastic Dynamic Programming is depicted in Figure 1.4.

Figure 1.4: Stochastic decision tree (a decision x_n in state s_n with value f_n(s_n, x_n) leads, with probabilities p_1, . . . , p_S, to the states 1, . . . , S of the next stage, with values f_{n+1}(1), . . . , f_{n+1}(S))
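Using the notation of Figure 1.4, the stochastic recursion can be written as follows; the immediate-cost term c(s_n, x_n) is an assumption added for completeness, since the figure only shows transition probabilities and next-stage values:

    f_n(s_n, x_n) = c(s_n, x_n) + \sum_{j=1}^{S} p_j \, f_{n+1}(j),
    \qquad
    f_n^{*}(s_n) = \min_{x_n} f_n(s_n, x_n),

where p_j is the probability that decision x_n taken in state s_n leads to state j of the next stage.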

Dynamic Programming is a very useful technique for making a sequence of interrelated decisions. It requires formulating an appropriate recursive relationship for each individual problem. In return, it provides great computational savings over using exhaustive enumeration to find the best combination of decisions, especially for large problems. For example, if a problem has 10 stages with 10 states and 10 possible decisions at each stage, then exhaustive enumeration must consider up to 10 billion combinations, whereas Dynamic Programming requires only a thousand calculations (10 for each state at each stage).
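A minimal backward-induction sketch in C for the deterministic variant of the 10-stage example above; the routines cost() and next(), as well as the zero terminal values, are placeholders invented for this illustration.

    #include <float.h>

    #define STAGES    10
    #define STATES    10
    #define DECISIONS 10

    /* Placeholder problem definition; declarations only. */
    double cost(int stage, int state, int decision);
    int    next(int stage, int state, int decision);  /* next state */

    double backward_induction(int start_state)
    {
        double value[STATES], tmp[STATES];

        for (int s = 0; s < STATES; s++)
            value[s] = 0.0;            /* terminal values (assumed 0) */

        /* Move backward stage by stage. The total work is
           10 * 10 * 10 = 1000 evaluations, instead of the up to
           10^10 combinations of exhaustive enumeration. */
        for (int n = STAGES - 1; n >= 0; n--) {
            for (int s = 0; s < STATES; s++) {
                double best = DBL_MAX;
                for (int d = 0; d < DECISIONS; d++) {
                    double v = cost(n, s, d) + value[next(n, s, d)];
                    if (v < best)
                        best = v;      /* Bellman's principle: f_n(s) */
                }
                tmp[s] = best;
            }
            for (int s = 0; s < STATES; s++)
                value[s] = tmp[s];     /* roll values back one stage */
        }
        return value[start_state];     /* optimal total cost */
    }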

Some decisions need to take uncertainty about future events into account. Stochastic processes are those that evolve over time in a stochastic manner. Markov chains have the special property that probabilities involving how the process will evolve in the future depend only on the present state of the process, and so are independent of events in the past. Stochastic processes are of interest for describing the behaviour of a system operating over some period of time. The conditional probabilities for a Markov chain are called (one-step) transition probabilities. If these probabilities are independent of time, they are said to be stationary; having stationary transition probabilities thus implies that the transition probabilities do not change over time.

For each possible state of the Markov chain, we make a decision about which one of several alternative actions should be taken in that state. The action chosen affects the transition probabilities as well as both the immediate costs (or rewards) and the subsequent costs (or rewards) from operating the system. We want to choose the optimal actions for the respective states, considering both immediate and subsequent costs. The decision process for doing this is referred to as a Markov Decision Process (MDP).

Many optimization problems arising in practice involve sequential decision making and can be formulated as an MDP. An MDP is a general kind of model for stochastic Dynamic Programming where the stages continue to recur indefinitely. Solving MDPs numerically on a commodity computer¹ is, however, restricted to problems that have a relatively small state space (a few million states at most). An optimal solution, often called an optimal strategy or an optimal policy, prescribes a best action to take in each individual state that may occur. Many MDPs arising in practice tend to have too many states, due to the dimensionality of the state space. An optimal solution can then not be computed in reasonable time unless parallel computing is used. Constructing an approximation algorithm requires good intuition and insight into the problem under consideration. The problem often exhibits some structure, which can be very helpful in solving it. A general approach for exploiting the problem structure does not exist for all problems, since the structure may be problem specific. In this thesis, we illustrate how the formulation of a problem as an MDP helps to construct optimal solutions.

1.4 High Performance Computing

High Performance Computing (HPC), also known as parallel computing, is a research field that deals with two aspects: the hardware to process data in parallel (parallel architectures) and the software to exploit all the performance that a parallel machine can offer (parallel programming models).

Parallel computing is a combination of hardware systems, software tools, programming languages and parallel programming paradigms that provides solutions to problems that cannot be solved using a commodity computer. A problem can be intractable on a commodity computer for mainly two reasons: the size of the problem and the execution time. For some problems, the volume of data is greater than the available memory in a commodity computer; big problems can be decomposed into smaller ones and solved in parallel. The other limiting factor is the response time: some problems could take millennia to be solved. A reasonable response time is an important factor, for instance in medicine, where the diagnosis of a disease must be as fast as possible for it to be treated on time. Nowadays, this is possible with the computational power of HPC platforms.

¹A commodity computer is a standard-issue PC that is widely available for purchase.

The ever-increasing demand for computational power and rapid advances in very large-scale integration and communication technology have led to the development of HPC systems. These systems enjoy parallelism at the instruction, task, thread, and program levels. Some of the major areas of research and development are: on-chip networking, mapping and scheduling of tasks, low-power design considerations, parallel programming tools, and multi-core algorithms.

HPC is also an excellent tool to simulate processes that are hard, expensive, slow, or dangerous to carry out in reality, like testing nuclear weapons, building large infrastructures such as bridges, or the behaviour of a hurricane. Researchers can take advantage of this powerful tool and carry out dozens of time-consuming calculations on large amounts of data, obtaining the results in a reasonable time.

Parallel architectures are present in more and more places. Multi-core processors dominate all aspects of computing, ranging from mobile devices to desktops and supercomputers. A smartphone contains a processor with several execution cores, and nowadays it is rare that a vendor offers a computer with a single-core processor. There is no need to have a large budget to benefit from parallel computing: parallel computers can be built from cheap components, and free UNIX-based distributions like Ubuntu (server edition) can be used to deploy a parallel infrastructure.

The rapid growth and wide availability of multiprocessor computers have raised the question of how to design parallel algorithms for all people wishing to process very large data and difficult problems as fast as possible. Given a problem to solve, the first thing to understand is what level of concurrency exists in the problem, i.e., which tasks can be executed simultaneously and which cannot. It may be the case that the problem is not suited at all to the parallel setting, and no or only a very small speedup can be obtained.

To fully exploit the power of a parallel platform, users must increase their knowledge of HPC architectures and of the different possibilities to code an algorithm in parallel. Not all parallel programming models are suitable for all platforms. Depending on the architecture, some models fit the machine characteristics better than others. The developer should choose the right algorithm to suit the architecture.

TOP500 is a twice-yearly ranking of the 500 most powerful supercomputers in the world. Since 1993, parallel machine performance has been measured and recorded with the LINPACK Benchmark. This benchmark measures how fast a computer solves a dense n × n system of linear equations Ax = b, which is a common task in engineering. The TOP500 ranking is a good source of information about what machines there are (or were) and how they have evolved. For the fifth consecutive time, Tianhe-2, a supercomputer developed by China's National University of Defense Technology, retained its position as the world's No. 1 system in the TOP500 list released in July 2015. Tianhe-2, which means Milky Way-2, led the list with a performance of 33.86 petaflop/s (quadrillions of calculations per second) on the LINPACK Benchmark. With 16,000 computer nodes, each comprising two Intel Ivy Bridge Xeon processors and three Xeon Phi coprocessor chips, the system has a total of 3,120,000 cores and 1,375 TiB of main memory [23].

Moreover, in almost all cases, supercomputers have to meet tight power consumption requirements. Energy efficiency in computing is a trending topic nowadays, and parallel computing contributes to this ecological movement by making programs more efficient. An alternative supercomputer ranking is the Green500 list, which ranks the most energy-efficient supercomputers in the world. Instead of using FLOPS (FLoating point Operations Per Second) as the reference, it uses FLOPS per watt.

[Figure 1.5: Memory hierarchy — a pyramid from registers at the top, through the L1, L2, and L3 caches, down to main memory; cost and speed increase towards the registers, while size increases towards main memory.]

Several classification schemes for parallel computers have been defined, but the first and most commonly mentioned was proposed by Flynn in 1972 [27]:

SISD (Single Instruction Single Data): A computer equipped with a single-core processor. Nowadays, this type of processor is not common, even in smartphones or tablets.

SIMD (Single Instruction Multiple Data): A computer whose ISA (Instruction Set Architecture) contains instructions that can process data in parallel. Examples of this type of computer are GPUs and vector processors. As of 2015, most commodity CPUs implement architectures that feature SIMD instructions.

MISD (Multiple Instruction Single Data): Several instructions perform operations on the same data. This architecture is not common; this type of machine appears in fault-tolerant systems.

MIMD (Multiple Instruction Multiple Data): Nowadays, this is the most common architecture, where a system can process different data in parallel.

The memory hierarchy plays an important role in the design and performance of a code. The closer to the CPU the memory modules reside, the more expensive they are and the less storage capacity they have. In Figure 1.5, one can see the different levels that compose this hierarchy. The registers associated with a core appear at the first level. Cache memory (the second level of the pyramid) is divided into two or three levels (L1, L2, and L3), depending on the processor cost. Cache memory can hold both instructions and data; dozens of papers are dedicated to techniques for exploiting this memory in order to speed up a code. At the last level, main memory (also called RAM, Random Access Memory) is present.

HPC systems currently may integrate several resources such as multi-core processors, General-Purpose Graphic Processing Units (GPGPUs) and reconfigurable logic devices, like Field Programmable Gate Arrays (FPGAs) [25].

The terms many-core and massively multi-core are sometimes used to describe multi-core architectures with an especially high number of cores (tens or hundreds). This is the case of accelerating devices such as GPGPUs or Xeon Phi.

According to the different architectures, two main models for parallel algorithm design are presented: the shared-memory model and the distributed-memory model. This classification is based on how the CPUs deal with the available memory.


1.4.1 Shared-memory model

A multi-core processor is a processor with two or more cores or processing units. These cores can share cache memory and even an FPU (Floating Point Unit), as in the case of the AMD Bulldozer architecture. The availability of multi-core CPUs has given new impetus to the shared-memory parallel programming approach [21].

In this model, the memory is visible to all (multi-core) processors. Through shared variables or regions of memory, the processing units can exchange information. Two processing units must not update a shared variable at the same time, as this may produce data inconsistency. To avoid such inconsistency, updates of shared variables must be controlled.

In many systems, shared memory is logically global but physically distributed. This leads to two further sub-classifications based on memory access latency.

UMA (Uniform Memory Access): The latency to access an address in the logical memory space is the same for each CPU.

NUMA (Non-Uniform Memory Access): The latency to access an address in the logical memory space is determined by the physical distance from the CPU.

In shared-memory systems, the use of threads is more appropriate than the use of processes for execution in parallel. A thread is the smallest processing unit that an operating system can schedule: a lightweight process having its own program counter and execution stack. A process can launch several threads to execute different actions at the same time. Threads share a common memory space, open files, etc. An operating-system process has a PID (Process IDentifier) whereas threads do not, because threads live inside a process. A thread is cheaper to manage than a process: creation, destruction, and context switches are more expensive when processes are used. Threaded models are usually associated with shared memory and operating systems.

In shared-memory multiprocessor architectures, threads can be used to implement parallelism. Historically, hardware vendors have implemented their own proprietary versions of threads, making portability a concern for software developers. Nowadays, several possibilities exist when one wants to code a threaded approach.

Pthreads: Pthreads, or POSIX (Portable Operating System Interface) threads, is a standardized C-language threads programming interface, specified by the IEEE POSIX 1003.1c standard. The Pthreads library provides functions for creating and terminating threads. Other functions are intended for ensuring exclusive access to shared memory locations via locks and condition variables. The model is very flexible, but low level. Programmers have to be aware of race conditions and deadlocks when multiple threads access shared data.
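As a minimal illustration of this lock-based discipline, the following sketch protects a shared counter with a Pthreads mutex; the thread count and iteration count are hypothetical. Compile with -pthread.

// Minimal Pthreads sketch: a mutex serializes updates to a shared counter.
#include <pthread.h>
#include <cstdio>

static long counter = 0;                          // shared data
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);                // enter critical section
        ++counter;                                // safe update of shared data
        pthread_mutex_unlock(&lock);              // leave critical section
    }
    return nullptr;
}

int main() {
    const int nthreads = 4;                       // hypothetical thread count
    pthread_t tid[nthreads];
    for (int t = 0; t < nthreads; ++t)
        pthread_create(&tid[t], nullptr, worker, nullptr);
    for (int t = 0; t < nthreads; ++t)
        pthread_join(tid[t], nullptr);
    std::printf("counter = %ld\n", counter);      // 4 * 100000 = 400000
    return 0;
}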

OpenMP: OpenMP (Open Multi-Processing) is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism [16]. The OpenMP API (Application Programming Interface) supports multi-platform shared-memory parallel programming in C/C++ and Fortran. It defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. From version 3 of OpenMP onwards, both data parallelism and task parallelism can be expressed.
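For instance, a loop can be parallelized with a single directive. The sketch below, a toy dot product with a made-up vector size, uses only standard OpenMP features (a parallel for with a reduction clause).

// Minimal OpenMP sketch: a parallel loop with a reduction clause.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;                        // hypothetical problem size
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double dot = 0.0;
    // Iterations are distributed among threads; partial sums are combined.
    #pragma omp parallel for reduction(+ : dot)
    for (int i = 0; i < n; ++i)
        dot += a[i] * b[i];
    std::printf("dot = %.1f (threads available: %d)\n", dot, omp_get_max_threads());
    return 0;
}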

TBB: TBB (Intel Threading Building Blocks) is a template-based library developed in C++ by Intel and widely used for task parallelism [80]. The aim of this library is to facilitate writing code that exploits the parallel features of multi-core processors. The library provides an approach to developing parallel applications in C++, offering scalable memory allocation and task scheduling. The advantage of TBB is that it facilitates developing loop- and task-based applications with high performance and scalability, providing parallel algorithms and parallel data structures.
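A minimal sketch of the TBB style, doubling the entries of a vector with tbb::parallel_for over a blocked range (the vector size is made up):

// Minimal TBB sketch: parallel_for over a blocked range.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1000000;                     // hypothetical problem size
    std::vector<double> v(n, 1.0);
    // TBB splits the range into chunks and schedules them on worker threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
                      [&](const tbb::blocked_range<size_t> &r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              v[i] *= 2.0;
                      });
    std::printf("v[0] = %.1f\n", v[0]);           // prints 2.0
    return 0;
}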

1.4.2 Distributed-memory model

A cluster is a system incorporating both shared and distributed-memory architectures (integrating complete standalone HPC systems). The SMPs (symmetric multiprocessors) are connected by a high-speed network like InfiniBand. Normally, an Ethernet connection is also present, but it is used for maintenance purposes. In this model, each processor has its own memory and there is no shared memory directly accessible by every processor. Processors can communicate only via an interconnection network which connects each processor with the others. For this kind of architecture, the model discussed in the previous section does not apply, because the memory is not shared among the CPUs. A model based on message passing must be adopted instead.

The Message Passing Interface (MPI) is a message-passing library standard based on the consensus of the MPI Forum, which has over 40 participating organizations, including vendors, researchers, software library developers, and users. The goal of MPI is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing message-passing programs. As such, MPI is the first standardized, vendor-independent message-passing library. The advantages of developing message-passing software using MPI closely match the design goals of portability, efficiency, and flexibility. MPI is not an IEEE or ISO standard, but it has in fact become the "industry standard" for writing message-passing programs on HPC platforms.

The MPI standard has gone through a number of revisions, the most recent major version being MPI 3, published in 2012; MPI 3.1 was subsequently released on 4th June 2015, containing minor fixes, changes, and additions compared to MPI 3.0. Although the MPI programming interface has been standardized, actual library implementations differ in which version and features of the standard they support, and the way MPI programs are compiled and run on different platforms also varies. There exist several implementations of MPI, like MPICH, MVAPICH, Open MPI, etc. The Open MPI Project is an open source and freely available MPI implementation that is developed and maintained by a consortium of academic, research, and industry partners [28]. The Open MPI software achieves high performance, and the Open MPI project is quite receptive to community input.
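As a minimal illustration of the message-passing style (valid for any standard MPI implementation), the sketch below lets every rank contribute a value that is combined at rank 0 with a collective operation. Launch with, e.g., mpirun -np 4 ./a.out.

// Minimal MPI sketch: every rank contributes a value; rank 0 prints the sum.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         // this process's identifier
    MPI_Comm_size(MPI_COMM_WORLD, &size);         // number of processes
    int value = rank, sum = 0;
    // Collective communication: combine all values with + at rank 0.
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("sum of ranks 0..%d = %d\n", size - 1, sum);
    MPI_Finalize();
    return 0;
}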

1.4.3 Heterogeneous model

Devices that were originally designed for graphical applications are nowadays used as accelerating devices in general computation. These devices are called GPUs (Graphical Processing Units). Sometimes, the "GP" (General Purpose) acronym is prepended to the GPU acronym: GPGPUs. GPGPU computing can be defined as the use of a graphical processing unit (GPU) in combination with a CPU to accelerate many kinds of applications such as engineering, analysis, and scientific computing. GPGPU computing offers performance improvement by moving the parts of the application with a greater computational load to the GPU and leaving the rest of the code running on the CPU. Instead of having a processor with a few cores, a GPU can have several hundreds of small cores that generally operate at lower frequencies than CPU cores. GPU cores are optimized for floating-point operations.

Today, it is possible to have in a single system one or more host CPUs and one or more GPUs. In this sense, we can speak of heterogeneous systems, and a programming model oriented towards these systems has appeared. The heterogeneous model is expected to become a mainstream approach due to the microprocessor industry's interest in the development of Accelerated Processing Units (APUs). An APU integrates the (multi-core) CPU and a GPU on the same die. This design provides a better data transfer rate and lower power consumption. AMD Fusion and Intel Sandy Bridge APUs are examples of this tendency.

The highly parallel structure of GPUs makes them more effective than general-purpose CPUs for algorithms where the processing of large blocks of data can be done in parallel. This means they are suitable for problems involving matrices or multidimensional vectors. Nevertheless, not all applications match this programming model. Problems that do not map well are generally too small or too unpredictable. Very small problems lack the parallelism needed to use all the threads on the GPU and/or could fit into a low-level cache on the CPU, substantially boosting CPU performance. Unpredictable problems have too many conditional branches, which can prevent data from efficiently streaming from GPU memory to the cores or reduce parallelism by breaking the SIMD paradigm [62, 20]. Examples of these kinds of problems include most graph algorithms (too unpredictable, especially in memory space), sparse linear algebra (although this is bad on the CPU too), small signal processing problems (FFTs² smaller than 1000 points, for example), searching, and sorting.

To exploit the performance of these architectures, a widely used API is CUDA (Compute Unified Device Architecture), developed by NVIDIA. CUDA is both a parallel computing platform and a programming model: it allows software developers to use a CUDA-enabled GPU for general-purpose processing.

OpenCL (Open Computing Language) consists of an API for writing programs that execute across heterogeneous platforms. OpenCL specifies a language (based on C99³) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides parallel computing using task-based and data-based parallelism. It is an open standard maintained by the non-profit technology consortium Khronos Group.

OpenACC (Open ACCelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia, and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems. accULL is a research implementation of the OpenACC standard with support for CUDA and OpenCL devices [81]. It is composed of a compiler driver based on yacf (a Python compiler framework for C) and a runtime environment called Frangollo. accULL has been developed by the High Performance Computing Group of the University of La Laguna (Spain).

²Acronym of Fast Fourier Transform.
³C99 (ISO/IEC 9899:1999) is a former version of the C programming language standard. The C11 version of the C programming language standard, published in 2011, replaces C99.

1.4.4 Hybrid model

A pure MPI code is not necessarily the best approach to obtain the maximum performance [85]. For the code to scale to a larger number of cores, several approaches exist. One is to combine MPI with a threaded model, such as OpenMP or TBB, which has load-balancing capabilities, reducing the intra-node imbalance.

Combining the shared-memory and distributed-memory programming models is not a new idea. The goal is to exploit the strengths of both models: the efficiency, memory savings, and ease of programming of the shared-memory model with the scalability of the distributed-memory model. Rather than developing new runtimes or languages, we can rely on mixing the already available programming models and tools. This approach is known as hybrid (parallel) programming, a modern software trend for current hybrid hardware architectures. The basic idea is to use message passing (usually MPI) across the distributed nodes and shared memory (usually OpenMP or Pthreads) within a node. Hybrid programming can also involve the use of GPUs as a source of computing power [21]. Many possibilities arise when combining different programming models: MPI can be combined with Pthreads or OpenMP, and CUDA can be combined with MPI, Pthreads, or OpenMP.
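A minimal sketch of this hybrid style, assuming an MPI implementation with MPI_THREAD_FUNNELED support and an OpenMP-capable compiler: MPI connects the processes across nodes, while OpenMP parallelizes the work inside each process.

// Minimal hybrid MPI+OpenMP sketch: one MPI process per node,
// several OpenMP threads inside each process.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    int provided = 0;
    // Ask for FUNNELED support: only the main thread makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    long local = 0;
    // Intra-node parallelism: threads share the process's memory.
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < 1000000; ++i)
        local += i % 7;                           // toy workload
    long global = 0;
    // Inter-node parallelism: combine the per-process results.
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("global = %ld\n", global);
    MPI_Finalize();
    return 0;
}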

PGAS (Partitioned Global Address Space) is a programming model that offers HPC programmers an abstracted shared address space, which simplifies programming, while exposing data/thread locality to enhance performance. This can facilitate the development of productive programming languages that reduce the solution time, i.e., both development time and execution time.

1.4.5 Computing infrastructure used in this thesis

BullX-UAL is a cluster that belongs to the TIC-146 research group, whose acronym is HPCA (High Performance Computing: Algorithms). This infrastructure has been used to carry out the experiments on the algorithms developed for this Ph.D. thesis. The cluster has a total of 18 nodes. A node consists of two eight-core 2.00 GHz Intel Xeon E5-2650 (Sandy Bridge) processors.

Figure 1.6 shows the memory schema of a BullX-UAL node. The memory map has been generated by the tool lstopo within the hwloc library [12]. A node has a total of 64 GB of main memory, divided into two NUMA regions. Each memory region consists of 32 GB of main memory and a socket of eight cores. The eight cores share an L3 cache of 20 MB. Each core has an individual L2 cache of 256 KB and an L1 cache of 64 KB (32 KB for instructions and 32 KB for data).

1.4.6 Parallel performance measurement

The goal is to use p processors to make a code run p times faster. Speedup is the factor by which the program's speed improves when the number of processors is increased:

\[ S_p = \frac{T_1}{T_p}, \qquad (1.3) \]

where T_1 is the best sequential wall-clock time and T_p is the parallel wall-clock time using p processing units. The speedup is the ratio of the running time of the fastest known sequential implementation to the parallel running time. The relative speedup is easier to measure, because it uses the time required by the parallel algorithm running on a single core as the sequential time. However, note that the relative speedup can significantly overstate the actual usefulness of the parallel algorithm, since the sequential algorithm is usually faster than the parallel algorithm run on a single processor [71].

Sometimes p processors can achieve a speedup greater than p as a result of an inefficient sequential algorithm. In other cases, this anomaly can legitimately occur because of cache and memory effects: more processors typically also provide more memory/cache, so the total computation time decreases due to fewer page/cache misses.

Efficiency is a performance metric defined as:

\[ E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}. \qquad (1.4) \]

[Figure 1.6: Memory map for a BullX-UAL node, generated with lstopo — two 32 GB NUMA nodes, each with an eight-core socket sharing a 20 MB L3 cache; every core (PU P#0–P#15) has a 256 KB L2 cache and 32 KB L1d/L1i caches.]

E_p is a value, typically between zero and one, estimating how well utilized the processors are in solving the problem, compared to how much effort is wasted in parallel overhead such as communication and synchronization.
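As a hypothetical worked instance of (1.3) and (1.4): suppose the best sequential run takes T_1 = 100 s and the parallel run on p = 8 cores takes T_8 = 14 s. Then

\[ S_8 = \frac{T_1}{T_8} = \frac{100}{14} \approx 7.14, \qquad E_8 = \frac{S_8}{8} \approx 0.89, \]

i.e., about 89% of the ideal linear speedup is attained.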

The execution time depends on what the program does. A parallel program spends time on work, synchronization, communication, and the extra work (overheads) to handle the parallelization. A program implemented for a parallel machine is likely to do more work than a sequential program, even when running on a single-processor machine.

Parallel overheads are mainly due to communications, although there exist other causes such as memory management and the inclusion of new code to handle the parallelism. Inter-node communications are the most time consuming. Communications within a node depend on the node memory design: a UMA design gives better performance than a NUMA design. Consequently, one of the goals of a parallel algorithm is to achieve a trade-off between reducing the parallel overhead and keeping the cores busy doing useful work.

All parallel programs contain both parallel and serial regions. Amdahl's law says that the performance improvement obtained by improving one component is limited by the fraction of time that component is used. In parallel computing terms, the speedup of an algorithm is limited by the fraction of the code that can be parallelized.

In the case of parallelization, Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e., benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is

\[ S(N) = \frac{1}{(1 - P) + \frac{P}{N}}. \qquad (1.5) \]
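A hypothetical worked instance of (1.5): with a proportion P = 0.95 of parallelizable code and N = 16 processors,

\[ S(16) = \frac{1}{(1 - 0.95) + \frac{0.95}{16}} = \frac{1}{0.05 + 0.059375} \approx 9.14, \]

and no matter how many processors are added, the speedup can never exceed 1/(1 − P) = 20.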

Serial sections limit the parallel performance. Factors that affect the parallel performance are: load balancing, granularity, communication patterns, data dependency, and synchronization.

Load balancing refers to the practice of distributing work among processes so that all are kept busy during the complete running time; it tries to minimize process idle time. For example, if all processes have a barrier synchronization point, the slowest process will determine the overall performance. The load balance depends on a good data decomposition, i.e., dividing up the problem data so that each processor has an even distribution of work to compute. Data decomposition strategies are static decomposition and dynamic decomposition. Static decomposition is not advisable when the pending workload of the elements is not known beforehand; dynamic load balancing strategies are then needed.

To measure the effectiveness of the load balancing strategy, the Relative Load Imbalance (RLI) ratio can be used [82]. Let W_tot be the total amount of computation performed by p working units and W_i, i = 0, ..., p − 1, the amount of computation performed by process i, such that \sum_{i=0}^{p-1} W_i = W_tot. Let W_max = max_i W_i; then the RLI is defined as

\[ \mathrm{RLI} = 1 - \frac{W_{\mathrm{tot}}}{p\,W_{\max}}. \qquad (1.6) \]

RLI takes values in the interval [0, 1 − 1/p]; a value close to zero indicates a small load imbalance. For the sake of simplicity, the value of RLI shown in the sequel is normalized to the range [0, 100].
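A small sketch of how (1.6) could be computed in C++ from per-process work counts (the workloads below are made up):

// Compute the Relative Load Imbalance (RLI) of Equation (1.6),
// normalized to [0, 100] as in the text.
#include <cstdio>
#include <vector>
#include <algorithm>
#include <numeric>

double rli_percent(const std::vector<double> &work) {
    const double wtot = std::accumulate(work.begin(), work.end(), 0.0);
    const double wmax = *std::max_element(work.begin(), work.end());
    const double p = static_cast<double>(work.size());
    return 100.0 * (1.0 - wtot / (p * wmax));
}

int main() {
    // Hypothetical per-process workloads (e.g., evaluated subproblems).
    std::vector<double> work = {120, 100, 95, 85};
    std::printf("RLI = %.1f%%\n", rli_percent(work));  // 0% means perfect balance
    return 0;
}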

Granularity is the relationship between the amount of computation and communication. It is a measure of how much work gets done before processes have to communicate. Maximum parallelization does not mean maximum speedup. Parallelization requires effort to decompose the problem and to deal with communication overhead. It is essential to reach an appropriate grain size to maintain a trade-off between computing and communication.

Parallelism overhead includes the cost of starting a thread or process, the cost of communicating shared data, and the cost of synchronizing, among others. An algorithm usually needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work for the parallel architecture at hand.

Synchronization is the use of language or library mechanisms to constrain the ordering (interleaving) of instructions performed by separate threads, to preclude orderings that lead to incorrect or undesired results [71]. It is a critical design consideration for most parallel programs and can be a significant factor in program performance (or lack of it). Shared-memory implementations of synchronization can be categorized as busy-wait (spinning) or scheduler-based. The former actively consume processor cycles until the running thread is able to proceed. The latter deschedule the current thread, allowing the processor to be used by other threads, with the expectation that future activity by one of those threads will make the original thread runnable again.

Condition synchronization forces a thread to wait, before performing an operation on shared data, until some desired precondition is true. Examples of such synchronization mechanisms are barriers, locks, and semaphores.

A synchronization barrier, used to separate phases, guarantees that no thread continues to phase n + 1 until all threads have finished phase n. Given the importance of barriers in scientific applications, one can find contributions in the literature that reduce the overall wait time in a barrier [32, 33, 84].

A lock or semaphore is used to serialize (protect) access to global data or a section of code. Only one thread at a time may use (own) the lock. Programming discipline commonly ensures this property by associating data with locks. If a shared variable is updated very frequently, the lock can become a bottleneck.

Contention for shared resources such as memory and the interconnection network limits the performance of shared-memory multiprocessors [86].

1.5 Research questions

The investigation of this thesis started with the premise that HPC may be a useful tool to make B&B algorithms and Dynamic Programming tractable for hard-to-solve problems. The main question is how to develop parallel versions that are efficient on modern platforms.

The investigation of this main question uses a case study approach where additional questions are posed on implementations of algorithms. In Part I, several cases were studied on Global Optimization B&B: one case on blending problems and another on simplicial B&B for multidimensional Global Optimization. In Part II, a case on Traffic Control has been investigated. For all cases, specific algorithms have been designed and their parallelization on modern platforms has been investigated.

Blending problems consist of designing a product given a set of raw materials. The goal is to design a product with the minimum cost, using the minimum number of raw materials. This problem is considered a multi-objective Global Optimization problem, because we have two objective functions to minimize: the cost and the number of raw materials used. The investigated problem is the design of two products that share raw materials: the bi-blending problem. A practical complication arises when the two products share the same raw materials and the amount of raw material is limited (scarce). The research question, which focuses on this problem, is as follows:


Research Question 1 – Which considerations must be taken into account to design a B&B algorithm to solve the bi-blending problem with scarce raw materials?

This question is investigated in Chapter 2 by exploiting the mathematical characteristics of the problem and designing several algorithms.

After having investigated the bi-blending problem, a remaining challenge is that the solution time increases fast with the size of the problem. When the number of ingredients is seven, the B&B procedure takes a long time compared to recipes with five or three ingredients. Due to weak lower bounds, once the B&B algorithm designed in Chapter 2 has finished, the remaining search area that cannot be discarded contains subspaces that are not feasible. An exhaustive test to find the infeasible subspaces must be carried out. This is a long process when performed sequentially. The research question, which focuses on this problem, is as follows:

Research Question 2 – Which HPC techniques can be applied to the B&B algorithm and the subsequent filtering to speed up the response time?

This question is investigated in Chapter 3. Several strategies have been applied using Pthreads.

The mixture design problem involves the use of simplicial partitioning. This is mostly done by bisecting the longest edge of sub-simplices. Our investigation found that the resulting search tree depends on the choice of the longest edge in the bisection process. Additionally, the order in which the simplices are evaluated plays an important role in the performance of the algorithm. Efficient memory management can reduce the execution time of the algorithm. Two logical questions for solving Multidimensional Global Optimization problems are the following:

Research Question 3a – How does the Longest Edge selection strategy, used in the branching rule, affect the efficiency of the B&B algorithm?

Research Question 3b – How does the search strategy, used in the selection rule, affect the efficiency of the B&B algorithm?

These questions have been empirically investigated in Chapter 4, where two sets of Global Optimization functions, 20 functions in total, have been used to experimentally analyse the influence of the above-mentioned B&B rules.

When the accuracy imposed to solve this problem is high, the number of evaluated subspaces is large, as is the computational time. The evaluation of the subspaces can be performed in parallel. The research question, which focuses on this problem, is as follows:

Research Question 4 – Which models are appropriate to efficiently map B&B algorithms on a cluster architecture?

This question has been investigated in Chapter 5 using an HPC platform with the desired characteristics, namely BullX-UAL.

The second part of the thesis deals with the question of implementing Dynamic Programming algorithms on HPC platforms and its challenge due to the curse of dimensionality. Therefore, the thesis studied the case of the generation of so-called Traffic Control Tables to manage the vehicle flows at an intersection. This problem can be modelled as a Markov Decision Process. An investigated research question for this specific model is:

Research Question 5 – What is the advantage, in terms of the reduction of the average number of waiting cars, if the control makes use of the arrival information in the model?


Chapter 6 investigates this question based on analysing two infrastructures and solving them by Backward Induction.

The curse of dimensionality provides a challenge as the memory requirement exceeds the capacity of a commodity computer. This is an example where the use of a supercomputer is needed, not only to reduce the response time of the algorithm, but also to solve instances with a high memory requirement. The research question, which focuses on this problem, is as follows:

Research Question 6 – What is the best way to handle a large number of states in parallel?

This question is investigated in Chapter 7.

The character of the challenges and questions described here is twofold. From a mathematical viewpoint, the idea is to develop an efficient algorithm. From a computer science viewpoint, the idea is to fully exploit the parallel performance that the architecture at hand can achieve.


Part I

Branch and Bound



Chapter 2

Branch and Bound applied to the bi-blending problem

The mixture design problem for two products concerns simultaneously finding two recipes of a blending problem with linear, quadratic, and semi-continuity constraints. A solution of the blending problem minimizes a linear cost objective and an integer-valued objective that keeps track of the number of raw materials used by the two recipes, i.e., this is a bi-objective problem. Additionally, the solution must be robust. We focus on possible solution approaches that provide a guarantee to solve bi-blending problems with a certain accuracy, where two products use (partly) the same scarce raw materials. The bi-blending problem is described and a search strategy based on Branch and Bound (B&B) is analysed. Specific tests are developed for the bi-blending aspect of the problem. The whole is illustrated numerically.

2.1 Introduction

Finding a cheap robust recipe for a blending problem that satisfies quadratic design requirements is a hard problem. In practice, companies also deal with so-called multi-blending problems, where the same raw materials are used to produce several products. Descriptions of practical cases can, among others, be found in [6, 11]. This complicates the search process for feasible and optimal robust solutions if we intend to guarantee the optimality and robustness of the final solutions.

Section 2.1.1 describes the blending problem. Section 2.1.2 defines the blending problem to obtain two mixture designs (bi-blending).


2.1.1 Blending problem

The blending problem is the basis of our study of bi-blending. The considered blending problem is described in [41] as a Semi-continuous Quadratic Mixture Design Problem (SQMDP). Here we summarize the main characteristics of the blending problem.

In the fodder and chemical industries, among others, raw materials are put together and processed into end products. The simplest process is to mix, or blend, the components (raw materials), as done in a cocktail bar or when composing financial products. The mathematical description concerns variables x_i representing the fraction of component i in a recipe x. The set of possible mixtures is mathematically defined by the unit simplex

\[ S = \left\{ x \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1;\ x_i \ge 0 \right\}, \]

where n denotes the number of raw materials.

In mixture design (blending) problems, the objective is to find a recipe x that minimizes the cost of the material, f(x) = c^T x, where vector c gives the cost of the raw materials. In practical situations, such problems are solved on a daily basis in industry, where requirements are often modelled by linear inequalities; see e.g. [6, 11, 89]. Due to a large project on product design, a study was done on how to deal with quadratic requirements and semi-continuity, and on how to generate robust products; see [13, 41].

Not only the cost of the material should be minimized, but also the number of raw materials in the mixture x, given by \sum_{i=1}^{n} \delta_i(x), where

\[ \delta_i(x) = \begin{cases} 1 & \text{if } x_i > 0, \\ 0 & \text{if } x_i = 0. \end{cases} \]

The semi-continuity of the variables is due to a minimum acceptable dose (md) that the practical problems reveal, i.e., either x_i = 0 or x_i ≥ md. Figure 2.1 shows a graphical example of the search space in 2D (left-hand side) and 3D (right-hand side), consisting of unit simplices after removing the space where the minimum dose constraint is not satisfied. The number of resulting sub-simplices (faces) is

\[ \sum_{t=1}^{n} \binom{n}{t} = 2^n - 1, \qquad (2.1) \]

where t denotes the number of raw materials in each sub-simplex. All points x in an initial simplex P_u, u = 1, ..., 2^n − 1, are mixtures of the same group of raw materials. The index u, representing the group of raw materials corresponding to initial simplex P_u, is determined by

\[ u = \sum_{i=1}^{n} 2^{i-1}\,\delta_i(x), \quad \forall x \in P_u. \qquad (2.2) \]
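As a hypothetical worked instance of (2.1) and (2.2), take n = 3 raw materials. Then there are

\[ \sum_{t=1}^{3} \binom{3}{t} = 3 + 3 + 1 = 2^3 - 1 = 7 \]

initial sub-simplices, and a mixture x = (0.5, 0.5, 0), which uses raw materials 1 and 2, belongs to the face with index u = 2^0 · 1 + 2^1 · 1 + 2^2 · 0 = 3.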

Recipes have to satisfy certain requirements. For relatively simple blending problems, the bounds or linear inequality constraints

\[ h_i(x) \le 0; \quad i = 1, \ldots, l, \qquad (2.3) \]

define the design space X ⊂ S.


[Figure 2.1: 2D and 3D simplices removing the minimum dose region — axes x1, x2 (and x3), marking the one- and two-raw-material faces and the minimum dose md.]

In practice, however, constraints are not only linear. The difficulty increases with the dimension and with the existence of quadratic inequalities that represent product specifications. Quadratic constraints are written as

\[ g_i(x) = x^T A_i x + b_i^T x + d_i \le 0; \quad i = 1, \ldots, m, \qquad (2.4) \]

in which A_i is a symmetric n × n matrix, b_i an n-vector and d_i a scalar. The quadratic feasible space is defined as

\[ Q = \{ x \in S : g_i(x) \le 0;\ i = 1, \ldots, m \}. \]

Mixture x is called (linearly and quadratically) feasible if (2.3) and (2.4) apply.

For the practical application of producing the recipes to manufacture the product, so-called ε-robustness is desired with respect to the quadratic requirements, in order not to be "out of specification" as soon as a product is composed from recipe x. One can define the robustness R(x) of a design x ∈ Q with respect to Q as

\[ R(x) = \max \{ R \in \mathbb{R}_+ : (x + r) \in Q,\ \forall r \in \mathbb{R}^n,\ \|r\| \le R \}. \qquad (2.5) \]

Notice that the fact that a product x has a certain (positive) robustness implies that x ∈ Q. Keep in mind that for mixture problems, x + r is projected onto the unit simplex. In [42], an analytical expression of the robustness for linear mixture design problems is given, and it is shown that, for quadratic inequalities, such an expression does not exist. Lower bounds R^L(x) are derived that can be used in a B&B algorithm; this means one can identify areas where an ε-robust solution cannot be located.

As analysed, the problem of finding the best robust recipe becomes a Global Optimization (GO) problem for which a guaranteed optimal solution is hard to obtain. Like many GO problems, blending problems can have several local optima. The feasible area may be nonconvex and even consist of several compartments.

In this chapter, our focus is on finding the best recipes if one wants to design two mixture products that share raw materials: the so-called bi-blending problem. Due to capacity constraints and the availability of raw materials, the recipe that looks best for one product is not necessarily the best when designing both products simultaneously.


2.1.2 Bi-blending problem

Companies use raw materials to make products. In fact, they often make several products using (partly) the same raw materials. Each of the products has its own demand and quality requirements consisting of design constraints. In industry, manufacturers sometimes face a shortage of raw materials for the products they want to make. This requires a bigger dosage of other ingredients, so the optimal solution for one product is not always the overall optimum. The scarcity of raw materials is described by capacity constraints.

A usual way to describe the bi-blending problem is to identify an index j for each product, with demand D_j. The amount of available raw material i is given by B_i. Now, the main decision variable x_{i,j} is the fraction of raw material i in the recipe of product j. Let x_{*,j} represent column j of the matrix x of decision variables. Then we have linear restrictions per product, x_{*,j} ∈ X_j; quadratic requirements per product, x_{*,j} ∈ Q_j; and the corresponding robustness R_j(x_{*,j}).

In principle, all final products can make use of all n raw materials; x_{*,j} ∈ R^n, j = 1, 2. This means that x_{i,1} and x_{i,2} denote fractions of the same ingredient for products 1 and 2. In practice, there is a preselection of which raw materials can be used for each final product. An alternative description is to have a different number and type of raw materials n_j per final product and to define, with index sets, which of the raw materials are shared.

The main restriction that gives the bi-blending problem its "bi" character is the set of capacity constraints:

\[ \sum_{j=1}^{2} D_j\, x_{i,j} \le B_i; \quad i = 1, \ldots, n. \qquad (2.6) \]
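As a hypothetical numerical instance of (2.6): with demands D_1 = 100 and D_2 = 50 units, fractions x_{i,1} = 0.3 and x_{i,2} = 0.2 of raw material i, and availability B_i = 45,

\[ D_1 x_{i,1} + D_2 x_{i,2} = 100 \cdot 0.3 + 50 \cdot 0.2 = 40 \le 45 = B_i, \]

so this pair of recipes respects the availability of raw material i.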

Adding the capacity constraints to the problem, the cost function takes the two products into account with their individual demands. The cost function of the bi-blending problem can be written as

\[ F(x) = \sum_{j=1}^{2} D_j\, f(x_{*,j}). \]

Redefining the optimization criterion on the number of distinct raw materials when two mixtures x_{*,1} and x_{*,2} share ingredients, the function to minimize is

\[ \omega(x) = \sum_{i=1}^{n} \delta_i(x_{*,1}) \vee \delta_i(x_{*,2}), \qquad (2.7) \]

where ∨ denotes the bitwise or operation. For instance, if product 1 uses raw materials {1, 3} and product 2 uses {3, 4}, then ω(x) = 3.

We are dealing with the minimization of the number of raw materials and the cost of the mixture for both products. Figure 2.2 shows how an ε-robust solution is dominated by another ε-robust solution with less cost and a less than or equal number of raw materials. The so-called Pareto front consists of the minimum costs F^⋆_p for each number of raw materials p = 1, ..., n. The solutions of the problem consist of the corresponding Pareto-optimal bi-blending recipe-pairs x^⋆_p.

The Quadratic Bi-Blending problem (QBB) can be defined as follows:

\[
\begin{array}{ll}
\min & F(x),\ \omega(x) \\
\text{s.t.} & x_{*,1} \in X_1 \cap Q_1, \quad x_{*,2} \in X_2 \cap Q_2 \\
 & R_j(x_{*,j}) \ge \varepsilon; \quad j = 1, 2 \\
 & \sum_{j=1}^{2} D_j\, x_{i,j} \le B_i; \quad i = 1, \ldots, n
\end{array}
\qquad (2.8)
\]


[Figure 2.2: Rejection by domination, Pareto optimality — cost F versus number of raw materials p, marking non-optimal solutions, a new trial, points dominated by solutions with 4 and 5 raw materials, and the best ε-robust solution found.]

2.2 Algorithm for finding a solution

Local optima of the QBB problem without robustness can be obtained by using standard software like GAMS/BARON or MATLAB solvers. Robustness cannot be modelled in a straightforward way; only lower bounds can be generated. Moreover, the problem inherits the multi-extremal character of the single quadratic blending problem as described in [13]. Solving the QBB problem in an exhaustive way (the method obtains all global solutions with the established precision) requires the design of a specific B&B algorithm.

We are interested in methods that find solutions x_p of the bi-blending problem up to a guaranteed accuracy, e.g.

\[ F(x_p) - F^\star_p \le \delta. \qquad (2.9) \]

Theoretically, this can be accomplished by sampling densely up to an α-accuracy in both spaces by taking the step size α small enough. As discussed in [13], due to F being linear in x, one can generate a regular grid where all points are less than α apart, which is not very efficient. A B&B method will be used in order not to generate all the points and therefore be more efficient.

We will denote a simplex by C_k, where k determines the order in which the simplex was generated and t_k specifies the number of raw materials of C_k. In the algorithm, all t_k vertices of a simplex C_k, denoted by v_{k,s}, s = 1, ..., t_k, are evaluated; i.e., the values of the quadratic constraints g_{i,j}(v_{k,s}), i = 1, ..., m_j, the linear constraints h_{i,j}(v_{k,s}), i = 1, ..., l_j, and the cost value f(v_{k,s}) are determined (line 8 of Algorithm 2.1). Global upper bound values F^U_p, p = 1, ..., n, are updated by the sum of the cost values of a pair of feasible and robust vertices, one for each product. The algorithm keeps track of F^U_p, which is initially set to infinity. If a new (linearly) feasible and ε-robust vertex is generated by dividing a simplex (line 14), it is combined with all feasible and ε-robust vertices of the other product in order to find a pair of vertices that satisfies the capacity constraints (2.6) and improves the Pareto front F^U_p (line 16). Algorithm 2.1 returns the Pareto vector F^U_p and the corresponding mixtures x_p, p = 1, ..., n.

Algorithm 2.1 Branch and Bound algorithm for the QBB problem

 1: Set ns := 2 × (2^n − 1)                            ⊲ Number of simplices
 2: Set the working list Λ1 := {C_1, ..., C_{2^n−1}}
 3: Set the working list Λ2 := {C_{2^n}, ..., C_ns}
 4: Set the final lists Ω1 := ∅ and Ω2 := ∅
 5: Set j := 1
 6: while Λ1, Λ2 ≠ ∅ do
 7:   Select a simplex C = C_k from Λj                 ⊲ Selection rule
 8:   Evaluate C
 9:   Compute f^L(C) and b^L_i(C), i = 1, ..., n       ⊲ Bounding rule
10:   if C cannot be eliminated then                   ⊲ Rejection rule
11:     if C satisfies the termination criterion then  ⊲ Termination rule
12:       Store C in Ωj
13:     else
14:       Divide C into C_{ns+1}, C_{ns+2}             ⊲ Branching rule
15:       if a new vertex is evaluated then
16:         Check (and update) F^U_p and x_p
17:       end if
18:       C := arg min{f^L(C_{ns+1}), f^L(C_{ns+2})}   ⊲ Select the cheapest simplex
19:       Store {C_{ns+1}, C_{ns+2}} \ {C} in Λj
20:       ns := ns + 2
21:       Go to line 8
22:     end if
23:   end if
24:   j := (j mod 2) + 1                               ⊲ Alternate product
25: end while
26: return x_p, F^U_p, p = 1, ..., n, and Ωj, j = 1, 2

Algorithm 2.1 starts by generating the initial set of 2^n − 1 sub-simplices for each product j, as a partitioning of the search space (see Equation (2.1)), resulting from removing the minimal dose region from the original simplex (see Figure 2.1). These initial sets of simplices are stored in the working lists Λj, j = 1, 2 (lines 2 and 3). While the working lists are not empty, a simplex C_k from Λj is selected and evaluated (lines 7 and 8). If C_k cannot be eliminated (line 10) and does not satisfy the termination criterion (line 11), it is bisected. Of the two generated simplices, the more expensive one is stored in the working list Λj while the algorithm proceeds with the cheapest one (depth-first search). Those simplices which satisfy the termination criterion are stored in the final list Ωj, which determines the set of all simplices where the global ε-robust optimal mixture can be located (if any). The algorithm stores the best pair x_p of feasible and ε-robust mixtures found for each number of raw materials p = 1, ..., n, as output of the algorithm. The algorithm alternates between the lists Λj, j = 1, 2 (line 24) when C_k is stored in the final list Ωj or rejected.

In the next subsections, a detailed description of the rules of Algorithm 2.1 is provided.


[Figure 2.3: Example of division according to the raw material costs — an equilateral simplex with vertices costing $100, $200, and $300, and new vertices labelled 1 and 2 generated by bisection.]

2.2.1 Branching rule

The branching rule applied in Algorithm 2.1 is the following: given a simplex C_k with t_k ≥ 2, its longest edge is bisected, generating two new simplices. Using this branching rule, the length of the longest edge is at most twice the length of the shortest edge; therefore, simplices never get a needle shape [18, 48]. If all edges are of equal length, the edge connecting the cheapest and the most expensive vertex is bisected. Consequently, the branching rule generates an expensive and a cheap simplex. This helps the selection rule (Section 2.2.4), which gives higher priority to "cheaper" simplices. Notice that this branching rule does not always generate a new point, because the generated vertex can be shared by different simplices and may have been evaluated already.

Figure 2.3 shows an example of the branching rule of Algorithm 2.1. First, the vertex labelled 1 is generated, taking into account the value of the cost at the vertices of the equilateral simplex. Vertex 2 is generated by bisecting the longest edge. Dashed lines represent future subdivisions that end in a vertex (drawn as a square) which is shared by four simplices. A discussion of lower and upper bounds on the number of simplices generated in the worst case by bisection can be found in [13].

2.2.2 Bounding rule

The lower bound of the objective cost function f on C_k is obtained as

\[ f^L_k = \min_{s=1,\ldots,t_k} f(v_{k,s}) \]

due to the linearity of f and the convexity of C_k.

The amount of raw material i used by mixtures x ∈ C_k is bounded below by

\[ b^L_i(C_k) = \min_{x \in C_k} x_i, \qquad (2.10) \]

which is attained at one of the vertices x = v_{k,s}, s = 1, ..., t_k, due to the linearity of b and the convexity of C_k. These lower bounds are used by the rejection rule as described in Section 2.2.5.


2.2.3 Termination rule

A simplex is not divided further when its size is smaller than the value of the accuracy α. The size is given by the length of its longest edge, i.e.,

\[ \mathrm{Size}(C_k) = \max_{v, w \in C_k} \|v - w\|. \qquad (2.11) \]

If Size(C_k) ≤ α and C_k has not been rejected by the elimination rule, it is saved in the final list Ω_j. Algorithm 2.1 finishes when the working lists Λ_j, j = 1, 2, are empty.

2.2.4 Selection rule

The selection rule has been designed to achieve two goals: to facilitate discarding simplices with a large number of raw materials and to reduce the memory requirement. The first goal is met by giving priority to simplices with a smaller number of raw materials and lower cost. Simplices with more raw materials and/or a higher cost value can be dominated by those with fewer raw materials, which improves the efficiency of the Pareto test (see Section 2.3.2). In the algorithm, the cost priority of a simplex is measured by the sum of the costs at its vertices, i.e.,

\[ \mathrm{Priority}(C_k) = \sum_{s=1}^{t_k} f(v_{k,s}), \qquad (2.12) \]

where lower values of (2.12) denote more attractive simplices.

The second goal is met by applying a depth-first selection rule. Once a simplex is selected and divided, its cheapest child is selected as the next simplex until no further subdivision is allowed. This reduces the memory requirement of the algorithm.

2.2.5 Rejection rule

A simplex C_k is discarded if it is proven not to contain a mixture that fulfils the following requirements:

• Linear feasibility.
• ε-robustness, implying quadratic feasibility.
• Availability of a sufficient amount of raw material.
• Pareto optimality.

The first two conditions are checked using the tests taken from [13, 41] and are outlined below. Two additional tests, to reject simplices according to the available amount of raw materials and to Pareto optimality, are elaborated in Section 2.3. The order in which these tests are applied does not affect the final solution, but it may influence the efficiency in terms of the computational effort (time).

Linear infeasibility

If for one of the linear constraints h_i(x) ≤ 0 all vertices are infeasible (h_i(v_{k,s}) > 0, s = 1, ..., t_k), then C_k does not fulfil h_i(x) ≤ 0.


Quadratic infeasibility

Given a simplex C_k ∈ Λ_j and the set of quadratic constraints g_i(x) ≤ 0, i = 1, ..., m_j, C_k can be rejected if ∀x ∈ C_k ∃i : g_i(x) > 0.

Around each vertex v_{k,s}, s = 1, ..., t_k, a so-called infeasibility sphere B_{k,s} is defined:

\[ B_{k,s} = \{ x \in \mathbb{R}^n : \|x - v_{k,s}\|^2 < \rho_{k,s}^2 \}. \]

B_{k,s} cannot contain a feasible point. Different ways of calculating the values of the radii ρ_{k,s} are described in [13].

If none of the vertices of C_k is feasible, the spheres can be used to prove that C_k cannot contain a feasible quadratic solution. The following tests to prove the infeasibility of simplex C_k are used by the algorithm:

SCTest (Single Cover Test). One of the spheres B_{k,s} with radius ρ_{k,s}, s = 1, ..., t_k, covers the simplex completely, i.e., C_k is proved infeasible if there exists a vertex v_{k,s} such that ρ_{k,s} > max_z ‖v_{k,s} − v_{k,z}‖, z ≠ s, z = 1, ..., t_k.

MCTest (Multiple Cover Test). A simplex C_k which is not covered by a single sphere B_{k,s} can still be covered by ∪_{s=1}^{t_k} B_{k,s}. In [13], it is proven that if a mixture x ∈ C_k is covered by all spheres, i.e., x ∈ ∩_{s=1}^{t_k} B_{k,s}, then C_k ⊂ ∪_{s=1}^{t_k} B_{k,s}. This means that all the possible mixtures in C_k are covered by at least one sphere B_{k,s} and consequently C_k is infeasible.

PCTest. If the SCTest and MCTest fail, Algorithm 2.1 focuses on an infeasibility sphere centred at a generated interior point. If this point is shown to be infeasible for at least one of the constraints, a new infeasibility sphere is generated. The advantage of using an interior point is that its distance to the farthest vertex is smaller than the largest distance between two vertices of the simplex.

ε-Infeasibility

The ε-robustness requirement gives the opportunity to reject a simplex C_k that is close enough to an infeasible solution. A simplex C_k with an infeasible point x such that max_s ‖x − v_{k,s}‖ < ε cannot contain an ε-robust solution.

2.3 New bi-blending rejection rules

Simplices of the working lists are checked on the possibility of containing a (linearly) feasible and ε-robust solution for a blending problem. Due to bi-blending, we should also test whether the solutions can be combined with recipes of the other product. For this, one should check whether sufficient raw material is available to combine the products and whether cheaper combinations have already been found. This is done by the Capacity and Pareto tests.

2.3.1 Capacity test

In the QBB problem, two recipes share raw material. Two mixtures, one for each product, cannot be a valid combination if there is not enough raw material available to produce them.

For each product j, we store a lower bound on the amount of material i that is necessary to manufacture it:

\[ \beta^L_{i,j} = D_j \times \min_{v \in C \in \Lambda_j \cup \Omega_j} v_i, \]


where v ∈ C are the vertices of simplex C. If the vertex which provides the lower bound is removed, all vertices belonging to a simplex C ∈ Λ_j ∪ Ω_j are examined to find a new lower bound β^L_{i,j}. The examined simplices need not be completely feasible, because a simplex with all infeasible vertices can contain a solution if it is big enough. Therefore, infeasible vertices can determine the lower bound of material i for product j.

For each raw material i in simplex C_k of product j, the capacity constraint is satisfied if

\[ D_j \times b^L_i(C_k) + \beta^L_{i,j'} \le B_i, \qquad (2.13) \]

where j′ denotes the other product.

2.3.2 Pareto test

Every time a new vertex satisfying (2.3), (2.4) and (2.5) is generated by the branching rule (see Section 2.2.1), it is combined with all vertices of the other product that also meet the individual restrictions, to check the existence of a combination that satisfies the capacity constraints and improves F^U_p. Hence, it is necessary to have a set that contains the feasible and ε-robust vertices for each product in order to speed up the possible updating of F^U_p.

If a pair of mixtures x ∈ C ∈ Λ_1 ∪ Ω_1 and y ∈ C ∈ Λ_2 ∪ Ω_2 has been found with D_1 f(x) + D_2 f(y) < F^U_{ω(x,y)}, the value of F^U_p is updated for p = ω(x, y), ..., n, and the pair (x, y) is stored as a valid solution. The value ω(x, y) denotes the number of distinct raw materials involved in x and y, defined by Equation (2.7).

For each group of raw materials u and product j, we store values ϕLu,j corresponding to

a lower bound on the cost to manufacture Dj units:

ϕLu,j = Dj ×min f(v) : v ∈ C ⊆ Pu,j , C ∈ Λj ∪ Ωj . (2.14)

The vector ϕ^L_{u,j} contains the cost value of the cheapest mixture in the non-rejected simplices for face P_{u,j}, u = 1, . . . , 2^n − 1. One has to check whether ϕ^L_{u,j} must be updated when a new vertex is generated or deleted.

A simplex C_k of product j passes the Pareto test (it is not rejected) if for the other product j′ there exists a face P_{u,j′} with lower bound ϕ^L_{u,j′} such that

D_j f^L(C_k) + ϕ^L_{u,j′} ≤ F^U_{ω(x,y)};    x ∈ C_k, y ∈ P_{u,j′}.    (2.15)
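The test can be sketched in C as follows, under the assumption that each face P_u is encoded as a bitmask u of the raw materials it uses, and that ω(x, y) is then the size of the union of the masks of C_k and u; this encoding and all names are illustrative, not taken from the thesis code:

#include <stdbool.h>

#define N 4                     /* total number of raw materials (example) */
#define NFACES ((1u << N) - 1)  /* faces P_u, u = 1, ..., 2^n - 1 */

/* Distinct raw materials of a combination: popcount of the union of the two
 * face bitmasks (an assumption on how omega(x, y) follows from the encoding). */
static int omega_of(unsigned mask_k, unsigned u)
{
    unsigned m = mask_k | u;
    int c = 0;
    while (m) { c += m & 1u; m >>= 1; }
    return c;
}

/* Pareto test (2.15): simplex Ck of product j survives if some face u of the
 * other product j' offers a partner mixture cheap enough to stay below the
 * current Pareto front F^U. */
bool pareto_test(double D_j, double flo_Ck, unsigned mask_Ck,
                 const double phi_lo[NFACES + 1], /* phi^L_{u,j'}   */
                 const double FU[N + 1])          /* F^U_p, p = 1..n */
{
    for (unsigned u = 1; u <= NFACES; u++) {
        int p = omega_of(mask_Ck, u);
        if (D_j * flo_Ck + phi_lo[u] <= FU[p])
            return true;   /* a non-dominated combination may still exist */
    }
    return false;          /* reject Ck */
}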

2.4 Final testing after the completion of the algorithm

Lists Ω_j, j = 1, 2, contain simplices that have not been thrown out. During the execution of Algorithm 2.1, lower bounds β^L_{i,j} and ϕ^L_{u,j} are updated based on non-rejected vertices. A simplex C ∈ Ω_j, j = 1, 2, should satisfy ∃C′ ∈ Ω_{j′} such that

D_j f^L(C) + D_{j′} f^L(C′) ≤ F^U_{ω(x,y)};    x ∈ C, y ∈ C′,    (2.16)

and

b^L_i(C) + b^L_i(C′) ≤ B_i;    i = 1, . . . , n.    (2.17)

One could use these relations to remove (filter out) those simplices C in the final lists that cannot contain a part of a Pareto pair.

The result of Algorithm 2.1 is a set of ε-guaranteed Pareto bi-blending recipe pairs x_p with their corresponding costs F^U_p, p = 1, . . . , n, and lists Ω_j, j = 1, 2, that contain


Algorithm 2.2 Combination algorithm

1: for j = 1, 2 do
2:   for all C ∈ Ω_j not tagged as valid do
3:     if ∃C′ ∈ Ω_{j′} that satisfies (2.16) and (2.17) then
4:       Tag C′ as valid
5:       Continue with the next C    ▷ Remaining C′ ∈ Ω_{j′} are not visited
6:     else
7:       Remove C
8:     end if
9:   end for
10: end for

mixtures that have not been thrown out. During the execution of Algorithm 2.1, lower bounds β^L_{i,j} and ϕ^L_{u,j} are updated based on non-rejected vertices to discard simplices that do not satisfy (2.13) or (2.15). These lower bounds are used to avoid expensive computation related to the combination of simplices of both products.

Once Algorithm 2.1 has finished, there may exist simplices in Ω_j, j = 1, 2, that do not contain a solution. Algorithm 2.2 combines simplices of the two products in order to reduce the solution area [53], which can be used for finding more accurate Pareto-optimal solutions. The set of combinations left over after running Algorithm 2.2 can be used as input data for a second execution with higher accuracy. Section 2.5 elaborates on this idea.

Algorithm 2.2 checks for each simplex C ∈ Ω_j, j = 1, 2, whether there exists C′ ∈ Ω_{j′} such that (2.16) and (2.17) are met. Otherwise, C is rejected. In the first iteration (j = 1) of the outer loop, a simplex C′ is marked as valid if it is used to validate (2.16) and (2.17). This means it will not be checked in the next iteration (j = 2). The computational effort to perform this final test depends on the sizes of Ω_j, j = 1, 2, which can be huge.
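A compact C sketch of Algorithm 2.2 with this tagging scheme, under the same assumed data layout as before (bitmask material encoding, arrays for B_i and F^U_p); it mirrors the control flow of the algorithm, not the thesis implementation:

#include <stdbool.h>
#include <stddef.h>

#define N 4  /* number of raw materials (example) */

/* Minimal stand-in for a final-list simplex record. */
typedef struct {
    double   flo;      /* f^L(C) */
    double   blo[N];   /* b^L_i(C) */
    unsigned mask;     /* raw materials used (bitmask, assumed encoding) */
    bool     valid;    /* tagged: already part of a validated pair */
    bool     removed;  /* tagged for deletion */
} Simplex;

static double Dj[2] = {1.0, 1.0};  /* demands (example values) */
static double B[N];                /* availability B_i */
static double FU[N + 1];           /* F^U_p */

static int popcount(unsigned m) { int c = 0; while (m) { c += m & 1u; m >>= 1; } return c; }

/* Check (2.16) and (2.17) for a pair (C, C') with C of product j. */
static bool pair_ok(int j, const Simplex *c, const Simplex *cp)
{
    int p = popcount(c->mask | cp->mask);             /* omega(x, y) */
    if (Dj[j] * c->flo + Dj[1 - j] * cp->flo > FU[p]) /* (2.16) */
        return false;
    for (int i = 0; i < N; i++)                       /* (2.17) */
        if (c->blo[i] + cp->blo[i] > B[i])
            return false;
    return true;
}

/* Algorithm 2.2: C stays in Omega_j only if some C' in Omega_{j'} pairs
 * with it; a found partner is tagged valid so the second pass skips it. */
void combine(Simplex *om[2], size_t len[2])
{
    for (int j = 0; j < 2; j++) {
        for (size_t a = 0; a < len[j]; a++) {
            Simplex *c = &om[j][a];
            if (c->removed || c->valid) continue;
            bool kept = false;
            for (size_t b = 0; b < len[1 - j] && !kept; b++) {
                Simplex *cp = &om[1 - j][b];
                if (cp->removed) continue;
                if (pair_ok(j, c, cp)) {
                    cp->valid = true;   /* remaining C' are not visited */
                    kept = true;
                }
            }
            if (!kept) c->removed = true;  /* Remove C */
        }
    }
}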

2.5 Iterative-descending B&B strategy

The use of global bounds per product for cost and material usage allows the rejection of subproblems from one product in Algorithm 2.1. Algorithm 2.2 rejects further, because it takes subproblem bounds into account instead of product bounds. It requires checking all possible combinations of subproblems between products.

We investigate a strategy that uses Algorithm 2.2 several times during Algorithm 2.1. This is done using a stepwise descending accuracy value, represented by α in the termination criterion (width(C) ≤ α), until the desired accuracy value γ is reached. In each descending step, Algorithms 2.1 and 2.2 are executed. Algorithm 2.3 shows the process. Data structures and variables are initialized in the same way as lines 1–5 of Algorithm 2.1. Variable α is initialized in line 2 to half the width of the search space. B&B search is performed with a precision of α and the per-step final lists are filtered, obtaining better lower bounds per product and fewer simplices in the working lists for the next iteration. This process is repeated using α := α/2 until α = γ.

In this way, the rejection performed in Algorithm 2.2 results in better global lower bounds per product, which helps the rejection of simplices in Algorithm 2.1. In addition, the rejection of simplices reduces the size of the working and final lists. The computational cost of updating global lower bounds per product and the execution time of Algorithm 2.2 are thereby also reduced.


Algorithm 2.3 Iterative α-descending B&B algorithm for the QBB problem

1: Initialization of lists and variables (see lines 1–5 of Algorithm 2.1)
2: α := (√2 − 2md)/2
3: while α ≠ γ do
4:   Perform Algorithm 2.1 with width(C) ≤ α as termination criterion
5:   Apply Algorithm 2.2 to the final lists
6:   Λ_1 := Ω_1; Ω_1 := ∅
7:   Λ_2 := Ω_2; Ω_2 := ∅
8:   α := max(α/2, γ)
9: end while

10: Perform B&B search with width(C) ≤ γ as termination criterion
11: Apply Algorithm 2.2 to the final lists
12: return x_p and F^U_p, p = 1, . . . , n, and Ω_j, j = 1, 2

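The driver loop of Algorithm 2.3 translates almost directly into C; the following is a runnable sketch with stub phases, using the example values md = 0.03 and γ = √2/150 from Section 2.6:

#include <math.h>
#include <stdio.h>

/* Stubs standing in for the real phases (Algorithms 2.1 and 2.2). */
static void bb_phase(double alpha)      { printf("B&B down to width %g\n", alpha); }
static void combination_phase(void)     { printf("filter final lists\n"); }
static void move_final_to_working(void) { /* Lambda_j := Omega_j; Omega_j := empty */ }

/* Algorithm 2.3 as a driver loop: alpha halves until it reaches gamma
 * exactly (fmax assigns gamma itself, so the equality test is safe). */
int main(void)
{
    double md = 0.03;                            /* minimal dose, Section 2.6 */
    double gamma = sqrt(2.0) / 150.0;            /* desired accuracy (example) */
    double alpha = (sqrt(2.0) - 2.0 * md) / 2.0; /* half the search width */

    while (alpha != gamma) {
        bb_phase(alpha);
        combination_phase();
        move_final_to_working();
        alpha = fmax(alpha / 2.0, gamma);
    }
    bb_phase(gamma);                             /* final search at accuracy gamma */
    combination_phase();
    return 0;
}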

2.6 Evaluation

In this section, Algorithms 2.1, 2.2 and 2.3 are illustrated numerically. The experiments have been carried out with different instances based on three-dimensional and five-dimensional cases from industry. For each instance, different capacity constraints are used in order to generate different cases or environments. The specifications (environments) related to robustness and capacity constraints are denoted by Ei (see Appendix A). The effectiveness of the Capacity and Pareto tests is measured. Figures show the search space of the test cases after finishing Algorithm 2.1. We measure the following:

1. Statistics about the number of simplices and vertices handled by the algorithm.
2. Information about the number of discarded simplices and the reason for their rejection.
3. General statistics of the algorithm, such as the running time and the memory usage.

The notation used is:

• NSimplex: Number of evaluated simplices.
• NVertex: Number of evaluated vertices.
• NVertexFeas: Number of feasible and ε-robust vertices found during the search.
• EndNSimplex: Number of simplices in final lists Ω_j, j = 1, 2.
• EndNVertex: Number of vertices associated to simplices in Ω_j, j = 1, 2.
• LC: Number of simplices rejected by Linear infeasibility (2.3).
• ε-QInfeas.: Number of simplices rejected by Quadratic infeasibility (2.4) or lack of any ε-robust mixture (2.5).
• Pareto: Number of simplices rejected by the Pareto test (2.15).
• Capacity: Number of simplices rejected by the Capacity test (2.13).
• ParetoFilt: Number of simplices filtered by Algorithm 2.2 (2.16).
• CapacityFilt: Number of simplices filtered by Algorithm 2.2 (2.17).
• TimeBB: The wall-clock time (in seconds) for Algorithm 2.1.
• TimeComb: The wall-clock time (in seconds) for Algorithm 2.2.
• Memory: The memory requirement of the algorithm.


2.6.1 Three-dimensional cases

For the graphical illustration, instances consist of three-dimensional products sharing two raw materials. Low-dimensional cases are selected for testing and drawing purposes.

Case1 & Case2

Both products are taken from [41]. The first one, called Case1 (j = 1), has two linear constraints and two quadratic constraints. In [41], Case1 is called RumCoke. The second one, called Case2 (j = 2), has five quadratic constraints. A detailed description of these two mixtures is given in Appendix A.

The raw materials involved in Case2 (j = 2) are RM1 (i = 1), RM2 (i = 2) and RM3 (i = 3). The raw materials involved in Case1 (j = 1) are RM1, RM2 and RM4. Therefore, the shared components are RM1 and RM2. In contrast to the example described in Section 2.1.2, component x_{3,1} refers to RM4 and not to RM3.

We analysed this test problem with the following requirements. The semi-continuity was set to a minimum dose value of md = 0.03, the accuracy to α = ε, and the demand of each product is D^T = (1, 1). Four environments are designed, varying the robustness and the availability of raw materials:

E1) Robustness ε = √2/100. No capacity constraints.
E2) Robustness ε = √2/150. No capacity constraints.
E3) Robustness ε = √2/150. B_1 = 0.86 units available.
E4) Robustness ε = √2/150. B_1 = 0.80 units available.

Table 2.1 shows the numerical results obtained from running the algorithm for the set of environments described above. In the table, Case1 is abbreviated as C1, Case2 as C2, and so on. For all environments, the computational time to solve them is less than one second.

For E1, no solution exists, because there is no mixture in Case1 with robustness ε = √2/100. The absence of any robust solution means F^U_p is never updated, and therefore the Pareto test is not used.

In E2, with required robustness ε = √2/150, we obtain a solution using four raw materials. As soon as ε-robust mixtures are found, the Pareto test is able to discard large portions of the search region and the number of vertices evaluated is reduced. On the other hand, the ε-Infeasibility test is only efficient when simplices are small enough, i.e., when they are deep enough in the search tree. The solution found for E2 is

x_4 = (         RM1       RM2       RM3       RM4
        Case1   0.563203  0.357031  –         0.079766
        Case2   0.570312  0.282383  0.147305  –        )^T

and the cost value F(x_4) = 0.625305 + 0.402004 = 1.027309. In both products, the maximum amount of RM1 (0.563203 + 0.570312 = 1.133515) is used, since this is the cheapest ingredient and its availability is not bounded. This means that, for products Case1 and Case2, the cheapest mixtures can be made, i.e., if the capacity constraint is not limited, we can write the bi-blending problem as two separate blending problems.

When the capacity restrictions come into play, the problem is more challenging. In E3, B_1 = 0.86 < 1.133515. The number of evaluated vertices increases, because the vertices with low cost, which are evaluated first according to the selection rule (Section 2.2.4), use a large amount of the cheapest RM1, the scarce raw material.


Figure 2.4: Case1 & Case2 with a capacity restriction in RM1 (B_1 = 0.86). Panels: (a) Case1, (b) Case1 (zoom), (c) Case2, (d) Case2 (zoom).


Table 2.1: Numerical results of three-dimensional cases

Test problem     C1 & C2                            C3 & C4
Environment      E1       E2       E3       E4      –

NSimplex         2,258    1,616    2,822    2,178   3,088
NVertex          685      515      874      690     912
NVertexFeas      44       77       99       79      528
EndNSimplex      363      332      551      406     1,078
EndNVertex       255      224      373      280     646

LC               2        2        2        2       0
ε-QInfeas.       771      443      837      727     334
Pareto           0        38       2        0       135
Capacity         0        0        26       27      4

ParetoFilt       0        8        25       0       364
CapacityFilt     0        0        320      394     599

TimeBB           < 1 s    < 1 s    < 1 s    < 1 s   < 1 s
TimeComb         < 1 s    < 1 s    < 1 s    < 1 s   < 1 s
Memory           321 KB   250 KB   405 KB   313 KB  548 KB

The solution found for E3 is

x_4 = (         RM1       RM2       RM3       RM4
        Case1   0.513438  0.378359  –         0.108203
        Case2   0.346367  0.456563  0.197070  –        )^T

and the cost value F(x_4) = 0.749008 + 0.551301 = 1.300309. The solution meets the capacity constraint imposed, since 0.513438 + 0.346367 = 0.859805 < 0.86.

Figure 2.4 depicts the search space of E3 after the completion of Algorithm 2.1. The cost of the raw materials increases with their index. Each quadratic constraint (2.4) has been represented by a different white dashed line. The final simplices (belonging to Ω_j, j = 1, 2) are given in green. The other simplices have been discarded according to the rejection rule (Section 2.2.5). The meaning of each colour of the rejected simplices is given next:

• Grey: Simplices rejected by linear infeasibility.
• Blue: Simplices rejected by quadratic infeasibility or lack of robustness.
• Brown: Simplices rejected by Pareto dominance.
• Orange: Simplices rejected by capacity constraints.

The solution vertices are highlighted by a circle of diameter ε. Each solution pair has a different colour (if any): yellow, magenta, etc.

We can observe in Figure 2.4d a small number of large simplices rejected by raw material shortage due to the Capacity test (2.13). If one performs the filtering described in Section 2.4, the number of final simplices rejected by capacity constraints increases, but this does not improve the Pareto solution found.

Finally, E4 has no solution due to the scarcity of RM1. The only difference between the E4 and E3 environments is that the available amount of RM1 for E4 is less than for E3.


Figure 2.5: Case3 & Case4 with a capacity restriction in RM1 (B_1 = 0.86). Panels: (a) Case3, (b) Case3 (zoom), (c) Case4, (d) Case4 (zoom).

Case3 & Case4

The previous test problem only has solutions for four materials. A new test instance, which has solutions for different numbers of materials, is evaluated. It consists of two products: Case3 and Case4. Both have four quadratic constraints. Case3 makes use of RM1, RM2 and RM4; Case4 makes use of RM2, RM3 and RM4. The availability of RM2 is restricted to 1.35 units, while the others are not limited. More details about these products are given in Appendix A. The minimal dose is set to md = 0.03, the robustness is ε = √2/150, the demand D = (1, 1)^T and the accuracy α = ε. Table 2.1 shows the numerical results obtained by the algorithm. Figure 2.5 shows the search space for this test problem.

In contrast to the previous test problem, two solutions have been found with a different number of raw materials involved: x_3 and x_4. The first one (drawn in yellow in Figure 2.5) uses three raw materials (RM2, RM3 and RM4):

x_3 = (         RM1       RM2       RM3       RM4
        Case3   0.0       0.823125  –         0.176875
        Case4   –         0.524101  0.374805  0.101094 )^T

Its cost value is F(x_3) = 1.283688 + 1.146051 = 2.429739. The second one (drawn in magenta) involves four raw materials:

x_4 = (         RM1       RM2       RM3       RM4
        Case3   0.224609  0.775391  –         0.0
        Case4   –         0.573867  0.349922  0.076211 )^T

Its cost value is F(x_4) = 0.565234 + 1.056473 = 1.621707.

There are two solutions: although the cost value of the three-material solution is bigger than that of the four-material solution, both solutions are ε-guaranteed Pareto-optimal (2.9), because the algorithm minimizes both the cost and the number of raw materials used.

2.6.2 Five-dimensional cases

A five-dimensional instance is presented to show the effectiveness of the described method. Instances with more dimensions have a memory requirement exceeding the available memory on a conventional machine. We have used a pair of five-dimensional products, called Uni5Spec1 and Uni5Spec5b. Both of them are modifications of two seven-dimensional instances (Uni7Spec1 and Uni7Spec5b, respectively) taken from [41] by removing raw materials 6 and 7 from the cases; a complete description of both products can be found in Appendix A. This instance was solved with a robustness ε = √2/100, an accuracy α = ε, and a minimal dose md = 0.03. The demand of each product is D^T = (1, 1). Capacity constraints for each environment are:

E1) No capacity constraints.
E2) B_1 = 0.62 units available.
E3) B_3 = 0.6 units available.
E4) B_4 = 0.21 units available.
E5) B_5 = 0.01 units available.
E6) B_1 = 0.62 and B_3 = 0.6 units available.

Detailed descriptions and the solutions of the instances can be found in Appendix A. The experimental results (shown in Table 2.2) demonstrate that the algorithm is able to solve medium-size instances on a commodity computer. Notice that in fact we are obtaining a solution of a ten-dimensional planning problem by viewing it as a bi-problem. The most time-consuming part of the procedure is related to Algorithm 2.2, as Table 2.2 indicates. Previous experimentation shows that the algorithm is faster if the first list to be filtered is the largest one. In [43], it is shown that this behaviour is related to cache issues. Hence, the wall-clock times shown in Table 2.2 correspond to runs in which the algorithm begins with the largest list.


Table 2.2: Numerical results of five-dimensional cases

Test problem     U5-1 & U5-5b
Environment      E1         E2         E3         E4         E5       E6

NSimplex         2,755,830  2,337,300  2,998,342  1,786,858  106,178  2,536,862
NVertex          180,937    156,759    194,540    119,407    14,944   168,186
NVertexFeas      3,123      3,092      3,445      2,643      403      3,414
EndNSimplex      253,469    233,707    200,528    175,706    16,257   175,660
EndNVertex       37,275     33,445     29,390     26,899     4,584    24,861

LC               11,218     11,194     11,218     11,108     837      11,194
ε-QInfeas.       1,021,312  825,374    1,090,486  661,694    33,784   876,415
Pareto           91,947     77,880     96,015     34,995     2,220    81,331
Capacity         0          20,526     100,955    9,957      22       123,862

ParetoFilt       27,569     27,332     27,456     14,782     586      27,284
CapacityFilt     0          3,888      100,950    0          0        105,499

TimeBB           6.46 s     5.72 s     7.45 s     4.30 s     0.22 s   6.25 s
TimeComb         263.24 s   259.87 s   260.95 s   171.68 s   0.30 s   250.07 s
Memory           191 MB     172 MB     215 MB     125 MB     13 MB    193 MB

2.6.3 Seven-dimensional cases

We also ran the B&B algorithm on a combination of the original seven-dimensional problems from [41]. Numerical results are provided in Table 2.3. Only Algorithm 2.1 has been executed, because the available memory is not large enough to store the final lists. Therefore, a simplex is discarded when the termination rule applies, and Algorithm 2.2 is not executed. This explosion in memory requirement provides a challenge for future research on possible ways to handle large data structures on High Performance Computing platforms.

2.6.4 Experimental results for the iterative-descending algorithm

Table 2.4 shows the numerical results obtained by Algorithm 2.3. The desired accuracy γ is the same as defined in previous sections. The table is divided horizontally into two parts: the first part gathers all the three-dimensional cases and the second one the five-dimensional ones. Due to the high computational burden of seven-dimensional problems, they have not been considered here. The headers of Table 2.4 show:

NEvalS: number of evaluated simplices.
NEvalV: number of evaluated vertices.
NFiltS: number of simplices filtered by Algorithm 2.2.
Time: execution time in seconds.

It can be observed that for all instances, the number of evaluated simplices is larger compared to the data in Tables 2.1 and 2.2. The reason is that the new selection criterion can be seen as a combination of breadth-first and hybrid search; it only progresses a few levels in the search tree.


Table 2.3: Numerical results of seven-dimensional cases

Test problem    U7-1 & U7-5b
Environment     E1            E2            E3            E4             E5         E6

NSimplex        5,070 · 10^6  5,083 · 10^6  5,278 · 10^6  10,096 · 10^6  55 · 10^6  5,290 · 10^6
NVertex         59 · 10^6     57 · 10^6     61 · 10^6     55 · 10^6      1 · 10^6   59 · 10^6
NVertexFeas     1 · 10^6      1 · 10^6      1 · 10^6      1 · 10^6       31,575     1 · 10^6

LC              13 · 10^6     14 · 10^6     14 · 10^6     51 · 10^6      170,389    14 · 10^6
ε-QInfeas.      1,472 · 10^6  1,438 · 10^6  1,522 · 10^6  2,611 · 10^6   22 · 10^6  1,486 · 10^6
Pareto          5 · 10^6      5 · 10^6      14 · 10^6     0              138,535    3 · 10^6
Capacity        0             4 · 10^6      78            1 · 10^6       95         5 · 10^6

TimeBB (s)      50,622        40,920        56,375        35,686         223.54     42,189
Memory          4.21 GB       4.18 GB       4.19 GB       2.91 GB        0.94 GB    4.17 GB

Table 2.4: Numerical results for the iterative-descending method

                   NEvalS     NEvalV   NFiltS   Time

C1 & C2    E1      2,258      685      0        < 1
           E2      2,208      679      8        < 1
           E3      1,958      608      156      < 1
           E4      1,042      349      67       < 1

C3 & C4    –       1,538      490      248      < 1

U1 & U5    E1      3,170,698  182,170  6,118    154
           E2      2,881,578  165,733  24,369   165
           E3      2,742,576  159,874  154,228  172
           E4      1,840,030  116,872  11,213   160
           E5      112,614    15,420   444      < 1
           E6      2,396,492  139,613  168,660  156

Notice that the number of evaluated vertices decreases with the new strategy. This means that the number of shared vertices increases.

The number of simplices filtered out by Algorithm 2.2 is similar to that obtained in Tables 2.1 and 2.2, but the number of checked combinations is less than half that number (these data do not appear in the table). Now simplices can be filtered in earlier iterations of Algorithm 2.3. This also considerably reduces the memory management. Smaller data structures cause less computational effort.

2.7 Summary

This chapter presents several procedures for solving mixture problems for two products with design and capacity constraints, the bi-blending problem. The B&B algorithm works with partition sets in both design spaces. New tests on Pareto optimality and capacity constraints have been developed and implemented.


The B&B algorithm has been used to obtain the ε-guaranteed Pareto-optimal solutions. Algorithm 2.1 provides final lists of simplices containing the solutions. In order to reduce these regions, Algorithm 2.2 performs an exhaustive filtering of the final lists, discarding simplices that are not feasible according to the Pareto optimality and capacity constraints.

The numerical experiments showed that the size of the data structures is an important factor to take into consideration. Large structures may produce cache misses at run time, slowing down the execution. Therefore, as in all B&B algorithms, it is desirable to reject subproblems as soon as possible. The iterative-descending B&B strategy does so for the QBB problem.

Results have been shown for problems with two products, laying the foundations for solving problems with a larger number of products, the multi-blending problem.


CHAPTER 3

Parallelization of the bi-blending algorithm

In this chapter, we study the parallelization of the different phases of the sequential bi-blending algorithm and focus on the most time-consuming phase, analysing the performance of several strategies.

3.1 Parallel strategy

The process described in Chapter 2 to solve the quadratic bi-blending problem has two independent phases: the Branch and Bound (B&B) phase (Algorithm 2.1 of Chapter 2) provides lists Ω_1 and Ω_2 with simplices that reached the termination criterion; the Combination phase (Algorithm 2.2) filters out simplices without solutions. The computational characteristics of Algorithms 2.1 and 2.2 are completely different. While Algorithm 2.1 works with irregular data structures, Algorithm 2.2 works with more regular ones. Algorithm 2.2 is run after finishing Algorithm 2.1. Hence, parallel models of both algorithms are analysed separately.

The number of final simplices of Algorithm 2.1 depends on several factors: the dimension, the accuracy α of the termination rule, the feasible region of the instances to solve, etc. Preliminary experimentation shows that this number of final simplices can be relatively large. Algorithm 2.2 is computationally much more expensive than Algorithm 2.1. Therefore, we first study the parallelization of Algorithm 2.2.

Algorithm 2.2 uses a nested loop and two lists Ω_1 and Ω_2. For each simplex C ∈ Ω_j, a simplex C′ ∈ Ω_{j′} must be found that satisfies (2.16) and (2.17) to keep C on the list. In the worst case (when the simplex can be removed), list Ω_{j′} is explored completely (all simplices C′ ∈ Ω_{j′} are examined).

Let Pos(C, Ω_j) denote the position of simplex C in Ω_j, let NTh be the total number of threads, and let Th be the thread number, Th = 0, . . . , NTh − 1. There are several ways to build a parallel threaded approach of Algorithm 2.2. In this chapter we deal with two strategies:


Strategy 1: apply NTh/2 threads to each list Ω_j, j = 1, 2. Thus, iterations 1 and 2 of the outer loop are performed concurrently. This strategy requires NTh ≥ 2. Each thread Th checks the simplices C ∈ Ω_j with Pos(C, Ω_j) mod (NTh/2) = Th mod (NTh/2). After exploring both lists, the deletion of the simplices is performed by one thread per list Ω_j.

Strategy 2: apply NTh threads to the inner loop to perform an iteration of the outer loop. Each thread Th checks the simplices C ∈ Ω_j that meet Pos(C, Ω_j) mod NTh = Th. Now the idea is to check just one list in parallel, removing non-feasible simplices before exploring the other list. Deletion of the simplices (tagged for this purpose) is only performed by one of the threads at the end of each iteration j.

To avoid contention between threads during the exploration of Ω_j, simplices are not deleted (line 7 of Algorithm 2.2) but tagged to be removed. Otherwise, the list could be modified by several threads when simplices are removed, requiring the use of mutual exclusion. This is a disadvantage for Strategy 1, because its number of comparisons between simplices is bigger than that of Strategy 2, which removes the non-valid simplices after the first iteration. For the second iteration, the number of simplices in the final lists is reduced, and therefore the number of combinations as well.
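Strategy 2 can be sketched with the POSIX Threads API used for the implementation; the record layout, the placeholder pair test, the toy data and all names are illustrative, not the thesis code:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NTH 4  /* number of threads (example) */
#define LEN 8  /* list lengths (example)      */

/* Minimal stand-in for the simplex record of Chapter 2. */
typedef struct { double flo; bool removed; } Simplex;

static Simplex list0[LEN], list1[LEN];
static Simplex *omega[2]   = { list0, list1 };
static size_t omega_len[2] = { LEN, LEN };

/* Placeholder pair test: the real code checks (2.16) and (2.17). */
static bool pair_ok(const Simplex *c, const Simplex *cp)
{
    return c->flo + cp->flo <= 0.5;  /* illustrative condition only */
}

typedef struct { int th, j; } Arg;

/* Strategy 2 worker: thread th checks simplices C in Omega_j with
 * Pos(C, Omega_j) mod NTh == th; failures are tagged, not deleted. */
static void *worker(void *p)
{
    Arg *a = p;
    int jp = 1 - a->j;
    for (size_t i = (size_t)a->th; i < omega_len[a->j]; i += NTH) {
        Simplex *c = &omega[a->j][i];
        bool kept = false;
        for (size_t b = 0; b < omega_len[jp] && !kept; b++)
            if (!omega[jp][b].removed && pair_ok(c, &omega[jp][b]))
                kept = true;
        if (!kept) c->removed = true;  /* one thread deletes tags later */
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < LEN; i++) {
        list0[i].flo = 0.1 * i;  /* toy data */
        list1[i].flo = 0.1 * i;
    }
    for (int j = 0; j < 2; j++) {  /* outer loop: one list at a time */
        pthread_t t[NTH]; Arg a[NTH];
        for (int th = 0; th < NTH; th++) {
            a[th] = (Arg){ th, j };
            pthread_create(&t[th], NULL, worker, &a[th]);
        }
        for (int th = 0; th < NTH; th++)
            pthread_join(t[th], NULL);
        /* here a single thread would physically delete the tagged
         * simplices before the other list is explored */
    }
    for (int j = 0; j < 2; j++)
        for (size_t i = 0; i < omega_len[j]; i++)
            if (omega[j][i].removed)
                printf("product %d: simplex %zu removed\n", j, i);
    return 0;
}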

A difficulty of parallelizing Algorithm 2.1 is that the pending computational work for the B&B search of one product is not known beforehand, i.e., it is an irregular algorithm. The search in one product is affected by the information shared with the other product. Moreover, the computational cost of the search in each product can be quite different due to the different design requirements. A study on the prediction of the pending work in B&B Interval Global Optimization algorithms can be found in [9]. Although the authors describe their experience with parallel B&B algorithms [14, 26, 60, 83], these papers tackle only one B&B algorithm. QBB actually uses two B&B algorithms, one for each product, sharing β^L_{i,j}, ϕ^L_{u,j} and F^U_p (see Equations (2.13) and (2.15)). The problem is to determine how many threads to assign to each product if we want both parallel B&B executions to spend approximately the same (or similar) computing time. This will be addressed in a future study. Preliminary results show that the B&B phase is computationally negligible when compared to the Combination phase. Therefore, we will use just one static thread per product. This allows us to illustrate the challenge of load balancing.

3.2 Experimental results

To evaluate the performance of the parallel versions of Algorithms 2.1 and 2.2, we have used the instances investigated in Chapter 2. The algorithms were coded in C, and the POSIX Threads API was used to create and manipulate threads. Previous studies, such as those presented in [83], show a less-than-linear speedup using OpenMP for B&B algorithms. Nevertheless, a study on the parallelization of the combinatorial phase with OpenMP pragmas may be addressed in the future.

3.2.1 B&B phase in parallel

Table 3.1 provides information about the computational work performed by the parallel version of Algorithm 2.1 in terms of the number of evaluated simplices (NEvalS) and vertices (NEvalV). In addition, performance metrics like the wall-clock time in seconds for threads 1 and 2, and the achieved speedup, are provided. For three-dimensional cases, the parallelization is not necessary due to the low wall-clock time required by the sequential Algorithm 2.1.


Table 3.1: Computational effort of the B&B phase using two working threads

                     NEvalS          NEvalV      Time th. 1  Time th. 2  Speedup

C1 & C2        E1    2,258           685         < 1         < 1         –
               E2    1,730           551         < 1         < 1         –
               E3    2,780           862         < 1         < 1         –
               E4    2,252           713         < 1         < 1         –

C3 & C4        –     3,088           912         < 1         < 1         –

U5-1 & U5-5b   E1    2,755,830       180,937     0.48        6.19        1.03
               E2    2,337,768       156,846     0.52        5.59        1.03
               E3    2,998,342       194,540     0.52        7.39        1.01
               E4    1,786,858       119,407     0.48        4.06        1.07
               E5    107,160         15,203      0.04        0.21        1.00
               E6    2,537,430       168,299     0.51        6.18        1.02

U7-1 & U7-5b   E1    7,488,323,768   67,538,740  29,549.81   29,507.38   1.71
               E2    6,873,369,016   64,406,706  24,177.11   24,162.52   1.69
               E3    7,596,754,748   68,459,252  31,628.45   31,594.87   1.78
               E4    10,096,582,142  55,235,505  36,290.96   201.43      0.98
               E5    59,240,690      1,873,272   112.49      112.31      1.99
               E6    7,117,486,484   66,249,133  25,131.01   25,111.68   1.68

Nevertheless, the parallel algorithm has been executed to see the possible occurrence of search anomalies, reflected in the number of evaluated simplices/vertices (see Section 1.2.5). Differences between the sequential and parallel execution are negligible for three- and five-dimensional cases, so there are neither detrimental nor incremental anomalies in the B&B phase. For five-dimensional cases, the execution time is neither reduced nor increased due to the difference in feasible area between both products: Uni5Spec1 has simpler quadratic requirements compared to Uni5Spec5b; thread Th = 1 spends less than a second on exploring the entire search space of Uni5Spec1, while thread Th = 2 spends several seconds to finalize the search space exploration of Uni5Spec5b. For seven-dimensional cases, a reduction is achieved, although the number of evaluations is greater than in the sequential execution. Notice that the number of threads used is two, limiting the speedup. Future versions of the algorithm can use more threads per product.

3.2.2 Combination phase in parallel

Table 3.2 shows the speedup of the parallel versions of Algorithm 2.2 for 2, 4, 8, and 16 threads, applying the strategies described in Section 3.1. In order to make a fair comparison, the lists generated by the sequential B&B phase have been used as input for the parallel version of Algorithm 2.2.

According to the results shown in Table 3.2, Strategy 2 gives the best performance. As commented in Section 3.1, Strategy 1 will never achieve a linear speedup, because the work performed by this strategy is greater than the work performed by Strategy 2. Nevertheless, the scalability is good when the number of threads is increased. In addition, Strategy 1 has threads working on elements of Ω_1 and Ω_2 simultaneously, where, for each element of one list, the comparison is done with elements of the other list until a valid combination is found or the complete list has been checked (the worst case).


Table 3.2: Speedup obtained in the Combination phase

      Strategy 1                       Strategy 2
      2 th.  4 th.  8 th.  16 th.      2 th.  4 th.  8 th.  16 th.

E1    0.9    2.0    3.9    7.4         2.0    4.0    5.3    11.7
E2    0.9    2.0    3.9    7.2         2.0    4.0    7.6    11.7
E3    0.7    2.0    3.5    5.6         2.0    4.0    5.6    12.3
E4    1.0    2.0    4.0    7.6         2.0    4.0    7.6    11.6
E6    0.7    2.0    3.5    5.5         2.0    4.0    5.5    12.2

This requires many elements of both lists to be cached, causing cache misses [43]. On the other hand, Strategy 2 uses all threads for checking elements of one list. In this way, only NTh elements of the current list and the elements of the other list have to be in cache for comparison. This reduces cache misses and therefore the running time, even more when the other list is small in size.

3.3 Summary

A parallelization of an algorithm to solve the bi-blending problem has been studied for a small-to-medium size instance of the problem. This single case illustrates the difficulties of this type of algorithm. Bi-blending increases the challenges of parallelizing a B&B algorithm compared to single blending, because it actually runs two B&B algorithms that share information. Additionally, in bi-blending algorithms, a combination of final simplices has to be done after the B&B phase to discard regions without a solution. This combination phase can be computationally several orders of magnitude more expensive than the B&B phase. Here we use just one thread for each product in the B&B phase and several threads for the combination phase.


CHAPTER 4

Simplicial Branch and Bound applied to Global Optimization

This chapter focuses on simplicial Branch and Bound (B&B) algorithms where the initial search space is an n-dimensional hyper-cube partitioned into a set of non-overlapping simplices. For some problems like mixture design, the search space is a regular simplex (see Chapter 2). Here, we focus on box-constrained problems. Three B&B rules will be analysed in this chapter: the Branching rule (Section 4.1.2), the Bounding rule (Section 4.1.3), and the Selection rule (Section 4.1.4).

Recent studies show an interesting improvement in the number of generated sub-simplices when a different heuristic is applied in the iterative refinement of a regular n-simplex [5, 4, 3]. In these studies, the complete binary tree is built by bisecting the longest edge of a sub-simplex until the width, determined by the length of the longest edge, is smaller than or equal to a given accuracy ε. This chapter studies the effect of similar heuristics in a B&B algorithm applied to solve the multidimensional Global Optimization (GO) problem, where the shape of the binary tree is also determined by the rejection rule using several values of the accuracy.

Two different bounding methods will be used in the experimentation. The aim is to obtain a general view of the performance of the Branching and Selection rules. In this way, the experimental results are less problem-specific. The effectiveness of each bounding rule is out of scope. Both bounding rules are based on structural information given for each instance: one constant is called the Lipschitz constant, and the other is called in this study the Baritompa (upper fitting) constant.

The B&B tree can be traversed in several ways. Several selection rules will be studied from a computational point of view, analysing the memory consumption and computational time. Some rules may seem less efficient than others regarding the number of generated nodes, but the data structure and the number of elements in main memory also play an important role in the performance of the code.


4.1 Simplicial B&B method for multidimensional GO

We focus on the multidimensional box-constrained GO problem. The goal is to find at least one global minimum point x* of

f* = f(x*) = min_{x∈X} f(x),    (4.1)

where the feasible area X ⊂ R^n is a nonempty box-constrained area, i.e., it has simple upper and lower bounds for each variable. One of the bounds is based on the concept of Lipschitz continuity. Function f is said to be Lipschitz-continuous on X if there exists a maximum slope L, called the Lipschitz constant, such that |f(x) − f(y)| ≤ L‖x − y‖, ∀x, y ∈ X, see [49]. The norm ‖ · ‖ is usually taken as Euclidean. However, the generation of bounds also allows other norms. A possible application of Lipschitzian optimization is the issue of finding feasible mixture designs [39]. Other applications can be found in [78]. Another bound used is based on observations of Bill Baritompa [7], where the function is required to be neither differentiable nor (Lipschitz) continuous. It uses a constant K that we will call a Baritompa constant.

We will see in this section how one can subdivide the search space and derive simple bounds on the simplicial subsets. All ingredients are then collected into an algorithm.

Algorithm 4.1 Simplicial B&B algorithm, bisection

Require: X, f, K, δ
1: Partition X into simplices S_k, k = 1, . . . , n!
2: Start the working set as Λ := {S_k : k = 1, . . . , n!}
3: The set of evaluated vertices V := {v_i ∈ S_k ∈ Λ}
4: Set f^U := min_{v∈V} f(v) and x^U := arg min_{v∈V} f(v)
5: Determine lower bounds f^L_k = f^L(S_k) based on K
6: while Λ ≠ ∅ do
7:   Extract a simplex S from Λ
8:   Bisect S into S_1 and S_2, generating x
9:   if x ∉ V then
10:    Add x to V
11:    if f(x) < f^U then
12:      Set f^U := f(x) and x^U := x
13:      Remove all S_k from Λ with f^L_k > f^U − δ
14:    end if
15:  end if
16:  Determine lower bounds f^L(S_1) and f^L(S_2)
17:  Store S_1 in Λ if f^L(S_1) ≤ f^U − δ
18:  Store S_2 in Λ if f^L(S_2) ≤ f^U − δ
19: end while
20: return x^U, f^U

Algorithm 4.1 gives an overview of the B&B method to solve the multidimensional GO problem. It explicitly uses simplicial partition sets that are bisected over the longest edge to generate new points to be evaluated. It guarantees to find a δ-approximation x^U of the minimum point x* such that f(x^U) ≤ f* + δ.


Figure 4.1: Division of a hypercube into six irregular simplices

4.1.1 Initial space

In GO, a feasible region is usually box-constrained, i.e., the feasible region is a hyper-rectangle. Therefore, most B&B methods use hyper-rectangular partitions. However, other types of partitions may be more suitable for some optimization problems. Compared to the use of rectangular partitions, simplicial partitions are convenient when the feasible region is a polytope [75]. Optimization problems with linear constraints are examples where feasible regions are polytopes which can be vertex triangulated.

For the use of simplicial partitions, the feasible region is partitioned into simplices. There are two methods: over-covering and face-to-face vertex triangulation. The first strategy covers the hyper-rectangle by one simplex, which can be a regular one. The disadvantage of this method is that the search space is bigger than the feasible area, and some regions can be outside the function definition. The most preferable initial covering is face-to-face vertex triangulation. It involves partitioning the feasible region into a finite number of n-dimensional simplices with vertices that are also vertices of the feasible region. A standard method is triangulation into n! simplices [87]. All simplices share the diagonal of the feasible region and have the same hyper-volume. Figure 4.1 depicts a hypercube of dimension three partitioned into six irregular simplices.
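This standard triangulation is easy to generate: each permutation of the coordinate axes yields one simplex whose vertices walk from the lower to the upper corner of the cube along that permutation, so all n! simplices share the main diagonal. A runnable C sketch for the unit cube (scaling to a general box is an affine map; the names are illustrative):

#include <stdio.h>

#define N 3  /* dimension; the cube is split into N! = 6 simplices */

/* Emit the simplex of a permutation: v_0 = 0 and v_k = v_{k-1} + e_{perm[k-1]},
 * so every simplex contains the diagonal from (0,...,0) to (1,...,1). */
static void emit_simplex(const int perm[N])
{
    int v[N] = {0};
    printf("simplex:");
    for (int k = 0; k <= N; k++) {
        printf(" (");
        for (int i = 0; i < N; i++) printf(i ? ",%d" : "%d", v[i]);
        printf(")");
        if (k < N) v[perm[k]] = 1;  /* step along axis perm[k] */
    }
    printf("\n");
}

/* Enumerate all permutations of the axes (simple recursive scheme). */
static void permute(int perm[N], int used, int depth)
{
    if (depth == N) { emit_simplex(perm); return; }
    for (int i = 0; i < N; i++)
        if (!(used & (1 << i))) {
            perm[depth] = i;
            permute(perm, used | (1 << i), depth + 1);
        }
}

int main(void)
{
    int perm[N];
    permute(perm, 0, 0);  /* prints the six simplices of Figure 4.1 */
    return 0;
}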

4.1.2 Branching rule

Literature discusses many methods to subdivide a simplex [49]. One of them is Longest Edge Bisection (LEB), which is a popular way of iterative division in the finite element method, since it is very simple and can easily be applied in higher dimensions [37]. This method consists of splitting a simplex using the hyperplane that connects the middle point of the longest edge of the simplex with the opposite vertices. This is illustrated in Figure 4.2, which also shows that, in higher dimensions, there can exist several longest edges. For our study, we should notice that, due to the initial partition as sketched in Figure 4.1, for n = 3 the longest edge is unique in all generated subsets. This means that, to observe what the choice of the longest edge does to the search tree, we should focus on dimensions higher than 3. We formulate several rules to select the longest edge. The most common edge selection rule in LEB is the following:

Figure 4.2: Longest Edge Bisection, denoting a sub-simplex with three longest edges

LEB1: Natural coding implicitly selects a longest edge, being the first one found. The sequence depends on the coding and storing of the vertices and edges, i.e., the index number assigned to each vertex of the simplex. When a simplex is split into two new sub-simplices, the new vertex of each sub-simplex has the same index as the one it substitutes.

Our preliminary studies show the existence of many sub-simplices having more than one longest edge when LEB is used as the iterative partition rule in a simplicial B&B algorithm.

In order to reduce the search tree size, other heuristics for selecting the longest edge in the division of a regular n-simplex are investigated for use in simplicial B&B algorithms. They are summarized below:

LEBα: For each vertex in a longest edge, the sum of the angles between edges ending at that vertex is determined, and the longest edge corresponding to the smallest sum is selected. Example 1 shows the application of this rule to a 3-simplex.

Example 1 For a tetrahedron with vertices v_i, i = 1, 2, 3, 4, the value of the edge v_2 − v_1 is given by the sum of the angles at vertex v_1 (v_2v_1v_3, v_3v_1v_4, and v_2v_1v_4) and at vertex v_2 (v_1v_2v_3, v_1v_2v_4, and v_3v_2v_4), calculated using trigonometric identities.

LEBC: Bisects the longest edge with the largest distance from its middle point to the centroid of the simplex.

LEBM: Determines the distance from a longest edge midpoint to the other vertices. It then selects the longest edge that has the maximum sum of distances to the other vertices.

LEBW: Selects an edge that has not been involved in many bisections yet, via a weight system. The weights of the initial set of evaluated vertices (line 15 of Algorithm 4.1) are set to w_i := 0. A new vertex v_i (generated by the branching rule, line 20 of Algorithm 4.1) is initiated with weight w_i := 1. Each time vertex v_i belongs to a divided edge, its weight is updated to w_i := (w_i + 1) mod n. For each longest edge defined by vertices (v_i, v_j), the two weights (w_i, w_j) are summed and the edge with the smallest sum is selected.

One of the research questions in our investigation is to determine a LEB rule that minimizes the search tree generated by a simplicial B&B algorithm, measured as the total number of generated nodes (sub-simplices).
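As an illustration of such a rule, the following C sketch implements the LEBC selection described above: among all edges attaining the maximum length (up to a tolerance), it returns the one whose midpoint lies farthest from the centroid. The vertex layout and the tolerance are assumptions, not the thesis code:

#include <math.h>

#define N 4             /* dimension (example) */
#define NV (N + 1)      /* vertices of an n-simplex */

static double sqdist(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
}

/* LEB_C: among the longest edges (i, j) of simplex v, select the one whose
 * midpoint is farthest from the centroid c; the chosen vertex pair is
 * written to out_i/out_j. Ties in length are detected with a tolerance. */
void leb_c(const double v[NV][N], int *out_i, int *out_j)
{
    double c[N] = {0};
    for (int k = 0; k < NV; k++)
        for (int i = 0; i < N; i++) c[i] += v[k][i] / NV;

    double lmax = 0.0, best = -1.0;
    const double tol = 1e-12;
    for (int i = 0; i < NV; i++)
        for (int j = i + 1; j < NV; j++) {
            double l = sqdist(v[i], v[j]);
            if (l > lmax + tol) { lmax = l; best = -1.0; } /* new longest */
            if (fabs(l - lmax) <= tol) {                   /* tie candidate */
                double m[N], d;
                for (int k = 0; k < N; k++) m[k] = 0.5 * (v[i][k] + v[j][k]);
                d = sqdist(m, c);
                if (d > best) { best = d; *out_i = i; *out_j = j; }
            }
        }
}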


4.1.3 Bounding rule

The determination of a lower bound is an important computational step. Therefore, we study which calculations are involved.

Baritompa constants

Consider the objective function f with a global minimum f* on box-constrained area X. Given a global minimum point x*, let scalar K be such that

K ≥ max_{x∈X} |f(x) − f*| / ‖x − x*‖,    (4.2)

where ‖ · ‖ denotes the Euclidean norm. Although this is not essential, we will work with Euclidean distance. The function f* + K‖x − x*‖ is an upper fitting according to [7] for an arbitrary x ∈ X. Consider a set of evaluated points x_i ∈ X with function values f_i = f(x_i); then the area below

ϕ(x) = max_i {f_i − K‖x − x_i‖}    (4.3)

cannot contain the global minimum (x*, f*). Let f^U = min_i f_i be the best function value of all evaluated points, i.e., an upper bound of f*. Then the area {x ∈ X : ϕ(x) > f^U} cannot contain the global minimum point x*.

Now consider a simplex S with evaluated vertices v_0, v_1, . . . , v_n, where f_i = f(v_i). To determine the existence of an optimal solution x* in S, each evaluated vertex (v_i, f_i) provides a cutting cone:

ϕ_i(x) := f_i − K‖x − v_i‖.    (4.4)

Let Φ be defined by

Φ(S) = min_{x∈S} max_i ϕ_i(x).    (4.5)

If f^U < Φ(S), then simplex S cannot contain the global minimum point x*, and therefore S can be rejected. Notice that Φ(S) is a lower bound of f* if S contains the minimum point x*.

Equation (4.5) is not easy to determine, as shown by Mladineo [67]. Therefore, alternative lower bounds of (4.5) can be generated in a faster way. We use two of them and take the best (highest) value.

An easy-to-evaluate case is to consider the best value of min_{x∈S} ϕ_i(x) over the vertices i. This results in a lower bound

Φ_1(S) = max_i {f_i − K max_j ‖v_j − v_i‖}.    (4.6)

The second lower bound is based on the more elaborate analysis of infeasibility spheres in [13], developed into non-optimality spheres in [40]. It says that S cannot contain an optimal point if it is covered completely by so-called non-optimality spheres. According to [13], if there exists a point c ∈ S such that

f_i − K‖c − v_i‖ > f^U,    i = 0, . . . , n,    (4.7)

then S is completely covered and cannot contain x*. This means that any interior point c of S provides a lower bound min_i {f_i − K‖c − v_i‖}. Instead of trying to optimize the


lower bound over c, we generate an easy-to-produce weighted average based on the radii of the spheres. Consider that f_i > f^U; otherwise S can contain an optimum point. Let

λ_i = K / (f_i − f^U),    (4.8)

and take

c = (1 / Σ_j λ_j) Σ_i λ_i v_i.    (4.9)

A second lower bound based on (4.7) is

Φ_2(S) = min_i {f_i − K‖c − v_i‖}.    (4.10)

The final lower bound we consider for the B&B is the best value f^L(S) = max{Φ_1(S), Φ_2(S)}.
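The computation of f^L(S) from (4.6)–(4.10) involves only pairwise vertex distances; a C sketch under an assumed array layout (not the thesis code):

#include <math.h>

#define N 4            /* dimension (example) */
#define NV (N + 1)

static double dist(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) { double d = a[i] - b[i]; s += d * d; }
    return sqrt(s);
}

/* Lower bound f^L(S) = max{Phi_1(S), Phi_2(S)} from (4.6) and (4.10), given
 * the vertices v, their values f, the constant K and the incumbent fU. */
double lower_bound_K(const double v[NV][N], const double f[NV],
                     double K, double fU)
{
    /* Phi_1 (4.6): best vertex bound over the farthest co-vertex. */
    double phi1 = -HUGE_VAL;
    for (int i = 0; i < NV; i++) {
        double dmax = 0.0;
        for (int j = 0; j < NV; j++) {
            double d = dist(v[i], v[j]);
            if (d > dmax) dmax = d;
        }
        double b = f[i] - K * dmax;
        if (b > phi1) phi1 = b;
    }

    /* Phi_2 (4.10): weighted point c of (4.8)-(4.9); it needs f_i > fU. */
    double phi2 = -HUGE_VAL;
    int ok = 1;
    for (int i = 0; i < NV; i++)
        if (f[i] <= fU) ok = 0;   /* S may contain an optimum: skip Phi_2 */
    if (ok) {
        double lam[NV], sum = 0.0, c[N] = {0};
        for (int i = 0; i < NV; i++) { lam[i] = K / (f[i] - fU); sum += lam[i]; }
        for (int i = 0; i < NV; i++)
            for (int k = 0; k < N; k++) c[k] += lam[i] * v[i][k] / sum;
        phi2 = HUGE_VAL;
        for (int i = 0; i < NV; i++) {
            double b = f[i] - K * dist(c, v[i]);
            if (b < phi2) phi2 = b;   /* min over the vertices */
        }
    }
    return phi1 > phi2 ? phi1 : phi2;
}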

Lipschitz constants

Let v_0, v_1, . . . , v_n be the vertices of a simplex S. In our study, the Lipschitz constants are given a priori. As in [74], we may consider the maximum slope of a differentiable function based on several norms: L = max_{x∈X} ‖∇f(x)‖_2, L_1 = max_{x∈X} ‖∇f(x)‖_1, or L_∞ = max_{x∈X} ‖∇f(x)‖_∞. The basis of the lower bound is that for each vertex v_i we have a lower bounding function

ϕ_i(x) := f(v_i) − L‖x − v_i‖ ≤ f(x).

For the multi-dimensional case, researchers looked for the easy computational determination of

min_{x∈S} max_i ϕ_i(x),    (4.11)

where the objective function is in general neither convex nor concave. In [67], a formalisation of (4.11) is given for the Euclidean norm, and it is shown that (4.11) is not easy to determine. This result made [49] claim that Lipschitz optimization “does not look very practical”. However, [61] shows that if we use the 1-norm on a box-shaped partition set, i.e.,

ϕ_i(x) := f(v_i) − L_∞‖x − v_i‖_1,

then (4.11) can be formulated as a linear program. Far more recently, the elaboration for the 1-norm was extended to simplicial partition sets in [73], where the determination of the lower bound

LB_B(S) = min_{x∈S} max_i (f(v_i) − L_∞‖x − v_i‖_1)

implies the solution of a specific set of linear equations in n variables. Following [74], we combine this bound with the bounds

LB_F(S) = max_i (f(v_i) − L_1 max_j ‖v_j − v_i‖_∞)

and

LB_E(S) = max_i (f(v_i) − L max_j ‖v_j − v_i‖_2).

Summarizing, computationally the determination of the lower bound

LB(S) = max{LB_B(S), LB_F(S), LB_E(S)}

uses the function values in the vertices of S, determines the lengths of all edges according to the 1-norm, 2-norm and ∞-norm, solves a set of n linear equations, and finally compares values.
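The two vertex-based bounds translate directly into code; LB_B(S), which requires solving the set of linear equations, is omitted from this C sketch, and the array layout is an assumption:

#include <math.h>

#define N 4            /* dimension (example) */
#define NV (N + 1)

/* Distances used by the vertex bounds: 2-norm and infinity-norm. */
static double dist2(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) { double d = a[i] - b[i]; s += d * d; }
    return sqrt(s);
}
static double distinf(const double *a, const double *b)
{
    double m = 0.0;
    for (int i = 0; i < N; i++) {
        double d = fabs(a[i] - b[i]);
        if (d > m) m = d;
    }
    return m;
}

/* Vertex-based Lipschitz bounds LB_F and LB_E of Section 4.1.3; L1 and L2
 * are the Lipschitz constants for the corresponding gradient norms. */
double lower_bound_L(const double v[NV][N], const double f[NV],
                     double L1, double L2)
{
    double lbF = -HUGE_VAL, lbE = -HUGE_VAL;
    for (int i = 0; i < NV; i++) {
        double dmaxInf = 0.0, dmax2 = 0.0;
        for (int j = 0; j < NV; j++) {
            double di = distinf(v[i], v[j]), d2 = dist2(v[i], v[j]);
            if (di > dmaxInf) dmaxInf = di;
            if (d2 > dmax2)   dmax2   = d2;
        }
        double bF = f[i] - L1 * dmaxInf;  /* term of LB_F */
        double bE = f[i] - L2 * dmax2;    /* term of LB_E */
        if (bF > lbF) lbF = bF;
        if (bE > lbE) lbE = bE;
    }
    return lbF > lbE ? lbF : lbE;         /* max{LB_F, LB_E} */
}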


4.1.4 Selection rule

Among the selection rules discussed in Section 1.2.3, we will empirically study two selection methods: depth-first search and a hybrid search. A wide study on selection strategies for this type of algorithm can be found in [76].

Depth-first search selects the best simplex among those generated in the last division, until both generated simplices are rejected. In the latter case, the method extracts a simplex from the LIFO (Last In First Out) stack.

Hybrid search is a combination of depth-first and best-first search. The best simplex, i.e., the one with the lowest bound, is selected from the working set, and a depth-first search is done by selecting the best simplex among those generated in the last division, until no further division is possible. In general, depth-first search is used to reduce the memory requirement of the algorithm.

4.1.5 Rejection rule

A simplex S is rejected if

f^L(S) > f^U − δ.

4.2 Evaluation

Two sets of test functions will be used for experimentally analysing the performance obtained by changing the Branching and the Selection rules. The first set is solved using the bounding rule based on Baritompa constants; the second set is solved using Lipschitz constants. Both bounding rules have been described in Section 4.1.3. Each test function is referenced by an identifier (Id.) composed of a letter and a number. The letter indicates the set to which the function belongs (‘B’ for Baritompa and ‘L’ for Lipschitz).

The first set has been built to measure the tree generated by the set of LEB strategies discussed in the previous section. We will study whether the dimensionality of the problem plays a role in the performance of the LEB strategy used. A complete suite of test functions can be found in [52]. From this set, we selected a subset of functions that allow varying the dimension of the problem. We remind the reader that the often-used low-dimensional instances are not appropriate to measure the difference of the generated tree, as for dimensions n ≤ 3 there is no choice of the longest edge to be bisected.

For each test function, at least one global minimum point is known, and we determined the sharpest value of parameter K in Equation (4.2) that we could find using a multistart approach. For instances like MaxMod or Zakharov, a value for K can be determined analytically. The data of the corresponding test-bed are given in Table 4.1. A description of the test instances with the considered minimum point and the function range [f*, f] on the given domain is provided in Appendix B, where f is the maximum function value on the domain. The depth of the generated B&B tree is mainly determined by the accuracy δ. To obtain trees of reasonable size, in the experiments the value of the accuracy is set to δ = 0.05(f − f*).

The second set of problems used for the evaluation is taken from [75]. Table 4.2 summarizes the most important characteristics of each problem. More details of this set are provided in Appendix B.

For the test-bed using Lipschitz constants, four-dimensional problems have been solved with an accuracy of δ = (1/4)L_2; δ = (3/4)L_2 has been used for five-dimensional instances and δ = (6/4)L_2 for the six-dimensional instance.


Table 4.1: Test instances for dimension n = 4, 5, 6, and the corresponding Kn values

Id.   Test problem    K_4         K_5            K_6              Domain

B01   Ackley          5.4         4.9            4.4              [−30, 30]^n
B02   Dixon & Price   21,646.5    29,086.1       37,248.8         [−10, 10]^n
B03   Holzman         5,196.2     7,000.0        9,000.0          [−10, 10]^n
B04   MaxMod          1.0         1.0            1.0              [−10, 10]^n
B05   Perm            183,998.1   31,159,684.8   7,746,536,437.2  [−n, n]^n
B06   Pinter          60.0        75.3           88.7             [−10, 10]^n
B07   Quintic         29,712.0    33,219.0       36,389.7         [−10, 10]^n
B08   Rastrigin       91.8        102.7          112.5            [−5.12, 5.12]^n
B09   Rosenbrock      168,005.3   190,517.7      210,669.4        [−5, 10]^n
B10   Schwefel 1.2    176.8       176.8          444.6            [−10, 10]^n
B11   Zakharov        312,645.0   1,415,285.7    4,962,758.1      [−5, 10]^n

Table 4.2: Test functions using Lipschitz bounds

Id.   Test problem   n   L_1       L_2       L_∞      Domain

L01   Levy No. 15    4   1,273.2   1,196.4   1,195.5  [−10, 10]^4
L02   Shekel 5       4   204.1     102.4     56.1     [0, 10]^4
L03   Shekel 7       4   300.1     151.5     86.1     [0, 10]^4
L04   Shekel 10      4   408.2     204.5     110.8    [0, 10]^4
L05   Schwefel 1.2   4   600       313.69    200      [−5, 10]^4
L06   Powell         4   92,216    48,251.5  29,270   [−4, 5]^4
L07   Levy No. 9     4   26.1      14.4      8.3      [−10, 10]^4
L08   Levy No. 16    5   422.93    370.66    369.68   [−5, 5]^5
L09   Levy No. 10    5   34.375    16.56     8.25     [−10, 10]^5
L10   Levy No. 17    6   421.7     358.6     357.4    [−5, 5]^6

4.2.1 Comparison of selection strategies

For each test problem, Table 4.3 shows the computational burden in terms of the number of evaluated simplices, the maximum size of the working set, and the wall-clock time. Notice that the working set is an AVL tree when the selection rule is hybrid (best-depth) search, and a stack when the selection rule is depth-first search. The accuracy used is low for demonstration purposes. Instances solved using Baritompa constants have a dimensionality of n = 5. Instance B08 is not considered here due to its huge workload compared with the rest of the problems.

If the hybrid search is used, the number of elements stored in the tree is huge compared to the number of elements stored using depth-first search. The computational difficulty of managing a tree with millions of elements results in an increased execution time. The number of evaluated simplices in both strategies is similar. As explained in Chapter 1, the difference is caused by the different moment at which a sharp upper bound of the minimum objective function value is found.


Table 4.3: Comparison between hybrid and depth-first search

      Hybrid search                           Depth-first search
Id.   N. Eval. S.    Max. size    Time        N. Eval. S.    Max. size  Time

B01   1,010,945,400  89,444,760   1,370       1,010,945,400  147        1,152
B02   123,575,854    9,185,978    168         123,575,850    141        95
B03   219,996,634    18,238,445   332         219,996,634    142        239
B04   1,877,094,680  127,939,132  1,783       1,877,094,680  144        1,378
B05   166,831,436    13,206,138   565         166,831,502    140        442
B06   989,052,844    62,290,233   1,790       989,052,844    146        1,359
B07   261,104,286    21,716,163   248         261,009,818    143        203
B09   175,616,936    14,646,341   271         175,613,436    142        135
B10   87,628,502     6,809,172    97          87,628,502     142        67
B11   603,676,348    52,434,838   1,004       603,678,276    139        466

L01   893,432,408    73,249,122   1,592       893,936,704    48         872
L02   169,766,312    16,039,956   181         173,657,364    46         141
L03   293,153,214    33,771,421   385         295,477,634    46         242
L04   293,864,708    31,365,461   390         295,300,328    46         249
L05   189,373,530    19,080,336   227         193,891,104    45         153
L06   141,117,980    13,301,723   186         141,147,256    44         111
L07   19,356,694     1,692,164    26          19,460,336     46         18
L08   217,902,776    18,824,739   622         217,880,340    140        483
L09   122,957,954    10,577,950   338         122,890,838    142        262
L10   322,972,838    22,439,130   2,857       322,865,034    740        2,631

4.2.2 Comparison of the LEB strategies

Based on the performance results of the previous section, depth-first search is used from now on. For the Lipschitz test-bed, we sharpened the accuracy used in the experimental results to δ = (1/8)L_2 for four-dimensional instances, δ = (3/8)L_2 for five-dimensional instances and δ = (12/8)L_2 for the six-dimensional instance. Regarding the Baritompa test-bed, several six-dimensional instances are not considered, because the computational burden makes the execution impractical due to the long wall-clock time.

Table 4.4 shows the numerical results obtained when LEB1 is applied. The computational effort is captured in terms of the number of generated and evaluated simplices in the corresponding B&B tree; see lines 17 and 28 of Algorithm 4.1. The data in column NLE represent the percentage of the divided sub-simplices that contain two or more longest edges when the LEB1 rule is used. The number of simplices having more than one longest edge is large, above 55% except for the B08 instance. This percentage induces the research question of which longest edge must be divided in order to obtain the most efficient LE selection strategy (see Section 1.5). Here, dimensions 4, 5, and 6 are considered. Experimental results for three-dimensional problems are not shown in the table, because the number of evaluated simplices is the same for all LEB rules.



Table 4.4: Experimental results using LEB1

      Dimension 4           Dimension 5             Dimension 6
Id.   N. Eval. S.   NLE     N. Eval. S.     NLE     N. Eval. S.     NLE

B01   8,467,608     71%     1,010,945,400   67%     –               –
B02   2,312,132     66%     123,575,850     60%     8,203,852,060   82%
B03   3,419,030     67%     219,996,634     61%     16,628,924,978  80%
B04   5,742,672     63%     1,877,094,680   55%     –               –
B05   39,987,438    66%     166,831,502     69%     –               –
B06   4,896,640     69%     989,052,844     62%     –               –
B07   2,527,376     66%     261,009,818     59%     24,424,636,700  81%
B08   117,620,808   18%     23,085,565,464  19%     –               –
B09   2,806,400     69%     175,613,436     64%     11,689,814,082  79%
B10   1,857,322     69%     87,628,502      65%     5,154,080,906   81%
B11   5,299,018     72%     603,678,276     68%     –               –

L01   5,593,162,238  59%
L02   619,816,602    62%
L03   1,312,244,486  59%
L04   1,331,168,300  59%
L05   1,363,992,350  60%
L06   1,568,043,214  61%
L07   79,541,580     60%
L08   3,688,449,152  59%
L09   1,051,441,204  56%
L10   322,865,034    74%

Table 4.5 shows the numerical results for a search domain defined in a four-dimensional space, providing the wall-clock time and the reduction with respect to LEB1 achieved by the remaining rules, expressed as a percentage:

(NS_1 − NS_x) / NS_1 × 100,    (4.12)

where NS_1 is the number of evaluated simplices using LEB1 and NS_x is the same using the LEBx strategy. Rules LEBα, LEBC, and LEBM provide higher reductions than LEBW, which in some cases performs more simplex evaluations than LEB1. The interesting aspect is that selection rules LEBC and LEBM are easier to generate than selection rule LEBα in terms of computational operations, although this is not well reflected due to the low computational times.

Table 4.6 contains the results for dimension n = 5. In this case, LEBC and LEBM provide a higher reduction than the rest of the rules. LEBα shows smaller reductions compared to the LEBC and LEBM rules, in contrast to dimension n = 4 reported in Table 4.5. In terms of computational time, the reduction of the number of evaluated simplices for LEBα (less than or equal to 20%) is not reflected in the execution time, due to the overhead caused by the branching rule. The rest of the rules show a reduction in the execution time.



Table 4.5: Experimental results for n = 4 using K-bounds

Id.   Time1  Timeα  LEBα  TimeC  LEBC  TimeM  LEBM  TimeW  LEBW

B01   30     25     35%   21     35%   22     35%   31     −0%
B02   1      2      28%   1      28%   1      28%   1      5%
B03   3      4      25%   2      25%   2      25%   3      2%
B04   3      3      54%   1      54%   2      54%   3      5%
B05   64     62     31%   46     31%   46     31%   66     −0%
B06   5      6      26%   5      26%   4      26%   6      −1%
B07   2      3      29%   2      29%   2      29%   2      6%
B08   78     98     13%   72     14%   70     13%   80     1%
B09   2      3      27%   1      27%   1      27%   2      −1%
B10   1      2      30%   1      30%   1      30%   1      3%
B11   3      5      30%   2      30%   2      30%   3      −3%

Table 4.7 provides numerical results for a subset of the test instances that could be run within a reasonable computation time for dimension n = 6. One can observe that the rules follow a similar behaviour compared to Table 4.6. LEBα is not practical, because the execution time is tripled in some instances. Having the same reduction in terms of the number of evaluated simplices, LEBC shows a smaller execution time than LEBM, since LEBM performs more operations than LEBC. For six-dimensional cases, this computational overhead is reflected in the execution time.

A side result, not related to our original research question, is the effectiveness of the bounding rule. We measured that the more sophisticated bound Φ_2 is less effective than the simpler lower bound Φ_1 for 79% of the evaluated sub-simplices.

Numerical results obtained from the Lipschitz test-bed are shown in Table 4.8. They show that the number of evaluated simplices using LEBα, LEBC or LEBM is considerably reduced with respect to LEB1, by up to 69% for the L01 instance. LEBα causes a computational overhead in the evaluation of the branching rule. Nevertheless, this overhead is compensated by the significant reduction of the number of evaluated simplices in most of the cases, and thus the execution time is smaller. Test problems with dimension n > 4 experience only a slight reduction in the number of evaluated simplices, not enough to improve the execution time of the algorithm if LEBα is used. LEBW provides a low effectiveness compared to the rest of the rules, sometimes performing as badly as LEB1.

4.3 Summary

Every rule in the B&B scheme plays an important role in terms of efficiency. Both selection and branching rules have been studied in this chapter. The selection rule influences the memory requirement of the algorithm. Depth-first search is the strategy that requires the least memory space. Only in a few cases is the number of evaluated simplices using this selection rule slightly greater than using the hybrid selection rule, a combination of the best-first and depth-first search strategies. Having in mind that the evaluation of a simplex requires few calculations for this type of problems, depth-first search is appropriate.

Selecting the correct edge when the simplex is bisected leads to a reduction of the number of evaluated simplices, as the experimental results of this work show.


Table 4.6: Experimental results for n = 5 using K-bounds

Id.   Time1    Timeα   LEBα   TimeC   LEBC   TimeM   LEBM   TimeW   LEBW
B01    1,152   2,077    20%     919    26%     918    26%   1,121     7%
B02       95     201    17%      78    24%      79    25%      86    13%
B03      239     417    18%     194    24%     194    24%     220    11%
B04    1,378   2,979    17%   1,213    20%   1,356    20%   1,383     3%
B05      442     569    16%     372    18%     373    19%     436     3%
B06    1,359   2,265    15%   1,153    20%   1,151    20%   1,337     4%
B07      203     423    16%     166    25%     167    25%     182    14%
B08   20,487  26,914    11%  18,072    18%  17,483    18%  19,230     8%
B09      135     311    16%     116    22%     117    22%     131     7%
B10       67     145    19%      56    25%      56    25%      63    10%
B11      466   1,137    13%     421    18%     426    18%     467     5%

Table 4.7: Experimental results for n = 6 using K-bounds

Id.   Time1    Timeα   LEBα   TimeC   LEBC   TimeM   LEBM   TimeW   LEBW
B02    8,514  26,135    18%   6,832    27%   7,197    27%   8,512     6%
B03   23,414  57,683    18%  18,610    26%  19,271    26%  23,166     6%
B07   25,655  79,506    17%  20,897    26%  22,078    26%  26,168     4%
B09   12,123  36,775    17%  10,126    24%  10,565    24%  12,049     6%
B10    5,369  15,995    21%   4,297    28%   4,520    28%   5,386     5%

One of the questions in this chapter is how different heuristics to select the longest edge in simplicial B&B algorithms may influence the size of the generated search tree. To investigate this question, a simplicial B&B algorithm has been used where non-optimal area is cut away via the concept of an upper fitting.

Evaluating five LEB heuristics on two sets of test instances in dimensions n = 4, 5, 6 shows that the rules called LEBC and LEBM give the best performance for all the dimensions. The search tree can be reduced by up to about 69% compared to the rule that selects the first longest edge as stored in an implementation of the simplicial B&B.


Table 4.8: Experimental results using Lipschitz bounds

Id.   Time1    Timeα   LEBα   TimeC   LEBC   TimeM   LEBM   TimeW   LEBW
L01    5,363   2,953    69%   1,761    69%   1,804    69%   5,566     1%
L02      497     638    34%     358    34%     362    34%     543    −4%
L03    1,081     751    63%     428    63%     432    63%   1,192    −6%
L04    1,121     789    62%     460    62%     463    62%   1,242    −5%
L05    1,079   1,102    46%     633    46%     632    46%   1,149    −2%
L06    1,227   1,248    46%     726    46%     724    46%   1,289     0%
L07       71      66    49%      40    49%      40    49%      71     5%
L08    8,082  11,094    14%   6,721    20%   6,739    20%   7,751     6%
L09    2,219   3,006    13%   1,816    21%   1,921    21%   2,051     9%
L10    2,631   2,743    22%   1,746    34%   1,764    34%   2,254    16%


CHAPTER 5

Parallelization of simplicial Branch and Bound

Branch and Bound (B&B) algorithms are known to exhibit an irregular search tree. Therefore, generating a parallel code is a challenge. The efficiency of a B&B algorithm depends on the chosen B&B rules, which are typically problem dependent. In frameworks with a high level of abstraction, the selection rule and the data management structures are usually hidden from the programmer, as is the load balancing strategy when the algorithm is run in parallel.

There are several ways to design parallel B&B algorithms. Specifically, we are interested in how to apply parallelization to the multidimensional Global Optimization problem described in Chapter 4. Running B&B in parallel involves the choice of the most appropriate programming language, library or skeleton to solve a problem in parallel using this algorithm design paradigm. Section 5.2 discusses the decisions taken to parallelize Algorithm 4.1 considering three shared-memory approaches: the Bobpp framework, the TBB library, and a customised POSIX threaded approach.

Section 5.3 investigates models that efficiently map B&B algorithms on a distributed computer architecture using the multidimensional Global Optimization case. We investigate how to run Algorithm 4.1 in parallel depending on the platform used, such that the implementation is efficient. A combination of MPI and Pthreads will be studied: MPI for distributed computation (inter-node) and Pthreads for multi-core computation (intra-node). That model adapts the algorithm to the characteristics of the architecture at hand with an increasing number of nodes.

In the literature, B&B algorithms coded in MPI or OpenMP discard not only the broadcasting of the best found upper bound of the solution, due to its high cost/performance ratio, but also the use of dynamic load balancing [77]. In this chapter, we check how broadcasting the upper bound affects the performance of the developed MPI-Pthreads approach. Dynamic load balancing is performed in the intra-node space through dynamic generation of threads.



Two load balancing strategies will be considered at the inter-node level: static and dynamic load balancing. Experimental results show which designs perform better on which type of instances for the computational architecture used. Results show performance improvements compared to the OpenMP and MPI versions used in previous works [77].

5.1 Branch and Bound in parallel

Parallelizations of B&B algorithms have been widely studied for a large number of applications and machine architectures. A highly-cited survey was performed in 1994 by Gendron and Crainic [30]. Parallel B&B algorithms can be classified based on several characteristics [30]. One of them is the working set, distinguishing between single and multiple working sets. Multiple working sets seem to be appropriate when the number of processing units is large, because each process can work independently on its working set, avoiding the bottleneck in the concurrent access to a shared single set. A hybrid approach may use multiple working sets, where each set can be associated with several processes. The number of possible choices may be large.

In the literature, one can find several strategies to parallelize a B&B algorithm. The most popular ones are enumerated as follows:

1. The evaluation of a node can be performed in parallel.
2. The nodes of the search tree can be processed in parallel, i.e., the search tree is built in parallel.
3. A combination of strategies 1 and 2.

Strategy 1 is suitable for problems where the computational burden of evaluating a node is high. On the other hand, strategy 2 seems appropriate when the number of generated nodes is high and the difficulty of evaluating a node is low. The Global Optimization problem considered here belongs to the latter class of problems (see Chapter 4). Hence, the parallel version of the B&B algorithm investigated here consists of building the B&B tree in parallel by performing operations on several subproblems simultaneously. Other possible approaches can be found in [30].

There are two design schemes in parallel B&B: the master-slave paradigm and the distributed B&B. In the first approach, a master process controls the working set of live subproblems and stores the upper bound, and the slave processes perform the evaluation of the subproblems. When the search space is irregular, load balance problems usually appear and they may deteriorate performance as the number of working processes increases. B&B approaches usually perform a load balancing strategy to improve their efficiency.

Many frameworks have been proposed since 1994 to facilitate the development of parallel B&B algorithms, such as PPBB, ZRAM, BOB, SYMPHONY, PICO, PeBBL, ALPS, BOBPP, OOBB, and MallBA, to name a few. These frameworks can be classified according to the provided methods (B&B, branch and cut, dynamic programming, etc.) and the techniques used to code the algorithm (programming language, parallelization libraries, etc.). An overview of some of these frameworks can be found in [19]. The use of a framework is not always the best approach in terms of efficiency. A framework offers a general skeleton to code a specific instance of the algorithm. In some cases, the developer is not concerned with aspects like the data structure or the way in which the search is performed. In general, it is difficult for a framework to provide the best performance compared to a custom-developed algorithm, because some characteristics of the problem are specific and are not taken into account by the framework.


B&B methods are more efficient on multi-core than on GPU or FPGA systems for problems with few arithmetic computations and challenging memory handling due to the size of the search tree [25]. We focus our research on clusters of symmetric multiprocessors due to their availability in High Performance Computing environments and because new computing nodes can be added easily.

5.2 Shared-memory models

We consider three approaches with different levels of abstraction. The first one is based on the Bobpp framework, the second makes use of the Threading Building Blocks library, and the last one is a Pthread-based model.

5.2.1 Bobpp framework

Bobpp (pronounced “bob plus plus”) is a C++ framework that facilitates the creation of sequential and parallel solvers based on tree search algorithms, such as Divide-and-Conquer and B&B [22]. The framework has been used to solve Quadratic Assignment Problems [29], Constraint Programming problems [63, 64, 65], and biochemical simulation [2]. The possible parallelizations are based on the Pthreads, MPI or Athapascan/Kaapi libraries. Here, we focus on the behaviour of the B&B algorithms for sequential and threaded versions using Pthreads.

Bobpp provides a set of C++ templates or skeletons on top of which the user has to code some classes in order to configure the behaviour of the algorithm. These templates are aimed at facilitating the development of search algorithms over irregular and dynamic data structures. The developer may use the example classes provided in the framework, but may also reimplement these classes to code a more specific algorithm. The classes are summarized below:

• The Instance class stores all global data of the problem. These data must be initialized when the execution starts, because the data are read-only during the search. This class contains a method to define the initial search space before the search starts.
• The Node class enables the storage of the data and contains the methods associated to a node (subspace of the search space).
• The GenChild class contains the division method.
• The Algo class contains the execution code of the main loop of the algorithm.

The Bobpp framework allows different ways to schedule the search (depth-first search, best-first search, etc.) and provides templates for different data structures. For instance, our B&B implementation selects depth-first search and relies on a priority queue that serves as a working set storing the simplices that have to be processed. Using one global set, one of the threads extracts a node from the set. Then, the node is divided according to the branching rule, generating two or more children. In case the termination rule is not satisfied, the generated nodes are evaluated and added to the set. Otherwise, the nodes are discarded. In order to avoid the bottleneck caused by accessing the same set, several sets (priority queues) can be used. The dynamic load balancing strategy followed in Bobpp relies on work-stealing among sets in order to achieve better performance. The optimal number of sets is difficult to predict, because it depends on the problem characteristics and the system resources.
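The extract-branch-evaluate loop just described, common to all implementations in this chapter, can be made concrete with a small self-contained example. The following sketch is our illustration (not Bobpp code): it replaces simplices by one-dimensional intervals and uses a Lipschitz bound, with the objective function, the constant L and the accuracy chosen arbitrarily.

#include <cmath>
#include <cstdio>
#include <stack>

int main() {
    // Objective function and an (over)estimated Lipschitz constant; both are
    // arbitrary example choices.
    auto f = [](double x) { return std::sin(x) + 0.1 * x; };
    const double L   = 1.1;
    const double eps = 1e-4;                      // accuracy of the final bound
    double fU = f(0.0);                           // best upper bound found so far

    std::stack<std::pair<double, double>> pool;   // working set (depth-first)
    pool.push({0.0, 10.0});                       // initial search region
    while (!pool.empty()) {
        auto [a, b] = pool.top(); pool.pop();     // selection rule
        double m = 0.5 * (a + b);
        fU = std::min(fU, f(m));                  // node evaluation improves fU
        double lb = f(m) - 0.5 * L * (b - a);     // lower bound on [a, b]
        if (lb > fU - eps) continue;              // elimination/termination rule
        pool.push({a, m});                        // branching rule: bisection
        pool.push({m, b});
    }
    std::printf("global minimum <= %.6f\n", fU);
}

The stack makes the selection rule depth-first; replacing it by a priority queue ordered by lower bound would turn the same loop into best-first search.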


5.2.2 TBB

The Intel Threading Building Blocks (TBB) library provides a framework to develop parallel applications in C++ [80]. TBB facilitates developing loop- and task-based algorithms with high performance and scalability, even for fine-grained and irregular applications. The TBB class task_group is a high-level interface that allows the creation of groups of potentially parallel tasks from functors or lambda expressions. The TBB class task is a low-level interface that allows for more control, but is less user-friendly.

Nodes in a B&B search tree seamlessly map onto tasks. Each thread keeps a “ready pool” of ready-to-run tasks. TBB features a work-stealing task scheduler that automatically takes care of the load balancing among pools. From the developer's point of view, using a task-based approach can be simpler than using a thread-based approach, because the user does not need to code the data structure to store and schedule the pending tasks. The developer only has to spawn the tasks, and the task scheduler decides when to execute them.

Potential parallelism is typically exploited by a split/join pattern. Two basic patterns of split/join are supported. The most efficient, but also the most demanding to program, is the continuation-passing pattern, in which the programmer constructs an explicit “continuation” task. The parent task creates child tasks and specifies a continuation task to be executed when the children complete. The continuation inherits the parent's ancestor. The parent task then exits; it does not block waiting for its children. The children subsequently run, and after they (or their continuations) finish, the continuation task starts running. This pattern has been used to develop the parallel version of the B&B algorithm described in Chapter 4.
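As a compact illustration of the split/join idea (ours, not the thesis code), the following sketch traverses a binary tree of nodes using the simpler task_group interface instead of the low-level continuation-passing pattern; Node, prune() and the depth cut-off are invented placeholders.

#include <atomic>
#include <tbb/task_group.h>

struct Node { double lo, hi; int depth; };

std::atomic<long> evaluated{0};

// Toy pruning rule standing in for the bounding/termination rules.
bool prune(const Node& n) { return n.depth >= 18; }

void process(const Node& n, tbb::task_group& tg) {
    ++evaluated;                       // "evaluate" the node
    if (prune(n)) return;
    double m = 0.5 * (n.lo + n.hi);
    Node left{n.lo, m, n.depth + 1};
    Node right{m, n.hi, n.depth + 1};
    // Split: spawn one child as a task and recurse on the other in this
    // thread, so the work-stealing scheduler always has stealable tasks.
    tg.run([left, &tg] { process(left, tg); });
    process(right, tg);
}

int main() {
    tbb::task_group tg;
    tg.run([&tg] { process(Node{0.0, 1.0, 0}, tg); });
    tg.wait();                         // join: returns when all tasks finished
    return evaluated.load() == (1 << 19) - 1 ? 0 : 1;
}

Spawning one child and recursing on the other keeps each worker deep in its own branch while leaving work for idle workers to steal, mirroring the depth-first/breadth-first trade-off discussed below.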

In addition to the productive programming interface and to the work-stealing load balancing, TBB features two additional advantages when it comes to parallel tree traversals. First, TBB exhibits a good trade-off between breadth-first and depth-first traversals of the tree. The former leverages parallelism, while the latter avoids too much memory consumption. TBB relies on breadth-first only when stealing work, but otherwise it is biased towards going deep in its branch until a cut-off criterion (not used in the experimentation) is met; in that way, the remaining sub-tree is processed sequentially (in order to avoid generating too many fine-grained tasks). Second, TBB can efficiently keep all the cores busy without oversubscribing them (oversubscription usually results in context-switching overhead, a cache-cooling penalty, and lock preemption and convoying). That is, in TBB only one thread/worker per core should be created, but the programmer is responsible for coding an algorithm that generates enough tasks to feed all the workers.

5.2.3 Pthreads

The Pthreads model permits dynamic load balancing through thread generation, facilitating the handling of irregular data structures such as those present in B&B algorithms [14, 83]. This model is based on the dynamic generation/destruction of threads with multiple sets [30]. Each thread handles its own working set. This strategy was used because it suffers less from memory contention problems than a single set, where the working set is shared by all threads [14].

The execution starts by creating one thread in charge of the complete search space. In this model, a thread can create a new thread if there is enough work to share, up to the maximum number of threads MaxThreads, defined by the user. The newly-generated thread receives half of the simplices stored in its parent [14]. The best upper bound fU is shared between the threads using a global shared variable. A thread dies when it ends its assigned work.
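The following self-contained toy program (ours, not the thesis code) condenses the scheme: integer “nodes” stand in for simplices, and the splitting and bounding rules are trivial stand-ins; the thread-management pattern is the point.

#include <pthread.h>
#include <cstdio>
#include <deque>

struct Work { std::deque<long> items; };   // stand-in for a set of simplices

const int MaxThreads = 8;
int       nThreads   = 1;                  // threads currently alive
long      fU         = 1L << 30;           // best upper bound found so far
pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  fin  = PTHREAD_COND_INITIALIZER;

void* worker(void* arg) {
    Work* my = static_cast<Work*>(arg);
    while (!my->items.empty()) {
        long node = my->items.back(); my->items.pop_back();
        pthread_mutex_lock(&mtx);
        if (node < fU) fU = node;          // toy "evaluation" improving fU
        pthread_mutex_unlock(&mtx);
        if (node > 1) {                    // toy branching: two children
            my->items.push_back(node / 2);
            my->items.push_back(node - node / 2);
        }
        pthread_mutex_lock(&mtx);
        bool spawn = nThreads < MaxThreads && my->items.size() > 1;
        if (spawn) ++nThreads;
        pthread_mutex_unlock(&mtx);
        if (spawn) {                       // hand half of the work to a new thread
            Work* half = new Work;
            for (size_t i = my->items.size() / 2; i > 0; --i) {
                half->items.push_back(my->items.front());
                my->items.pop_front();
            }
            pthread_t t;
            if (pthread_create(&t, nullptr, worker, half) == 0) pthread_detach(t);
            else {                         // creation failed: keep the work
                my->items.insert(my->items.end(), half->items.begin(), half->items.end());
                delete half;
                pthread_mutex_lock(&mtx); --nThreads; pthread_mutex_unlock(&mtx);
            }
        }
    }
    delete my;                             // a thread dies with its work done
    pthread_mutex_lock(&mtx);
    if (--nThreads == 0) pthread_cond_signal(&fin);
    pthread_mutex_unlock(&mtx);
    return nullptr;
}

int main() {
    Work* root = new Work;
    root->items.push_back(1L << 20);       // complete search space
    pthread_t t;
    pthread_create(&t, nullptr, worker, root);
    pthread_detach(t);
    pthread_mutex_lock(&mtx);
    while (nThreads > 0) pthread_cond_wait(&fin, &mtx);
    pthread_mutex_unlock(&mtx);
    std::printf("fU = %ld\n", fU);
}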



The time a core waits for a new thread is given by the time needed (i) for a thread to check that the number of threads is less than MaxThreads; (ii) to divide its working set; (iii) to create a new thread; and (iv) to migrate the new thread to an idle processing unit (or to a less loaded unit in case there are more threads than processing units). The migration of threads is done by the operating system, which is out of the scope of this study. Depth-first search using a stack has been used as the selection rule.

5.3 Message passing models

Every MPI process performs a sequential phase in which the feasible region is divided by face-to-face vertex triangulation. If the number of generated sub-simplices is less than the number of processes (n! < N), a sequential B&B process is initiated until the number of unexamined simplices reaches the number of MPI processes (|Λ| ≥ N) or a given limit value. Then, each MPI process selects the corresponding simplices it is in charge of. The broadcasting of the global upper bound fU is performed by an MPI message to the other MPI processes as soon as a new improved fU value is found by one of the threads in a node.
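A sketch (ours, not the thesis code) of how this asynchronous exchange of fU can be realized with nonblocking point-to-point messages; the tag value and the class and function names are invented for the illustration.

#include <mpi.h>
#include <vector>

// An improved upper bound is pushed to every peer with a nonblocking send;
// each process drains pending updates regularly inside its B&B loop.
struct FuSharer {
    int rank, size;
    std::vector<double>      bufs;   // one persistent send buffer per peer
    std::vector<MPI_Request> reqs;
    static const int TAG = 100;

    FuSharer() {
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        bufs.assign(size, 0.0);
        reqs.assign(size, MPI_REQUEST_NULL);
    }
    void publish(double fU) {        // called when a thread improves fU
        for (int p = 0; p < size; ++p) {
            if (p == rank) continue;
            MPI_Wait(&reqs[p], MPI_STATUS_IGNORE);   // finish the previous send
            bufs[p] = fU;
            MPI_Isend(&bufs[p], 1, MPI_DOUBLE, p, TAG, MPI_COMM_WORLD, &reqs[p]);
        }
    }
    void poll(double& fU) {          // called regularly inside the B&B loop
        for (;;) {
            int flag;
            MPI_Iprobe(MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
            if (!flag) return;
            double v;
            MPI_Recv(&v, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (v < fU) fU = v;
        }
    }
    void flush() { MPI_Waitall(size, reqs.data(), MPI_STATUSES_IGNORE); }
};

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    {
        FuSharer fs;
        double fU = 1e9;
        if (fs.rank == 0) fs.publish(fU = 42.0);  // pretend rank 0 improved fU
        else while (fU > 1e8) fs.poll(fU);        // wait until the update arrives
        fs.flush();
    }
    MPI_Finalize();
    return 0;
}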

5.3.1 Hybrid MPI-Pthreads

The Pthreads model (with intrinsic dynamic load balancing) is used for parallelism within a node. Moreover, one MPI process is used per node for the initial work distribution and for gathering the final solution. Within each MPI process, the search is performed by several threads. The execution starts using one thread per MPI process. The initial partition is done by each MPI process generating the initial n! simplices and continuing the execution over its corresponding simplices.

5.3.2 Inter-node dynamic load balancing

There exist several methods for dynamic load balancing between nodes. The main questionsare:

• How to inform about the workload of a node: centralized (master-slave) or distributed? Locally (only between neighbours) or globally (to any node)?
• When to inform about the current workload?
• Is the dynamic balancing initiated by loaded or unloaded nodes?
• Where to send workload: locally or globally?
• Which amount of workload is migrated?
• Which is the best metric to determine if a node is loaded or unloaded?
• How to determine the termination of the parallel algorithm?

In this study, we use a centralized approach between MPI processes in order to reduce communications, where one MPI process manages the execution and also contributes with work. This is valid when the number of nodes is small. For a large number of nodes, a hierarchy of managers can be built. Each MPI worker informs the manager process about the current workload of its node when a thread ends. This ensures that enough work is done between two reports. Workload balancing is initiated by unloaded nodes when they run out of work, by sending a report with zero workload to the manager. Then, the manager asks the most loaded process (the donor) to send workload to the unloaded one. Therefore, the migration of workload is global. The donor process stops its most loaded thread, packs that thread's workload and sends the packet to the specified destination. In our case, the workload is measured as the number of simplices waiting for evaluation. Future studies will try to find better estimators of the pending work [10]. The algorithm finishes when the manager receives zero workload from all workers and runs out of work itself.

5.4 Experimental results

The two test-beds used to evaluate the performance of the parallel models introduced in Sections 5.2 and 5.3 have been obtained from Chapter 4. For the Baritompa test-bed, the accuracy δ is the same as the accuracy used in Chapter 4, i.e., δ = 0.05 (f − f∗). For the Lipschitz test-bed, we used an accuracy of δ = 1/8 L2 for four-dimensional instances, δ = 3/8 L2 for five-dimensional instances, and δ = 12/8 L2 for the six-dimensional instance. Notice that test function L07 is left out from the parallel experiments due to its low computational cost. In contrast, test function B08 is left out from the parallel experiments due to its high computational cost for the Bobpp version.

Table 5.1 shows the test-bed used to evaluate the parallel strategies studied in this chapter. Numerical results reveal the computational effort in terms of the number of evaluated simplices and the wall-clock time for the sequential versions of the Bobpp framework and the custom-made version coded in serial C. The number of evaluated simplices is similar in both cases. However, the wall-clock time consumed by Bobpp is greater than that of the serial C version.

5.4.1 Shared-memory approach

Table 5.2 shows the parallel performance of the Bobpp implementation varying the number of priority queues (pq) and the number of threads. For 2, 4, and 8 threads, the performance is higher if the number of priority queues matches the number of threads. However, the optimal number of priority queues is 8 for 16 threads. For 16 threads and pq = 16, the workload balancing overhead picks up, and in this case the Relative Load Imbalance (RLI, introduced in Chapter 1) is higher (between 6% and 17%), whereas RLI is always less than 9% for the other cases. The workload imbalance is small, but the high cost of the dynamic workload balancing hinders the parallel performance.

Table 5.3 shows the speedup of the Pthread and TBB versions for the test functions, varying the number of threads. The best sequential version is used as reference. TBB performs slightly better than Pthread, although both implementations show near-linear speedup.

Figures 5.1b, 5.1d and 5.1f show the CPU usage histograms of the three implementations, representing the CPU usage in terms of the number of active cores during the execution of the algorithm. Bars represent the fraction of the total time during which a given number of cores are simultaneously running. The idle bar represents the fraction of time during which all cores are waiting, i.e., no thread is running. The histogram has been obtained by profiling test problem number 2 (Dixon & Price) using 16 threads (with pq = 8 for Bobpp). The profiler indicates that the performance of Bobpp suffers from a high spin time (45%). The executions for pq = 1 and pq = 16 show an even larger spin time of 82% and 60%, respectively.

In both the Pthread and TBB implementations, the only synchronization point between threads is the update of the upper bound (stored in a global variable and protected with locks). For our test-bed this update is not a frequent operation.


[Figure 5.1: Speedup and CPU usage histograms for the Bobpp, TBB and Pthread models. Panels: (a) relative speedup for Bobpp; (b) CPU usage histogram for Bobpp; (c) speedup for TBB; (d) CPU usage histogram for TBB; (e) speedup for Pthread; (f) CPU usage histogram for Pthread. The speedup panels plot speedup against the number of threads (2 to 16); the histogram panels plot time against the number of simultaneously active cores (idle, 1 to 16).]


Table 5.1: Test-bed sequentially solved using Bobpp and custom-made version

                 Bobpp version                  Custom-made version
Id.      N. Eval. S.   Elapsed time       N. Eval. S.   Elapsed time
B01    1,033,107,720          8,294     1,010,945,400          1,152
B02      123,576,086            302       123,575,850             95
B03      219,996,634            522       219,996,634            239
B04    1,877,094,680          3,099     1,877,094,680          1,378
B05      166,881,758          2,453       166,831,502            442
B06      989,052,844          9,211       989,052,844          1,359
B07      260,996,816            715       261,009,818            203
B09      175,643,470            432       175,613,436            135
B10       87,628,502            196        87,628,502             67
B11      603,679,830          1,539       603,678,276            466

L01      895,300,342          3,605       893,936,704            872
L02      169,291,922            336       173,657,364            141
L03      293,123,212            647       295,477,634            242
L04      293,861,420            749       295,300,328            249
L05      189,855,508            379       193,891,104            153
L07      141,128,844            328       141,147,256            111
L08      218,010,726          1,373       217,880,340            483
L09      123,020,028            543       122,890,838            262
L10      322,890,626          5,079       322,865,034          2,631

This fact, along with an efficient load balancing policy, results in an almost linear speedup for both implementations. The small difference in speedup between Pthread and TBB can be analysed by studying Figures 5.1d and 5.1f, where it is noticeable that TBB achieves full utilization of the 16 cores. On the contrary, Pthread leaves one, two or even all cores idle for a small fraction of time, which may be convenient if cooling or energy saving are taken into account. This behaviour is due to the dynamic thread creation implemented in the Pthread approach.

It is worth mentioning that in TBB it is more challenging to monitor some statistics of the execution. For instance, the Pthread version efficiently keeps track of the number of evaluated simplices by using per-thread private variables. However, in TBB, tasks are not tied to a particular thread, so per-thread privatization of a variable requires expensive system calls. A straightforward alternative is to store the number of evaluated simplices in a global atomic variable. However, the frequent and concurrent update of this single global memory position may easily kill the scalability of the code, mainly due to cache invalidations, but also due to contention in the access to the atomic variable. Therefore, such statistics have been deactivated in the production version of the TBB implementation in order to collect the data in Table 5.3.


Table 5.2: Elapsed time of the Bobpp version varying the number of priority queues

              16 threads              8 threads               4 threads           2 threads
No.    pq=16   pq=8   pq=1     pq=8   pq=4   pq=1     pq=4   pq=2   pq=1     pq=2   pq=1
B01    1,188    651  1,710    1,120  1,219  1,765    2,223  2,341  2,716    4,345  4,967
B02      177     61    205       63     84    194      101    128    210      184    309
B03      280    106    356      113    126    355      171    220    376      314    535
B04    1,476    897  3,074      920  1,022  3,011    1,203  1,636  3,021    2,068  4,358
B05      247    166    276      321    336    367      628    652    681    1,261  1,331
B06      986    690  1,667    1,223  1,327  1,776    2,423  2,665  2,848    4,787  5,379
B07      338    130    423      137    157    414      214    285    450      394    679
B09      229     86    285       91    104    274      186    180    302      265    439
B10      117     43    144       44     54    140       66     86    147      121    208
B11      812    296    982      309    367    954      502    606  1,037      905  1,474

L01      739    450  1,504      558    662  1,485      977  1,195  1,695    2,001  2,680
L02      200     82    279       84     98    270      109    159    279      215    396
L03      331    142    491      134    167    478      204    273    488      405    695
L04      336    146    492      153    174    476      229    298    497      451    728
L05      203     90    312       92    105    310      120    175    315      250    452
L07      171     68    234       69     84    224      100    154    241      201    350
L08      286    124    362      193    222    371      368    420    517      747    911
L09      148     63    206       83     95    201      151    177    248      306    421
L10      595    360    540      675    705    794    1,333  1,383  1,492    2,671  2,912

5.4.2 Distributed-memory approach

Table 5.4 illustrates the results of the experimentation using 2 MPI (32 threads), 4 MPI (64 threads), and 8 MPI (128 threads) processes, without fU broadcasting and without inter-node dynamic load balancing. Based on previous experimentation, we use 16 threads per MPI process and consequently per node. It can be observed that the Search Overhead Factor (SOF) is non-decreasing in the number of MPI processes. The SOF is 1 except for a few cases. Instance L02 has a high SOF; therefore, its speedup is not good, because the work done by the parallel version is more than three times that of the sequential version. This instance suffers a detrimental anomaly, explained in Section 1.2.5. The RLI does not show a predictable behaviour. The maximum RLI reaches values around 60%. The values of RLI indicate that the static load balancing is not good enough to reach the maximum performance. Dynamic load balancing in the intra-node layer alleviates the imbalance produced by the irregular search and helps to reach an acceptable speedup, compared with a version which only uses MPI [45, 44]. The performance of the parallel version depends on the distribution of the initial simplices, which is determined by the number of MPI processes. The best speedup is achieved by instance B07 due to a good initial workload balance.


Table 5.3: Speedup of Pthread and TBB versions varying the number of threads

              Pthreads                     TBB
Id.      2     4     8     16       2     4     8     16
B01    1.9   3.8   7.6   15.2     1.9   3.8   7.6   15.2
B02    1.8   3.7   7.4   14.8     1.8   3.7   7.5   14.8
B03    1.9   3.8   7.6   15.1     1.9   3.8   7.6   15.1
B04    1.8   3.7   7.4   14.9     1.9   3.7   7.5   14.8
B05    2.0   3.9   7.8   15.6     2.0   3.9   7.8   15.6
B06    1.9   3.8   7.7   15.3     1.9   3.8   7.7   15.2
B07    1.8   3.7   7.4   14.7     1.9   3.7   7.5   15.0
B09    1.8   3.7   7.4   14.6     1.9   3.7   7.4   14.9
B10    1.9   3.7   7.4   14.8     1.9   3.7   7.5   14.9
B11    1.8   3.7   7.3   14.8     1.9   3.7   7.5   14.9

L01    1.9   3.8   7.5   15.1     1.9   3.8   7.7   15.3
L02    1.8   3.7   7.2   13.8     1.8   3.8   7.1   13.4
L03    1.8   3.7   7.5   14.9     1.9   3.8   7.6   15.2
L04    1.9   3.8   7.5   14.9     1.9   3.8   7.5   14.7
L05    1.8   3.6   7.4   14.8     1.9   3.8   7.6   15.5
L07    1.9   3.7   7.4   14.6     1.9   3.8   7.5   15.2
L08    1.9   3.9   7.7   15.5     2.0   4.0   8.0   16.0
L09    1.9   3.9   7.7   15.3     2.0   4.0   7.9   15.8
L10    1.9   3.9   7.8   15.6     2.0   4.0   8.0   16.1

Table 5.5 shows experimental results from the MPI-Pthread version with fU broadcasting and inter-node dynamic load balancing. Comparing Table 5.4 with Table 5.5, the value of SOF is smaller than that reported in Table 5.4 for the cases where SOF is greater than 1. The average RLI is reduced by extending the dynamic load balancing strategy to the inter-node layer. The speedup is increased, although it is still less than linear. A possible reason is the communication between MPI processes, nonexistent in the version with static load balancing between MPI processes.

5.5 Summary

This chapter compares three shared-memory implementations, using different abstraction levels, of a Global Optimization B&B algorithm. Results show that features like the selection rule, the load balancing method and a customizable number of working sets are important. The highest-abstraction code, based on the Bobpp framework, obtains a low speedup for most of the test problems. The lowest-abstraction code, based on Pthreads, uses a load balancing method inherent in dynamic thread creation and obtains a performance similar to that of the middle-abstraction code based on TBB. Using a dynamic number of threads opens the possibility of adapting the parallelism level of the application to the currently available computational resources during runtime.

In order to use a large number of cores, a hybrid MPI-Pthread version of the algorithm was elaborated. This facilitates load balancing between MPI processes.

Table 5.4: MPI-Pthread approach using 16 threads per MPI process

               2 MPI                   4 MPI                    8 MPI
No.    SOF     RLI      Sp      SOF     RLI       Sp      SOF     RLI       Sp
B02    1.0   25.9%    26.6      1.0   13.8%     54.9      1.0   11.6%    109.6
B03    1.0   41.4%    24.3      1.0   28.2%     48.7      1.0   46.7%     72.8
B07    1.0    0.0%    30.3      1.0    0.0%     60.9      1.0    0.0%    121.3
B09    1.0    0.8%    30.2      1.0    2.7%     59.8      1.0    8.6%    112.5
B10    1.0   51.4%    22.7      1.0   32.1%     46.5      1.0   42.6%     76.3

L01    1.0   30.6%    25.0      1.0   34.9%     42.9      1.1   30.6%     83.5
L02    3.8   60.4%     5.5      3.3   42.1%     12.6      3.6   26.3%     25.8
L03    1.1    2.9%    26.8      1.1   24.2%     43.9      1.7   46.3%     44.1
L04    1.2   38.8%    20.2      1.0   15.9%     51.9      1.3   53.0%     47.7
L05    1.0    3.7%    29.4      1.1   18.2%     48.6      1.1   14.6%     99.0
L06    1.0   10.8%    28.3      1.0    6.0%     57.0      1.0    9.5%    109.9
L08    1.0   23.8%    26.4      1.0   12.0%     54.8      1.0    5.9%    114.4
L09    1.0   19.1%    27.5      1.0    9.7%     55.0      1.0   13.0%    103.8
L10    1.0    5.9%    30.1      1.0    8.7%     56.0      1.0   10.3%    108.1

Avg.   1.2   22.5%    25.3     1.19  17.75%    49.53     1.27  22.79%    87.77
Max.   3.8   60.4%    30.3     3.27  42.13%    60.91     3.61  53.03%   121.33
Min.   1.0    0.0%     5.5     1.00   0.00%    12.61     1.00   0.01%    25.75

Numerical results show an improvement over earlier published results on parallel Lipschitz Global Optimization [77]. Including both the broadcasting of the best upper bound and the inter-node load balancing reduces the wall-clock time. However, this strategy provides a less than linear speedup. Different parameters of the parallel algorithm must be tuned in order to improve the actual speedup.


Table 5.5: Fully-dynamic approach sharing upper bound and work

               2 MPI                   4 MPI                    8 MPI
No.    SOF     RLI      Sp      SOF     RLI       Sp      SOF     RLI       Sp
B02    1.0    0.0%    28.6      1.0    7.9%     55.0      1.0    0.9%    113.5
B03    1.0   37.1%    24.5      1.0   20.2%     51.6      1.0    1.1%    115.0
B07    1.0    0.3%    28.5      1.0    0.1%     57.4      1.0    0.3%    114.0
B09    1.0    0.4%    28.5      1.0    0.5%     57.4      1.0    2.5%    112.3
B10    0.9   47.0%    23.0      1.0   22.9%     49.9      1.0    1.3%    113.5

L01    1.0   23.4%    25.9      1.0    3.7%     55.6      1.0    1.2%    109.4
L02    2.1    0.5%    13.5      1.4    1.6%     39.2      1.9    0.7%     58.7
L03    1.0    0.2%    27.7      1.0    1.0%     57.6      0.9    1.8%    122.9
L04    0.9    0.0%    30.5      0.9    1.0%     58.4      1.0    0.5%    111.5
L05    1.0    6.3%    27.5      1.0   11.3%     51.2      1.0    0.3%    113.2
L06    1.0    2.3%    27.4      1.0    0.2%     55.4      1.0    1.4%    110.9
L07    1.0   22.8%    26.4      1.0    0.5%     57.0      1.0    0.3%    114.0
L08    1.0    0.7%    28.3      1.0    3.2%     55.8      1.0    2.1%    110.5
L09    1.0    5.3%    28.5      1.0    7.5%     56.2      1.0    7.4%    111.3

Avg.   1.1  10.45%    26.3      1.0   5.83%     54.1      1.1   1.55%    109.3
Max.   2.1  46.97%    30.5      1.4  22.92%     58.4      1.9   7.36%    122.9
Min.   0.9   0.01%    13.5      0.9   0.09%     39.2      0.9   0.28%     58.7


Part II

Dynamic Programming



CHAPTER 6

Dynamic Programming applied to traffic control

Traffic lights can be controlled dynamically through rules reacting on the number of waiting vehicles at each light. A policy can be defined by a so-called Traffic Control Table (TCT). In [34], a Markov Decision Process (MDP) has been described and Value Iteration has been applied to generate TCTs for isolated intersections. Value Iteration is a special technique of stochastic Dynamic Programming (DP). This chapter studies the generation of a TCT-based rule that takes the arrival information of new vehicles into account. The question is how to generate (in parallel) such a table for an intersection or a network of intersections. The generation is particularly difficult due to the computational work involved in applying Value Iteration.

The problem is formulated as an MDP and solved using the Value Iteration algorithm based on backward induction. We are specifically interested in exploiting the structure of the problem for simple infrastructures, with only a few traffic lanes, using a (parallel) algorithm.

The chapter is structured as follows. Section 6.1 introduces the concept of DP applied to the generation of TCTs. In Section 6.2, an MDP model is described for the decisions captured in a TCT. Section 6.3 shows two infrastructures where the model is followed. Section 6.4 provides the algorithm of Value Iteration for the described problem. The experimental results are given in Section 6.5. Findings are summarized in Section 6.6.

6.1 Introduction

Dynamic decision making represents a wide area in real life, where the main goal is to take a decision from a certain state, i.e., to decide on the action x when being in state s at time t. Here, we consider the problem of minimizing the waiting time at traffic lights.

Traffic lights were introduced in the early twentieth century to make traffic safer in places where traffic from different directions intersects at what is called a junction or intersection.

To give priority to several traffic flows, the vehicles approaching from other flows must wait before getting right of way. The overall waiting time can be kept low if there exists a conveniently-controlled sequence in the traffic lights network. This problem has been considered by theorists as well as engineers [59, 70, 72, 88].

The optimal traffic control in a traffic junction is based on what we call a Traffic Control Table (TCT). This table determines which flow combination has priority given the current traffic state (the number of waiting vehicles in each queue) in the traffic junction. To find a TCT which minimizes the vehicle waiting time, one can model the problem as an MDP and apply a Value Iteration algorithm based on backward induction [34, 35, 36]. The idea of backward induction was introduced by Bellman in 1953 to solve stochastic DP models [8]. Backward induction can be applied in dynamic optimization problems where time is discrete, for solving stationary systems without a time horizon.

DP usually deals with a considerable number of states (Bellman's curse of dimensionality). Executing the Value Iteration algorithm requires storing all this information and accessing it in an efficient way.

6.2 Model description for a single intersection

The main goal is to calculate a TCT that determines how to manage the traffic lights considering the traffic density, so that the overall waiting time is minimized in the long run. Given the current state of the traffic, the table should capture which flows should get green light. The problem is formulated as an MDP.

6.2.1 Goal

The waiting time of a vehicle is defined as the time from the moment a vehicle joins the queue until it crosses the stopping line. The practical objective is to find a control that minimizes the Expected Waiting (EW) time in the long run. Little's law formalizes the relationship between the arrival rate λ, the EW time, and the number of waiting vehicles in a queue (EQ) as follows:

EQ = λ × EW.

If the number of vehicles waiting in the queues is minimized, the waiting time is alsominimized. The MDP model does not need to keep track of the time a particular vehicle iswaiting, but it does so for the number of queued vehicles.
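As a small worked illustration of Little's law (numbers invented): if vehicles arrive at a queue at a rate of λ = 0.4 vehicles per slot and the long-run average queue length is EQ = 4 vehicles, then the expected waiting time is EW = EQ/λ = 4/0.4 = 10 slots.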

6.2.2 Model assumptions

The optimization problem of determining the traffic light control is modelled as a stochastic and discrete-time control process. Given the state s of the traffic and the lights, an action x must be chosen from the available ones. The optimal choice x(s) is then derived via the process of Value Iteration.

The model is based on the following ingredients:

• Time is divided into slots.
• A slot is the time required for a vehicle to leave a queue and cross an intersection. It also determines the safety distance between two vehicles.
• The decision of changing the traffic light state is implemented instantaneously at the beginning of a slot.
• After a slot with red light for all traffic, the light must be changed to green for one combination.


• The change from green to red light requires two slots of yellow: Y1 and Y2.
• While the light is green or yellow, at most one vehicle can cross the intersection in a slot.
• To facilitate the analysis, the vehicle length is neglected.

6.2.3 Formulation as a Markov Decision Process

The notation used throughout this chapter is defined below. A crossing or intersection consists of F traffic flows or lanes, where each flow f consists of a lane and an associated queue. Vector q = (q1, . . . , qF) stores the number of vehicles waiting in each queue. Each queue f is truncated at Qf vehicles to make the state space finite and to facilitate numerical computation of an optimal policy. We will also consider the possibility of having arrival information for a single flow in a vector a. For a single intersection, the set of traffic lanes F = {1, . . . , F} can be partitioned into C disjoint subsets called combinations. Conflicting traffic flows do not get green simultaneously. The combinations are given beforehand and determine the instance of the problem. All the lanes within a combination receive green, red or yellow light simultaneously. If a combination receives green or yellow light, the others receive red light. The model holds for a single intersection but can be generalized to networks of intersections. In the next section, two simple cases are discussed, including a simple network, which contribute to a better understanding of the model. The following variables are used to model the problem.

State s

A state s ∈ S is defined by the current light situation l and the state of the traffic flows (at the end of a slot). The possible light states are denoted by index l ∈ L = {1, . . . , L}, where L = 1 + 3C, as either all lights are red or exactly one of the C combinations has green, Y1 or Y2 light. The traffic flow state for this model is defined by q and a = (a1, . . . , aM). Vector a represents the arrival information for a single flow, where at ∈ {0, 1} determines whether a vehicle arrives at this queue in t time slots from now. Here we assume arrival information is available for just one flow.

Therefore, a state s is defined by s = (l, q, a). The number of possible states is given by

|S| = |L| · ∏f (Qf + 1) · 2^M.    (6.1)

Control action x

Given a state s, the action x ∈ X (s) ⊆ L immediately adjusts the lights.

State transition

During a slot, one vehicle arrives at traffic flow f with probability λf. The stochastic event e is defined as a vector of F elements e = (e1, . . . , eF), where ef ∈ {0, 1} denotes whether a vehicle enters the infrastructure at lane f within the coming time slot. Let function ∆f(x) denote whether lane f has right of way (∆f(x) = 1) or not (∆f(x) = 0) when the lights are changed due to action x. As a result of decision x, state (l, q, a) changes into state

T(x, s, e) = (x, (q + α − ∆(x))+, (a2, . . . , aM, e1)),    (6.2)

where α = (a1, e2, . . . , eF) and y+ = max{y, 0}.
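A compact sketch (ours, not thesis code) of the transition function (6.2), instantiated for F = 4 and M = 5. Following the general formulation literally, the arrival information feeds the first flow; delta() encodes, as an example, the right-of-way pattern of the F4C2 case discussed in the next section.

#include <algorithm>
#include <array>
#include <cstdio>

const int F = 4;   // number of flows
const int M = 5;   // slots of arrival information

struct State {
    int l;                     // light state, 1..7
    std::array<int, F> q;      // queue lengths
    std::array<int, M> a;      // arrival information for the informed flow
};

// Δf(x): flows with indices 0 and 2 have right of way when combination 1
// has green or yellow (x in {2, 3, 4}); the other two flows when
// combination 2 has it (x in {5, 6, 7}).
int delta(int x, int f) {
    bool c1 = (x >= 2 && x <= 4);
    bool c2 = (x >= 5 && x <= 7);
    return (f == 0 || f == 2) ? c1 : c2;
}

State transition(int x, const State& s, const std::array<int, F>& e) {
    State t;
    t.l = x;                                                 // lights adjust at once
    std::array<int, F> alpha = {s.a[0], e[1], e[2], e[3]};   // α = (a1, e2, ..., eF)
    for (int f = 0; f < F; ++f)
        t.q[f] = std::max(s.q[f] + alpha[f] - delta(x, f), 0);   // (q + α − Δ(x))+
    for (int i = 0; i + 1 < M; ++i) t.a[i] = s.a[i + 1];         // shift arrivals
    t.a[M - 1] = e[0];                                       // new arrival enters
    return t;
}

int main() {
    State s{1, {2, 0, 1, 3}, {1, 0, 0, 1, 0}};    // some state under Red
    State t = transition(2, s, {0, 1, 0, 0});     // action: green for combination 1
    std::printf("l=%d q=(%d,%d,%d,%d)\n", t.l, t.q[0], t.q[1], t.q[2], t.q[3]);
}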


Objective function

The objective function is to minimize the number of vehicles waiting at the queues. The contribution to the objective function over a single time slot is c(s) = ∑f qf. The associated cost in a general MDP usually depends on the state s as well as on the decision x. In the model we are interested in, it depends on the state only, so we have the cost function c(s).

6.2.4 Bellman’s principle of optimality

The strategy x(s) is optimal [8] if there exists a value function v(s) and a scalar d such that

v(s) − d = c(s) + min_{x ∈ X(s)} E[v(T(x, s, e))],  ∀s ∈ S,    (6.3)

where E symbolizes the expected value with respect to the stochastic event e. The transition function T(x, s, e) describes the next state to be reached after taking the decision x in the state s on event e according to (6.2).

Essential in the described model is that the number of events is finite, such that they can be numbered as ei with probability of occurrence pi. In addition, we also consider that the countable state space is finite, such that the states are indexed j = 1, . . . , |S|. This implies that the value function v can be captured by a vector V with elements Vj = v(sj). The Bellman equation (6.3) implies finding a valuation vector V and a constant d such that

Vj − d = c(sj) + min_{x ∈ X(sj)} ∑i pi Vk,    (6.4)

where k is the state index value related to state T(x, sj, ei). Notice that the index values of the reachable states from state sj form a subset

Kj = {k : sk = T(x, sj, ei), ∀i, ∀x ∈ X(sj)}

of all states S. If one has a valuation V, the optimum control value can also be derived from

Xj = x(sj) = arg min_{x ∈ X(sj)} ∑i pi Vk,    (6.5)

where again k is the state index related to T(x, sj, ei).

6.3 Studied cases of the TCT model

In [34], an identification is introduced to denote a specific infrastructure. For instance, I1F2C2 stands for a single intersection with F = 2 traffic flows (or lanes) and C = 2 combinations. Figure 6.1 shows the infrastructures that will be elaborated:

F4C2: This single intersection consists of four flows from which traffic comes. Each direction has a single lane with its corresponding queue at the stopping line. Additionally, lane 4 has arrival information. Figure 6.1a depicts this intersection. Flows are numbered clockwise (1–4), rather than using the more detailed notation that is common in traffic engineering. Streams 1 and 3 receive green simultaneously and are grouped in combination 1 (C1). Combination 2, C2, consists of streams 2 and 4. At most one combination at a time has right of way (when its lights are green or yellow).


[Figure 6.1: Traffic infrastructures. (a) The F4C2 infrastructure: a single traffic light l with queues q1 to q4 and arrival information a1 to a5 on lane 4. (b) The I2F2C2 infrastructure: two intersections with lights l1 and l2, queues q1 to q4, and arrival information a1 to a5 on the internal lane.]

I2F2C2: This arterial consists of two simplified intersections of the F4C2 type. We refer to this case as the I2F2C2 infrastructure, where I2 indicates the number of intersections along the arterial. The I = 2 intersections are numbered from left to right, i.e., from West to East. Vehicles from junction 1 may drive to junction 2. In this case, it is assumed that the time needed to drive from junction 1 to junction 2 is M slots, which is the size of the arrival information in lane 4 of junction 2. The West-to-East flow is called the arterial. The traffic flow between the two intersections is the internal flow (f = 4). The flows entering the network from the outside are called the external flows (f = 1, 2, 3). Vehicles leaving lanes 1 and 2 can turn right with probability ρ; to simplify the model, the same probability ρ is used for both lanes.

The specific features of both infrastructures are explained as follows.

6.3.1 State s

A state s is defined by s = (l, q, a), see Section 6.2. For the F4C2 case, the seven traffic light states are

l ∈ L = {1, 2, 3, 4, 5, 6, 7} = {Red, GC1, Y1C1, Y2C1, GC2, Y1C2, Y2C2},

where:


• Red: Red light for all the combinations.
• GC1: Green light for combination 1.
• Y1C1: Yellow light for combination 1 (slot 1 of 2).
• Y2C1: Yellow light for combination 1 (slot 2 of 2).
• GC2: Green light for combination 2.
• Y1C2: Yellow light for combination 2 (slot 1 of 2).
• Y2C2: Yellow light for combination 2 (slot 2 of 2).

The traffic flow state for this case is defined by:

• q = (q1, q2, q3, q4). Queue length at the four flows.
• a = (a1, . . . , aM). Arrival information at flow 4.

The number of possible states for the F4C2 case is given by

|S| = 7 · ∏f (Qf + 1) · 2^M.    (6.6)

For the I2F2C2 case, two traffic lights appear and the state space of the traffic lights is L = {(l1, l2) : l1 ∈ L1 ∧ l2 ∈ L2}, with L2 being similar to L1 (but with C1 and C2 replaced by C3 and C4). In this way, the number of light states is 7² = 49, starting from (Red, Red) and finishing with (Y2C2, Y2C4).

The traffic flow state for this case is defined by:

• q = (q1, q2, q3, q4). Queue length at the four flows.
• a = (a1, . . . , aM). Arrival information at flow 4.

The number of possible states for the I2F2C2 case is given by

|S| = 49 · ∏f (Qf + 1) · 2^M.    (6.7)

6.3.2 Control action x

For the F4C2 case, the possible subsets X (s) ⊂ L are enumerated as follows.

X(s) =
    {2, 5} = {GC1, GC2}     if l = 1 = Red
    {2, 3} = {GC1, Y1C1}    if l = 2 = GC1
    {4}    = {Y2C1}         if l = 3 = Y1C1
    {1}    = {Red}          if l = 4 = Y2C1
    {5, 6} = {GC2, Y1C2}    if l = 5 = GC2
    {7}    = {Y2C2}         if l = 6 = Y1C2
    {1}    = {Red}          if l = 7 = Y2C2
                                               (6.8)
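In code, (6.8) reduces to a simple lookup. The following transcription (ours) uses the light-state numbering 1–7 introduced above:

#include <cstdio>
#include <vector>

// Action sets X(s) of (6.8) for the F4C2 case.
std::vector<int> actions(int l) {
    switch (l) {
        case 1:  return {2, 5};   // Red  -> switch to GC1 or GC2
        case 2:  return {2, 3};   // GC1  -> stay green or start yellow
        case 3:  return {4};      // Y1C1 -> Y2C1
        case 4:  return {1};      // Y2C1 -> Red
        case 5:  return {5, 6};   // GC2  -> stay green or start yellow
        case 6:  return {7};      // Y1C2 -> Y2C2
        default: return {1};      // Y2C2 -> Red
    }
}

int main() {
    for (int x : actions(1)) std::printf("from Red we may choose x = %d\n", x);
}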

In the I2F2C2 case, in total 49 subsets exist to describe the possible actions x = (x1, x2) ∈ L.

6.3.3 State transition

As F4C2 has four flows, each flow j having a fixed arrival probability λj, the event is defined as a vector e of four elements, e = (e1, e2, e3, e4). For this simple case of four lanes, sixteen possible events can happen:

e ∈ {(0, 0, 0, 0), (0, 0, 0, 1), . . . , (1, 1, 1, 1)}.


The probability pi related to the events is:

i = 1,  e = (0, 0, 0, 0)  with  p1 = (1 − λ1)(1 − λ2)(1 − λ3)(1 − λ4)
i = 2,  e = (0, 0, 0, 1)  with  p2 = (1 − λ1)(1 − λ2)(1 − λ3)λ4
...
i = 16, e = (1, 1, 1, 1)  with  p16 = λ1λ2λ3λ4
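These probabilities are simple products over the flows. A minimal sketch (ours; the λ values are invented example values) that enumerates all 2^F events and their probabilities in the order listed above:

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> lambda = {0.4, 0.4, 0.4, 0.4};   // λ1..λ4 (example values)
    const int F = (int)lambda.size();
    for (int i = 0; i < (1 << F); ++i) {                 // enumerate all 2^F events
        double p = 1.0;
        for (int f = 0; f < F; ++f) {
            int ef = (i >> (F - 1 - f)) & 1;             // component e_{f+1} of event i
            p *= ef ? lambda[f] : 1.0 - lambda[f];
        }
        std::printf("i=%2d  e=(%d,%d,%d,%d)  p=%.6f\n", i + 1,
                    (i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1, p);
    }
}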

The state transition for a single intersection is described by (6.2). Consider for instance M = 5. During a slot, the arrival information denoted by a = (a1, a2, a3, a4, a5) shifts towards a ← (a2, a3, a4, a5, e1), i.e., the arriving vehicles at the lane are approaching, and a1 represents the vehicle that is added to the queue.

Depending on the traffic light state, the queue length will increase or not. Suppose that combination 2 has red light, i.e., the light state takes a value l ∈ {1, 2, 3, 4}; then ∆4(x) = 0. A vehicle which is one slot upstream of queue 4 increases the number of vehicles in queue 4. When x sets the colour to green or yellow for combination 2 (∆4(x) = 1) and queue 4 is not empty (q4 > 0), the vehicle is added to the queue. If the queue is empty (q4 = 0), the vehicle crosses the stop line of flow 4 without delay. The transition of queue 4 is given by q4 ← q4 + a1 − ∆4(x).

For the rest of the lanes, there exists no information about the arrival of new vehicles to the queue. Therefore, the next state of queue f ∈ {1, 2, 3} depends on ef and action x. The transition of queue f is given by qf ← (qf + ef − ∆f(x))+.

The I2F2C2 infrastructure also has four flows, but only three external flows come into the system. Therefore, the number of possible events is 2³ = 8. The probability pi related to the events is:

i = 1, e = (0, 0, 0)  with  p1 = (1 − λ1)(1 − λ2)(1 − λ3)
i = 2, e = (0, 0, 1)  with  p2 = (1 − λ1)(1 − λ2)λ3
...
i = 8, e = (1, 1, 1)  with  p8 = λ1λ2λ3

For lanes f = 1, 2, 3, the queue length transition is given by qf ← (qf + ef − ∆f(x))+. For lane 4, q4 ← (q4 + a1 − ∆4(x))+ holds.

The incoming traffic for lane 4 is set by vehicles leaving queue f ∈ {1, 2} of the left-hand side intersection. If such a queue is not empty (qf > 0) or a vehicle arrives at the queue (ef = 1), element a5 becomes 1 if ∆f(x1) = 1 for any flow f = 1, 2. Thus, the arrival information changes over a slot from a = (a1, a2, a3, a4, a5) to

a ← (a2, a3, a4, a5, ∆1(x1) · min{1, q1 + e1} + ∆2(x1) · min{1, q2 + e2}).

6.3.4 Objective function

Both F4C2 and I2F2C2 deal with four queues, so the objective function is

c(s) = q1 + q2 + q3 + q4.

6.4 Value Iteration through backward induction

Bellman introduced the term “backward induction” and also indicated the way to solve (6.4) using fixed point theory, where (6.4) is repeated iteratively. The latter process is called Value Iteration [79]. There are several ways to describe the iterative process. Our description takes an algorithmic viewpoint, as opposed to the more conventional mathematical notation in [79]. The exact process, which we aim to parallelize in Chapter 7, is described in Algorithm 6.1. The algorithm uses two iterates, called the current valuation vector Y and the former valuation vector W, used to determine a new valuation Y according to

Yj = c(sj) + min_{x ∈ X(sj)} ∑i pi Wk,   j = 1, . . . , |S|,    (6.9)

where k is the state index related to state T(x, sj, ei). Both iterates are aimed to converge to an optimal valuation vector V, leading to convergence towards d = Yj − Wj for all j = 1, . . . , |S|. Convergence to the scalar d is measured by the so-called span:

span(Y, W) = maxj (Yj − Wj) − minj (Yj − Wj).    (6.10)

The iterative procedure of Algorithm 6.1 stops whenever span(Y, W) is smaller than a pre-specified accuracy ε, indicating the accuracy in estimating d.

Algorithm 6.1 Value Iteration by Backward Induction

1: Set vector Y to zero
2: repeat
3:    Copy vector Y into vector W
4:    for j = 1, . . . , |S| do
5:       Yj = c(sj) + min_{x ∈ X(sj)} ∑i pi Wk
6:    end for
7: until maxj(Yj − Wj) − minj(Yj − Wj) < ε
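Algorithm 6.1 translates almost directly into code. The following self-contained sketch (ours) implements it for a generic finite MDP described by the per-state costs c(sj) and, for every state and admissible action, the list of (pi, k) pairs; the tiny two-state model in main is invented purely to exercise the code.

#include <cstdio>
#include <vector>

struct Action { std::vector<std::pair<double, int>> succ; };   // (p_i, k) pairs
struct MDP { std::vector<double> c; std::vector<std::vector<Action>> X; };

std::vector<double> value_iteration(const MDP& m, double eps) {
    size_t S = m.c.size();
    std::vector<double> Y(S, 0.0), W(S);
    for (;;) {
        W = Y;                                     // copy Y into W
        for (size_t j = 0; j < S; ++j) {           // backward induction sweep
            double best = 1e300;
            for (const Action& a : m.X[j]) {       // min over x in X(s_j)
                double v = 0.0;
                for (auto& [p, k] : a.succ) v += p * W[k];
                if (v < best) best = v;
            }
            Y[j] = m.c[j] + best;
        }
        double lo = 1e300, hi = -1e300;            // span(Y, W), eq. (6.10)
        for (size_t j = 0; j < S; ++j) {
            double d = Y[j] - W[j];
            if (d < lo) lo = d;
            if (d > hi) hi = d;
        }
        if (hi - lo < eps) return Y;
    }
}

int main() {
    // Tiny two-state example (invented): in each state one may stay or move.
    MDP m;
    m.c = {1.0, 0.0};
    m.X = {{Action{{{1.0, 0}}}, Action{{{1.0, 1}}}},
           {Action{{{1.0, 1}}}, Action{{{1.0, 0}}}}};
    std::vector<double> V = value_iteration(m, 0.01);
    std::printf("V = (%.3f, %.3f)\n", V[0], V[1]);
}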

How is this property translated into the TCT model? In the first place, notice that the strategy x(s) represents the idea of a TCT. In the model, the set of states S is countable, as it is based on the light colours and the numbers of waiting vehicles. In addition, a maximum number of vehicles in a queue is defined, such that the total number of states becomes finite. The values of states beyond the limit Qf are obtained by linear extrapolation. The function v(s) is represented by vector V, and x(s) is the final TCT given by (6.5). The computational challenge is that |S| is usually high. An important observation for the defined model is that the computation of Yj requires the values Wk only for the small index subset Kj of S.

The research question is how to handle the retrieval of Wk values from memory in thecomputational process, such that the loop over j in Algorithm 6.1 can be done in an efficientway.

6.5 Evaluation

The test-bed is divided into three sets, resulting in a total of 44 experiments. The convergence accuracy ε is set to 0.01. This value is used in the termination criterion, defined in (6.10). The aim of these three sets is to show the computational effort needed to solve simple cases. A side question, addressed in [35], is the impact of having arrival information on the model.

Set 1 relates to the F4C2 intersection. Within this set, there are four subsets with different values of incoming traffic. The first three subsets have a symmetric arrival rate, while Subset 1d has an asymmetric arrival rate.


Subset 1a: Arrival rate λf = 0.4 for the external flows f = 1, 2, 3.
Subset 1b: Arrival rate λf = 0.3 for the external flows f = 1, 2, 3.
Subset 1c: Arrival rate λf = 0.2 for the external flows f = 1, 2, 3.
Subset 1d: Arrival rate λ1 = λ3 = 0.2, and λ2 = λ4 = 0.4.

Set 2 relates to the I2F2C2 arterial. All traffic is ongoing or through, so ρ = 0.
Set 3 is similar to Set 2, but the fraction of vehicles that turn right is set to ρ = 0.5.

Table 6.1 shows the numerical results for Set 1 (F4C2 intersection), while Table 6.2 shows the numerical results for Sets 2 and 3 (I2F2C2 infrastructure). Tables 6.1 and 6.2 are vertically divided into three parts. The first part is related to the maximum size Qf of each queue f and the number of slots in the arrival information (M). The second part shows the memory consumption in terms of the number of states for each experiment and the amount of memory to store the current and the previous value of each state (vectors Y and W, respectively). The number of states is determined by (6.6) and (6.7). The value of each state is stored in main memory as double precision (eight bytes). The third part relates to the results obtained by executing the algorithm: the average waiting time of a vehicle, the number of iterations to reach the desired accuracy ε, and the wall-clock time in seconds. Instances n. 8, 34, and 40 have been solved in parallel using the techniques presented in Chapter 7; these instances have not been solved sequentially, because the memory requirement and the elapsed time would be excessive.

Regarding Set 1, a reduction in the average waiting time is obtained as the number of arrival information slots increases. The inclusion of arrival information reduces the average waiting time by up to 13% in Subset 1c. The higher the arrival rate of the external flows, the more time vehicles have to wait to leave the intersection. Vehicles wait between 10 and 11 time units when the arrival rate is λ = 0.4 for all the external lanes, whereas they wait between 1 and 2 time units when λ = 0.2. In addition, the number of iterations to obtain the TCTs is reduced if the arrival rate is small. There is an explosion of memory consumption when M increases from 10 to 15 slots of arrival information.

Regarding the I2F2C2 infrastructure, the time to evaluate a state of this infrastructure is longer than the time to evaluate a state of the F4C2 infrastructure. The difficulty of calculating the transition increases, as incoming vehicles depend on factors like ρ and the output of the left junction (see Figure 6.1b). When there exists the possibility that a vehicle turns right, the average waiting time increases, because it is more difficult to predict the behaviour of a vehicle compared to the situation where all vehicles follow an ongoing direction. Comparing Sets 2 and 3, the model behaviour is less predictable when the turning-right percentage is ρ = 0.5. Therefore, the addition of arrival information to the model has a positive impact on reducing the waiting time for Set 3.

6.6 Summary

The dynamic control of traffic lights using information on the queue lengths and on estimated arrival rates of incoming vehicles can be formulated as an MDP. Additionally, the model can be extended with information about the number of vehicles that arrive at a lane in a specific time slot. Controlling the traffic lights in a network of intersections is a challenging task: the computational burden of solving realistic cases is high when the arrival information is sufficiently large. This chapter showed how MDPs can contribute to constructing new TCTs that approximately minimize the long-run average waiting time (and thus reduce queuing delay).


Table 6.1: Numerical results for the F4C2 infrastructure

N. Q1 Q2 Q3 Q4 M N. States Mem. Avg. Iter. Time

Subset 1a (λf = 0.4, f = 1, 2, 3)

1   14  13  16  14   0         374,850    5.72 MB  10.84  468      142
2   15  14  15  16   1         913,920   13.95 MB  10.79  485      370
3   14  14  14  14   2       1,417,500   21.63 MB  10.72  462      648
4   15  15  14  15   3       3,440,640   52.50 MB  10.44  480    1,676
5   13  15  17  13   4       6,322,176   96.47 MB  10.40  483    3,292
6   15  13  15  14   5      12,042,240  183.75 MB  10.39  469    6,197
7   14  14  14  13  10     338,688,000    5.05 GB  10.27  460  199,231
8   15  18  16  12  15  15,410,397,184  228.63 GB  10.13  507  314,815*

Subset 1b (λf = 0.3, f = 1, 2, 3)

9    8  8  8  8   0       45,927    0.70 MB  4.164  109        4
10   8  8  8  7   1       81,648    1.25 MB  4.114  108        7
11   8  7  8  8   2      163,296    2.49 MB  4.051  110       18
12   7  9  8  6   3      282,240    4.31 MB  3.877  112       33
13   8  8  7  6   4      508,032    7.75 MB  3.865  108       60
14   8  9  8  7   5    1,451,520   22.15 MB  3.853  115      185
15   8  9  9  6  10   45,158,400  689.06 MB  3.751  120    6,943
16   7  8  7  6  15  924,844,032   13.78 GB  3.727  109  154,886

Subset 1c (λf = 0.2, f = 1, 2, 3)

17   5  5  5  5   0        9,072    0.14 MB  1.941   89       1
18   5  5  5  5   1       18,144    0.28 MB  1.899   96       1
19   5  5  5  5   2       36,288    0.55 MB  1.864  105       4
20   6  6  6  4   3       96,040    1.47 MB  1.760   73       7
21   6  6  6  4   4      192,080    2.93 MB  1.748   55      11
22   5  5  5  4   5      241,920    3.69 MB  1.742   56      15
23   5  5  5  4  10    7,741,440  118.13 MB  1.691   64     650
24   5  5  6  4  15  289,013,760    4.31 GB  1.681   71  32,603

Subset 1d (λ1 = λ3 = 0.2, and λ2 = λ4 = 0.4)

25   7  7   7  8   0       32,256    0.49 MB  3.865  100        3
26   8  8   8  7   1       81,648    1.25 MB  3.841  105        7
27   8  7  10  8   2      199,584    3.05 MB  3.773  114       22
28   7  8   6  6   3      197,568    3.01 MB  3.528  100       20
29   7  8   8  7   4      580,608    8.86 MB  3.512  108       64
30   7  8   8  7   5    1,161,216   17.72 MB  3.507  109      134
31   8  8   7  7  10   37,158,912  567.00 MB  3.454  116    5,715
32   7  8   7  6  15  924,844,032   13.78 GB  3.431  107  157,673

* Elapsed time with 7 MPI processes and 16 threads (see Chapter 7).


Table 6.2: Numerical results for the I2F2C2 infrastructure

N. Q1 Q2 Q3 Q4 M N. States Mem. Avg. Iter. Time

ρ = 0, and λ1 = λ2 = λ3 = 0.4

33  12  12  12  12   5     44,783,648  683.3 MB  6.863  348   24,295
34  15  14  14  14  10  2,709,504,000  40.37 GB  6.874  418  215,251*

ρ = 0, and λ1 = λ2 = λ3 = 0.3

35   7  8  7  6   5    6,322,176  96.47 MB  2.771  104   1,039
36   7  8  7  7  10  231,211,008   3.45 GB  2.755  111  53,163

ρ = 0, and λ1 = λ3 = 0.2, and λ2 = 0.4

37   6  6  7  7   5    4,917,248  75.03 MB  2.514  103     792
38   6  7  7  7  10  179,830,784   2.68 GB  2.51   101  37,597

ρ = 0.5, and λ1 = λ2 = λ3 = 0.4

39  12  12  14  12   5     51,673,440  788.47 MB  7.902  372  97,578
40  11  12  15  14  10  1,878,589,440   27.99 GB  7.772  396  84,096†

ρ = 0.5, and λ1 = λ2 = λ3 = 0.3

41   7  8  7  6   5    6,322,176  96.47 MB  3.254  104    3,426
42   7  8  7  6  10  202,309,632   3.01 GB  3.149  110  140,937

ρ = 0.5, and λ1 = λ3 = 0.2, and λ2 = 0.4

43   7  8  6  5   5    4,741,632  72.35 MB  2.577   87   2,165
44   7  7  6  5  10  134,873,088   2.01 GB  2.524   87  74,184

* Elapsed time with 8 MPI processes and 16 threads (see Chapter 7).
† Elapsed time with 16 MPI processes and 16 threads (see Chapter 7).


The new dynamic control policy can be used both with and without information on predicted arrival times. The contribution of having arrival information for a number of time units ahead has been investigated. From a simulation study of the so-called F4C2 and I2F2C2 infrastructures with different workloads, we can conclude that arrival information is useful to reduce the average waiting time of vehicles, even when arrival information in F4C2 is available for M = 1 only.

When the number of slots of arrival information is greater than or equal to ten (M ≥ 10), the algorithm requires a computer with a large amount of RAM. Sometimes the memory requirement exceeds the capacity of a single supercomputer node, making it necessary to distribute the instance over several nodes. For these cases, a parallel approach is needed.


CHAPTER 7

Determination of Traffic Control Tables in parallel

The parallelization of the Value Iteration method for this problem is discussed from a data management perspective. A suitable partition of the state space is needed to divide the workload properly among processing units, obtaining a good speedup when parallel computation is used [90]. The investigated infrastructures can be a basis for more complex ones, allowing the parallel approach developed here to be applied.

7.1 Parallel models

The sequential Algorithm 6.1 is an iterative process in which the approximation of a vector V by a vector Y of size |S| is updated based on its previous values, stored in a vector W, until termination condition (6.10) holds. The set of state values S can be partitioned in different ways, depending on |S|, on the possible clustering of states, and on the characteristics of the architecture at hand. When a message passing model is used, the number of messages should be as small as possible. For shared-memory architectures, the values in W should be accessed in such a way that the probability that they are available in cache is high [57].

DP models can be classified according to the recurrence function [31]. If the equation contains one recursive term, the formulation is called monadic. If the recursive function has multiple recursive terms, the DP formulation is called polyadic. The dependencies between stages can be classified as serial and non-serial. If the solutions to subproblems at a stage depend only on solutions to problems at the previous level, the formulation is called serial; otherwise, it is called non-serial. This classification is useful since it identifies the concurrency and dependencies that guide parallel formulations.
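For illustration (the examples are taken from the classification in [31], not from this thesis): the shortest-path recurrence x_j^{(l)} = min_k { x_k^{(l−1)} + c_{k,j} } is serial monadic, as it contains a single recursive term that depends only on the previous stage, whereas the optimal matrix-parenthesization recurrence C(i, j) = min_{i≤k<j} { C(i, k) + C(k+1, j) + r_{i−1} r_k r_j } is non-serial polyadic, as it combines several recursive terms from non-adjacent subproblems.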

In this particular case, the recurrence relation (6.3) is monadic and serial. The latter implies that there is no need to store the values of every stage. Therefore, Algorithm 6.1 only stores the values at stages n and n − 1, in a “current stage” vector Y and a “former stage” vector W. The calculation at a stage n depends on the previous calculations at stage n − 1. This characteristic is inherited from Markov Decision Processes.
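As a minimal sketch of this two-vector scheme (with hypothetical names; evaluate_state stands for the application of recurrence (6.3) to one state and is not reproduced here):

    #include <math.h>
    #include <string.h>

    /* Assumed to implement recurrence (6.3) for state j, reading the
       former-stage values from W. */
    extern double evaluate_state(long j, const double *W);

    void value_iteration(double *Y, double *W, long n_states, double eps)
    {
        for (;;) {
            double max_d = -INFINITY, min_d = INFINITY;
            for (long j = 0; j < n_states; j++) {
                Y[j] = evaluate_state(j, W);          /* current stage n */
                double d = Y[j] - W[j];
                if (d > max_d) max_d = d;
                if (d < min_d) min_d = d;
            }
            if (max_d - min_d < eps)                  /* span criterion (6.10) */
                break;
            memcpy(W, Y, n_states * sizeof(double));  /* stage n becomes n - 1 */
        }
    }

Swapping the two pointers would avoid the memcpy; the copy is kept here because the parallel versions discussed below also copy Y into W at the synchronization point.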

Two main approaches can be found for parallelizing DP algorithms [1]. The first approach calculates the next stage in parallel. The number of states is usually greater than the number of parallel tasks, so each task is in charge of a chunk of the state space. The second approach consists of dividing the stages among the parallel tasks; in this way, a pipeline is created. One of the disadvantages of the second approach is the management of the stage dependencies, which makes the scalability of the code a challenge. Here, we have developed the first approach.

In many Value Iteration algorithms, a maximum number of iterations is used as a stopping criterion. The stopping criterion defined in (6.10) requires additional calculation at the end of each iteration: the maximum and minimum difference Yj − Wj of each chunk must be exchanged among the parallel tasks in order to calculate the span over the complete state space according to (6.10). A synchronization point at the end of the update of Y is needed to check the termination condition and to have the data W available for the next iteration. This is an additional challenge for the parallelization of the process, because the synchronization deteriorates the performance considerably [68].

Consider the set of possible actions as outlined by (6.8) for the F4C2 case. The size of Kj is not the same for all states, because the size of the action sets varies: it depends on the light. One can take this specific feature into account in the division of the workload by distinguishing the traffic states (q and a) from the light state l. One way to consider this distinction is to map the state space graphically into a matrix where the columns correspond to the values of the lights and the rows are the different combinations of queue states and arrival information. In this way, every row has the same amount of work if we neglect the extrapolation of states where any qj = Q, j ∈ F. For the first case, one can depict the matrix-wise state space as

S = \begin{pmatrix}
s_1 & s_2 & \cdots & s_7 \\
s_8 & s_9 & \cdots & s_{14} \\
\vdots & \vdots & \ddots & \vdots \\
s_k & s_{k+1} & \cdots & s_N
\end{pmatrix}    (7.1)

The order in which the states are evaluated plays an important role for two reasons: work partitioning and memory access. Let us take as an example the F4C2 infrastructure, instance n. 17 (described in Chapter 6). The number of states for this instance is 9,072. Figure 7.1a represents the size of |Kj| for instance n. 17, i.e., the memory accesses necessary to determine the value of Yj depending on the state j, when the states are ordered according to the light (column-wise in matrix S). The number of values required to calculate a value of Y varies: |K| ∈ {16, 24, 28, 30, 31, 32, 40, 44, 48, 52, 56}. The ordering of the states has been obtained by iterating first over the light state l. The ordering of the states in the vector is of great importance if the workload is divided by chunking the states on the x-axis of Figure 7.1a.

Figure 7.1b represents the states in Kj whose values Wk are needed to calculate Yj, j = 1, ..., 9,072. Figure 7.1b exhibits a homogeneous retrieval of Wk values over the consecutive states j.

Two parallel models are discussed below. Both approaches make use of partitioning ofthe workload.


Figure 7.1: State ordering for instance n. 17. (a) Size of Kj, plotted as the size of K (15–60) against the state index (1–9,072). (b) Value dependences, plotted as the W index against the Y index (both 1–9,072).

7.1.1 Shared-memory approach

In a shared-memory architecture, the complete matrix of state values W is visible to all threads, so no message passing step is needed. Each thread is in charge of a chunk of the matrix Y; therefore, no conflict (or memory contention) occurs when updating the values of Y. Specifically, we now divide the matrix row-wise into chunks; in this way, the aim is to have a better-balanced workload and better scalability than when dividing the set of states in columns. At the end of each iteration, a barrier must be set in order to calculate the stopping criterion and to copy Y into W for the next iteration. This approach can be used when the memory requirement of the instance is less than the memory available in the shared-memory node.
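A minimal sketch of this barrier scheme under the same assumptions (hypothetical names; the barrier is assumed to be initialised for all worker threads, and one designated thread evaluates criterion (6.10) between two barriers):

    #include <pthread.h>
    #include <math.h>
    #include <string.h>

    extern double evaluate_state(long j, const double *W);  /* recurrence (6.3) */
    extern double *Y, *W;                 /* shared state-value vectors  */
    extern long n_states;
    extern double eps;
    extern pthread_barrier_t bar;         /* initialised for all workers */
    static volatile int stop = 0;

    typedef struct { long first, last; } chunk_t;  /* row-wise chunk [first, last) */

    void *worker(void *arg)
    {
        chunk_t *c = (chunk_t *)arg;
        while (!stop) {
            for (long j = c->first; j < c->last; j++)
                Y[j] = evaluate_state(j, W);  /* disjoint chunks: no contention */
            pthread_barrier_wait(&bar);       /* all of Y is now updated */
            if (c->first == 0) {              /* one thread checks (6.10) */
                double max_d = -INFINITY, min_d = INFINITY;
                for (long j = 0; j < n_states; j++) {
                    double d = Y[j] - W[j];
                    if (d > max_d) max_d = d;
                    if (d < min_d) min_d = d;
                }
                stop = (max_d - min_d < eps);
                memcpy(W, Y, n_states * sizeof(double));
            }
            pthread_barrier_wait(&bar);       /* W and stop consistent again */
        }
        return NULL;
    }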

7.1.2 Distributed-memory model

In this section, a hybrid version using MPI and Pthreads is presented. Two approaches are discussed to parallelize the algorithm with a message passing scheme. The states can be clustered in several ways: ordering the states as in Equation (7.1), one can divide the matrix column-wise or row-wise.

In [46], the state space matrix (7.1) was dealt with in a column-wise parallelization using a message passing model. This approach attempts to minimize message interchange between the MPI processes and to obtain a low memory requirement for each MPI process. The main drawback of this approach is that the number of MPI processes is fixed, which hinders scalability when the number of available cores is high. Figure 7.2 shows the dependencies between the light states for F4C2. The figure depicts, in a dynamic way, the action spaces of (6.8). It can be seen that red or green light states require more information about previous values than yellow light states.

If each process has a part of the matrix to compute the TCT, the file that contains the TCT can be generated in parallel using MPI I/O. Other parallel I/O libraries exist in the literature, but this issue is out of the scope of this work. For cases like instance n. 8 (introduced in Chapter 6), the entire vector W does not fit in the memory of a single node.

Here, we also investigate the feasibility of a parallelization where the states are ordered row-wise in the state matrix (7.1). The maximum and minimum difference Yj − Wj, required to calculate the stopping criterion (6.10), is shared using the MPI_Reduce procedure. The matrix W is shared among the MPI processes using the MPI_Allgatherv procedure, as the number of states per chunk is not always the same when the states are divided among the cores.


Figure 7.2: Message passing in lights (column-wise) distribution for F4C2. The seven light states shown are Red, GF1, GF2, Y1F1, Y2F1, Y1F2, and Y2F2.

Table 7.1: Speedup of Pthread and MPI-Pthread versions of TCT generation

N.   Pthreads                Hybrid (row-wise)        Hyb. (col-wise)
     2    4    8    16       2 (32)  4 (64)  8 (128)  7 (112)

7    1.9  3.0  6.0  12.0     17.6    18.4    18.7     35.0
16   1.7  2.9  5.8  11.6     16.8    21.2    21.7     37.5
24   1.8  3.0  6.0  12.0     17.4    20.5    22.3     39.0
32   1.9  3.0  5.7  12.1     17.4    20.8    22.9     38.2

36   1.9  3.0  6.1  11.1     15.0    20.4    24.9     –
38   1.9  3.0  6.0  11.1     15.1    20.2    24.1     –
42   1.8  2.8  5.7  10.3     16.0    27.4    42.5     –
44   1.8  2.8  5.6  10.2     16.0    27.4    42.6     –

(First column: instance number. For the hybrid versions, the parenthesized values give the total number of cores, since each MPI process runs 16 threads.)

For cases like instance n. 34, MPI_Allgatherv cannot be used, because its count parameter is an integer and the number of states exceeds the maximum value of an integer. For these large cases, the message is divided into chunks and sent using the MPI_Bcast function. Each MPI process can evaluate its corresponding chunk in parallel with the use of Pthreads.
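A minimal sketch of one iteration of this row-wise exchange (hypothetical names; for brevity the global span is obtained here with MPI_Allreduce so that every rank can evaluate the stopping criterion locally, whereas the text above uses MPI_Reduce):

    #include <mpi.h>
    #include <math.h>

    extern double evaluate_state(long j, const double *W);  /* recurrence (6.3) */

    /* W holds the complete former-stage vector on every rank; Ychunk holds this
       rank's states [first, first + count). counts/displs describe all chunks,
       as required by MPI_Allgatherv. */
    void vi_iteration(double *W, double *Ychunk, long first, int count,
                      const int *counts, const int *displs,
                      double eps, int *stop)
    {
        double loc_max = -INFINITY, loc_min = INFINITY, glob_max, glob_min;
        for (int i = 0; i < count; i++) {
            Ychunk[i] = evaluate_state(first + i, W);
            double d = Ychunk[i] - W[first + i];
            if (d > loc_max) loc_max = d;
            if (d < loc_min) loc_min = d;
        }
        MPI_Allreduce(&loc_max, &glob_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        MPI_Allreduce(&loc_min, &glob_min, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
        /* Every rank rebuilds the former-stage vector from all chunks. For
           chunks with more elements than an int can hold, the data must instead
           be split and broadcast piecewise, as described above. */
        MPI_Allgatherv(Ychunk, count, MPI_DOUBLE,
                       W, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
        *stop = (glob_max - glob_min < eps);     /* span criterion (6.10) */
    }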

7.2 Experimental results

A subset of the instances described in Chapter 6 is considered for testing the different approaches discussed in the previous section. For each subset, the most memory-consuming instance has been considered, except for Subset 1a, where the memory consumption of instance n. 8 exceeds the memory capacity of a node; in this case, instance n. 7 has been selected. Following the same criterion, several problems with a medium memory requirement have been selected from Sets 2 and 3.

Table 7.1 shows the scalability in terms of speedup obtained by the shared-memory model (Pthreads) and the distributed-memory model (a combination of MPI and Pthreads). Each MPI process works with 16 threads. The two MPI approaches are studied for the F4C2 instances; I2F2C2 has been solved by the row-wise approach.


The threaded version does not achieve a linear speedup due to the synchronization point at the end of each iteration. This synchronization among all threads is costly when the number of threads is large, as Table 7.1 reveals. The MPI version shows a poor speedup due to the overhead of message interchange. As in the threaded version, the synchronization at the end of each iteration is a drawback to achieving a good speedup. An MPI profiler revealed that the overhead amounts to about a third of the execution time [46].

Comparing the row-wise and the column-wise versions, the speedup of the column-wise version is better than that of the row-wise version. This is due to the reduction in data transfer between the nodes.

To show the computational burden that this model can generate, instance n. 8 was solved using the column-wise approach; the numerical results were shown in Table 6.1 of Chapter 6. This case illustrates that the aim of parallel computing is not only to obtain the results in less time: in some cases, the memory requirement is large enough to need the memory of several computing nodes.

7.3 Summary

Solving dynamic optimization problems by the Value Iteration algorithm is a challenge from the point of view of running the process in parallel. Each problem is characterized by details that make its solution procedure different from that of other instances. We have dealt with the generation of Traffic Control Tables by applying a Value Iteration algorithm based on Backward Induction. Each iteration of the algorithm requires a synchronization. Due to Bellman's curse of dimensionality, a small change in the infrastructure or an increase in queue or arrival information size increases the total state space considerably. A study of the search space reveals that the ordering of the states is important if one wants to preserve cache locality and workload balance. From the High Performance Computing perspective, a suitable ordering of the state space facilitates the initial workload balance when work is scattered among processes or threads. A near-linear speedup has been obtained with a threaded version for a medium-size instance of the problem. An MPI approach appeared less promising than the threaded version. Complex infrastructures with more flows can benefit from the parallel approach developed in this chapter.


CHAPTER 8

Conclusion

Throughout this work we have described in detail and empirically evaluated the main contributions of the present dissertation. Now it is time to summarise the findings with respect to the investigated research questions. In order to study the ability of High Performance Computing to handle hard-to-solve optimization problems, several cases have been investigated. A total of three different problems have been studied with respect to their solution by deterministic methods. The first two are handled by the Branch and Bound (B&B) method, while the last one concerns Dynamic Programming. The challenge in the first two was the irregular character of the implied search tree; for the last problem, the handling of large data structures posed a challenge from the computational viewpoint.

8.1 Discussion of the contributions

The first problem that has been investigated is the bi-blending problem. Designing two products under raw material scarcity involves a concurrent B&B search in two disjoint feasible areas, one per product. Lower bounds on raw material usage must be shared between the two products. New rejection tests, designed specifically for this problem, have been developed; these tests take the demand of both products and the availability of raw material into account. This search can easily be performed in parallel by two processing units, each of them in charge of one product. The search becomes complicated when the number of raw materials involved in the products to be designed is greater than five. For these cases, several processing units can be dedicated to each product.

The shared lower bound on raw material availability is not as accurate as one would like. This weakness leads to a great number of subspaces in the final lists that must be removed. Several strategies have been analysed to perform this filtering in parallel; filtering one list first and then the other offers the best performance.

The second investigated case is that of multidimensional Global Optimization. Seemingly trivial tasks like the division of a search space can have a large impact on the amount of computational work done by the search algorithm. The Longest Edge Bisection (LEB) strategy divides the search space by its longest edge. When the subspace contains more than one edge with the maximum length, the tree traversal depends on which longest edge the algorithm selects. The investigated LEBC strategy, based on the distance from the subspace centroid to the middle point of the longest edges, offers the best performance: the use of this strategy generates a smaller search tree than the rest of the strategies analysed in this investigation.

Another aspect taken into account was the selection rule, i.e., which element will be processed next. In this investigation, a hybrid selection rule (a combination of best-first and depth-first search) reveals a large memory requirement in the experiments. The use of a depth-first selection rule leads to a low memory requirement, but this strategy may steer the search towards a non-optimal subspace for some time. The Search Overhead Factor (SOF) is similar for both selection rules when the problem is solved to a low accuracy. For higher accuracy, depth-first search shows a non-decreasing SOF for a few instances, suffering from detrimental anomalies in the search. Nevertheless, the wall-clock time of the depth-first search method is better than that of the hybrid one for all tested cases.

In general, the curse of dimensionality limits B&B-type approaches in solving many practical problems up to a guaranteed accuracy. Research in mathematical modelling may provide better lower bounds, generating at least theoretical improvements. In the end, algorithms are run on a computer where, as is well known, selection rules provide a heuristic element in the efficiency, turning problems from unsolvable into solvable. This research has shown that computer science aspects concerning the organization of ordered structures in a computer affect execution times considerably.

One of the investigated questions was which models are appropriate to efficiently map B&B algorithms on a distributed computer architecture. It is important to take the architecture of the parallel machine into account to determine the level of parallelism of the algorithm: parallel B&B performance depends on the computational architecture on which the algorithm runs. A bi-level parallelization approach has been developed, using Pthreads for parallelism within a node and MPI for communication beyond a node. This hybrid approach will generally produce code with better scaling properties than a pure MPI approach. Due to the characteristics of each programming model, one MPI process is mapped per node, and each MPI process comprises as many threads as the node has cores. On the intra-node level, one can perform dynamic load balancing by generating threads dynamically; threads finish when their assigned work is completed. Other shared-memory models have been studied, such as the task-based TBB. Dynamic load balancing by work-stealing behaves well even when the tasks are not time-consuming. Using a framework like Bobpp, we can conclude that the number of working sets handled in parallel plays an important role in the performance. Usually, having multiple working sets gives better performance than having a single list, due to the memory contention problem; however, a working set shared by two threads gives the best performance for a large number of working threads. The inter-node level can carry out static or dynamic load balancing using MPI. The upper bound, essential to characterize and possibly reject the next simplex, can be shared between the MPI processes. Experimentation reveals a better performance if dynamic load balancing is used and the upper bound is shared, compared with the version without any communication between the MPI processes (nodes).

The third investigated case is the minimization of the expected waiting time per vehicle at a traffic light. The traffic at an intersection can be dynamically controlled by Traffic Control Tables. This problem is formulated as a Markov Decision Process. In order to investigate this case, we dealt with the application of the Value Iteration algorithm in stochastic Dynamic Programming. High Performance Computing facilitates solving large Markovian decision problems in a deterministic way. The curse of dimensionality leads to a large memory consumption, so an efficient management of the memory space is advisable to reduce the execution time. Having additional information on the number of vehicles that are approaching a lane increases the number of states of the problem exponentially. Experimentation reveals an improvement in the overall waiting time of vehicles at a traffic junction.

The Value Iteration algorithm, as the name indicates, is an iterative algorithm which stops when the solution reaches an imposed accuracy. In the parallel version of the algorithm, a synchronization point is mandatory at the end of each iteration. This affects the performance of the algorithm negatively, because the threads have to wait for the slowest thread in each iteration.

The algorithm needs the previous state values to calculate the current ones. In a distributed memory system, this poses a challenge, because the whole set of states must be sent between the nodes, and a large data transmission between the nodes causes large delays, reducing the performance. Studying the nature of the problem, one can observe that only a subset of the whole set of states is required to calculate the current state values for a given traffic light state. Partitioning the set into subsets according to the traffic light state permits the algorithm to solve problems that are intractable with the method that shares the whole state space. The main drawback of the light division approach is its scalability, which depends on the number of light states.

In general, algorithm design depends on the characteristics of the problems to be solved. Nevertheless, the findings obtained in this investigation can be extrapolated to other problems.

8.2 Future lines of research

The mixture design problem for several products, the multi-blending problem, can be addressed in future research. Regarding parallel computation, a new version of the parallel bi-blending algorithm can be developed in order to experiment with larger-dimensional problems, trying to decrease the computational cost. Another future research question is to develop the parallel version of the multi-blending algorithm, which is a problem of interest to industry.

Numerical experiments show that the computer memory requirement of this type of algorithm increases drastically with the dimension of the products. Future research will focus on the reduction of the current memory requirement and on the use of distributed computing, in order to alleviate this problem.

For future work, it would be interesting to develop and investigate the effect of a selection rule with a low memory requirement that also provides a low search overhead, in order to improve the performance of the sequential and parallel algorithms. In addition, it would be interesting to develop an LEB heuristic which does not need to calculate the length of the edges. This would speed up the execution considerably, since a great percentage of the execution time is spent doing these calculations.

High Performance Computing is a developing area where new architectures and new programming paradigms are released regularly. The task-based paradigm gives a good performance even for irregular algorithms. A hybrid version using MPI and a task-based approach such as TBB could be of interest. The use of a dynamic number of threads in future TBB auto-tuned versions is an appealing approach to be studied, as well as the decision of when to use more than one thread per queue. Additionally, an interesting future research question is the effect of using a cut-off to limit the parallelism up to a certain level of the search tree, in order to reduce the parallel overhead in the TBB and Pthreaded versions.

Dynamic Programming implies the processing of a large data structure, applying the recurrence function to each state value stored as an element of the data structure. This type of processing corresponds to the many-core paradigm, where many elements can be processed in parallel. Another possibility is the use of new programming paradigms like TBB for this kind of processing. Regarding distributed computation, other partition techniques should be studied in order to reduce and optimize the data transmission between the nodes. In this way, future investigation should focus on the parallel computation of Traffic Control Tables for more complicated infrastructures that imply a larger number of traffic states.


Appendices



APPENDIX A

Products

Case1

Dimension = 3; Raw materials involved = RM1, RM2, RM4;
Raw material cost = (0.1, 0.7, 4.0);
Linear constraints:
h1(x) = −1.5x1 + 0.5x2 − 0.5x3 ≥ 0.0
h2(x) = 0.3x1 − 0.5x2 − 0.3x3 ≥ 0.0
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 1, 2.

Case2

Dimension = 3; Raw materials involved = RM1, RM2, RM3;
Raw material cost = (0.1, 0.7, 1.0);
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 3, ..., 7.

Case3

Dimension = 3; Raw materials involved = RM1, RM2, RM4;
Raw material cost = (0.1, 0.7, 4.0);
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 7, ..., 10.

Case4

Dimension = 3; Raw materials involved = RM2, RM3, RM4;
Raw material cost = (0.7, 1.0, 4.0);
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 2, 7, 9, 11.


Uni5Spec1

Dimension = 5; Raw materials involved = RM5, RM6, RM7, RM8, RM9;
Raw material cost = (114, 115, 107, 127, 115);
Linear constraint:
h3(x) = 0.1493x1 + 0.6927x2 + 0.4643x3 + 0.7975x4 + 0.5967x5 ≥ 0.35
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 12, ..., 14.

Uni5Spec5b

Dimension = 5; Raw materials involved = RM5, RM6, RM7, RM8, RM9;
Raw material cost = (114, 115, 107, 127, 115);
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 15, ..., 18.

Uni7Spec1

Dimension = 7; Raw materials involved = RM5, RM6, RM7, RM8, RM9, RM10, RM11;
Raw material cost = (114, 115, 107, 127, 115, 106, 108);
Linear constraint:
h4(x) = 0.1493x1 + 0.6927x2 + 0.4643x3 + 0.7975x4 + 0.5967x5 + 0.6235x6 + 0.5284x7 ≥ 0.35
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 19, ..., 21.

Uni7Spec5b

Dimension = 7; Raw materials involved = RM5, RM6, RM7, RM8, RM9, RM10, RM11;
Raw material cost = (114, 115, 107, 127, 115, 106, 108);
Quadratic constraints: g_i(x) = x^T A_i x + b_i^T x + d_i ≤ 0; i = 22, ..., 25.

Quadratic constraints

A1[3×3] = (0, −16, 0, −16, 0, 0, 0, 0, 0); b1[3×1] = (8, 8, 0); d1 = −1
A2[3×3] = (10, 0, 2, 0, 0, 0, 2, 0, 2); b2[3×1] = (−12, 0, −4); d2 = 3.7
A3[3×3] = (0.001, −0.001, 0.0085, −0.001, 0.008, −0.0105, 0.0085, −0.0105, −0.021); b3[3×1] = (−0.0145, −0.0205, 0.073); d3 = −0.0165
A4[3×3] = (−0.004, 0.0005, 0.002, 0.0005, −0.001, −0.003, 0.002, −0.003, 0.014); b4[3×1] = (0.0155, 0.0515, −0.121); d4 = −0.006
A5[3×3] = (20.605, −5.087, −10.9885, −5.087, 32.003, −43.476, −10.9885, −43.476, −81.278); b5[3×1] = (0.1995, −0.097, 126.7685); d5 = −20.5063
A6[3×3] = (0.766, −0.1205, 2.4735, −0.1205, 0.528, 1.9835, 2.4735, 1.9835, −7.822); b6[3×1] = (−2.432, −15.191, 10.712); d6 = 3.21125
A7[3×3] = (116.75, −3.09, 168.553, −3.09, −67.424, 515.114, 168.553, 515.114, −845.215); b7[3×1] = (−287.43, −645.926, 354.537); d7 = 115.0953
A8[3×3] = (1.0, 3.0, −0.5, 3.0, −5.0, −3.5, −0.5, −3.5, −2.0); b8[3×1] = (0.832, 0.832, 0.832); d8 = 0.968
A9[3×3] = (2.0, −1.5, 1.0, −1.5, 1.0, −1.0, 1.0, −1.0, 3.0); b9[3×1] = (0.12, 0.12, 0.12); d9 = −1.60


A10[3×3] = (4.0, −1.5, −1.5, −1.5, 4.0, −2.5, −1.5, −2.5, 4.0); b10[3×1] = (−0.026, −0.026, −0.026); d10 = −2.141
A11[3×3] = (4.0, −1.0, −2.0, −1.0, 5.0, −3.0, −2.0, −3.0, 4.0); b11[3×1] = (−0.019, −0.019, −0.019); d11 = −2.631
A12[5×5] = (−1.473, 8.215, −27.204, 46.119, 2.059, −11.929, −12.768, 8.215, 37.733346, 5.127, 95.691, 34.954, 20.165, 19.445, −27.204, 5.127, −21.743, 36.843, −7.126, 4.029, −4.152, 46.119, 95.691, 36.843, 189.643, 93.359, 52.904, 54.802, 2.059, 34.954, −7.126, 93.356, 31.885, 7.528, 10.248, −11.929, 20.165, 4.029, 52.904, 7.528, 11.951, 10.964, −12.768, 19.445, −4.152, 54.802, 10.248, 10.964, 7.197)
b12[5×1] = (4.5675, 34.7289, 70.5707, −82.2761, 29.3169); d12 = −35
A13[5×5] = (1.35, −4.41, 17.60, −92.45, 2.74, −29.94, −14.05, −4.41, −39.13, −6.11, −126.38, −29.81, −63.42, −43.97, 17.60, −6.11, 15.45, −76.60, 5.93, −44.05, −20.54, −92.45, −126.38, −76.60, −240.64, −117.46, −125.18, −114.98, 2.74, −29.81, 5.93, −117.46, −22.90, −47.37, −30.68, −29.94, −63.42, −44.05, −125.18, −47.37, −73.39, −73.99, −14.05, −43.97, −20.54, −114.98, −30.68, −73.99, −55.33)
b13[5×1] = (−2.1232, −9.0403, −42.2072, 190.5292, −9.9529); d13 = 10
A14[5×5] = (−0.670, 4.284, −12.837, 23.708, 1.677, −8.964, −4.859, 4.284, 21.380, −1.188, 28.990, 13.216, 17.177, 16.620, −12.837, −1.189, −21.376, 9.841, −7.298, −10.043, −8.981, 23.708, 28.990, 9.841, 49.385, 25.574, 15.561, 21.666, 1.677, 13.216, −7.298, 25.574, 8.419, 4.149, 6.595, −8.965, 17.177, −10.043, 15.561, 4.149, 1.090, 6.292, −4.859, 16.620, −8.981, 21.666, 6.594, 6.292, 5.906)
b14[5×1] = (0.7097, −13.0982, 27.5078, −49.1608, −7.3725); d14 = −2
A15 = −A12; b15 = −b12; d15 = 45
A16 = −A13; b16 = −b13; d16 = −21
A17[5×5] = (0.0, −11.556, −1.114, 14.690, −11.411, 0.121, −0.150, −11.556, −3.316, −2.116, 7.313, −8.800, 19.897, 9.051, −1.114, −2.116, 4.728, 16.250, −4.535, 18.319, 11.537, 14.690, 7.313, 16.250, 40.428, 9.766, 21.512, 15.266, −11.412, −8.800, −4.535, 9.766, −10.165, 10.088, 1.889, 0.121, 19.897, 18.319, 21.511, 10.088, 28.569, 27.239, −0.150, 9.051, 11.537, 15.266, 1.889, 27.239, 19.965)
b17[5×1] = (1.7278, 23.5166, 5.6724, −32.0798, 19.0154); d17 = −5
A18 = A14; b18 = b14; d18 = −1
A19[7×7] = (−1.473, 8.215, −27.204, 46.119, 2.059, −11.929, −12.768, 8.215, 37.733346, 5.127, 95.691, 34.954, 20.165, 19.445, −27.204, 5.127, −21.743, 36.843, −7.126, 4.029, −4.152, 46.119, 95.691, 36.843, 189.643, 93.359, 52.904, 54.802, 2.059, 34.954, −7.126, 93.356, 31.885, 7.528, 10.248, −11.929, 20.165, 4.029, 52.904, 7.528, 11.951, 10.964, −12.768, 19.445, −4.152, 54.802, 10.248, 10.964, 7.197)
b19[7×1] = (4.5675, 34.7289, 70.5707, −82.2761, 29.3169, 71.0818, 63.7614); d19 = −35
A20[7×7] = (1.35, −4.41, 17.60, −92.45, 2.74, −29.94, −14.05, −4.41, −39.13, −6.11, −126.38, −29.81, −63.42, −43.97, 17.60, −6.11, 15.45, −76.60, 5.93, −44.05, −20.54, −92.45, −126.38, −76.60, −240.64, −117.46, −125.18, −114.98, 2.74, −29.81, 5.93, −117.46, −22.90, −47.37, −30.68, −29.94, −63.42, −44.05, −125.18, −47.37, −73.39, −73.99, −14.05, −43.97, −20.54, −114.98, −30.68, −73.99, −55.33)
b20[7×1] = (−2.1232, −9.0403, −42.2072, 190.5292, −9.9529, 1.8162, 5.1622); d20 = 10
A21[7×7] = (−0.670, 4.284, −12.837, 23.708, 1.677, −8.964, −4.859, 4.284, 21.380, −1.188, 28.990, 13.216, 17.177, 16.620, −12.837, −1.189, −21.376, 9.841, −7.298, −10.043, −8.981, 23.708, 28.990, 9.841, 49.385, 25.574, 15.561, 21.666, 1.677, 13.216, −7.298, 25.574, 8.419, 4.149, 6.595, −8.965, 17.177, −10.043, 15.561, 4.149, 1.090, 6.292, −4.859, 16.620, −8.981, 21.666, 6.594, 6.292, 5.906)
b21[7×1] = (0.7097, −13.0982, 27.5078, −49.1608, −7.3725, 33.6731, 11.3136); d21 = −2


A22 = −A19; b22 = −b19; d22 = 45
A23 = −A20; b23 = −b20; d23 = −21
A24[7×7] = (0.0, −11.556, −1.114, 14.690, −11.411, 0.121, −0.150, −11.556, −3.316, −2.116, 7.313, −8.800, 19.897, 9.051, −1.114, −2.116, 4.728, 16.250, −4.535, 18.319, 11.537, 14.690, 7.313, 16.250, 40.428, 9.766, 21.512, 15.266, −11.412, −8.800, −4.535, 9.766, −10.165, 10.088, 1.889, 0.121, 19.897, 18.319, 21.511, 10.088, 28.569, 27.239, −0.150, 9.051, 11.537, 15.266, 1.889, 27.239, 19.965)
b24[7×1] = (1.7278, 23.5166, 5.6724, −32.0798, 19.0154, 16.5074, 7.31003); d24 = −5
A25 = A21; b25 = b21; d25 = −1

Found solutions

Case1 & Case2

E1) Robustness ε = √2/100. No capacity constraints.
E2) Robustness ε = √2/150. No capacity constraints.
E3) Robustness ε = √2/150. B1 = 0.86 units available.
E4) Robustness ε = √2/150. B1 = 0.80 units available.

Environment 1

No solution was found.

Environment 2

x4:
        RM1       RM2       RM3       RM4
Case1   0.563203  0.357031  –         0.079766
Case2   0.570312  0.282383  0.147305  –

(a dash marks a raw material not involved in the corresponding product)

F(x4) = 0.625305 + 0.402004 = 1.027309.

Environment 3

x4:
        RM1       RM2       RM3       RM4
Case1   0.513438  0.378359  –         0.108203
Case2   0.346367  0.456563  0.197070  –

F(x4) = 0.749008 + 0.551301 = 1.300309.

Environment 4

No solution was found.

Case3 & Case4

Robustness ε = √2/150.

x3:
        RM1       RM2       RM3       RM4
Case3   0.0       0.823125  –         0.176875
Case4   –         0.524101  0.374805  0.101094

F(x3) = 1.283688 + 1.146051 = 2.429739.

x4:
        RM1       RM2       RM3       RM4
Case3   0.224609  0.775391  –         0.0
Case4   –         0.573867  0.349922  0.076211

F(x4) = 0.565234 + 1.056473 = 1.621707.

Uni5Spec1 & Uni5Spec5b

Robustness ε = √2/100.

E1) No capacity constraints.
E2) B1 = 0.62 units available.
E3) B3 = 0.6 units available.
E4) B4 = 0.21 units available.
E5) B5 = 0.01 units available.
E6) B1 = 0.62 and B3 = 0.6 units available.

Environments 1 and 2

x4:
            RM1       RM2  RM3       RM4       RM5
Uni5Spec1   0.428125  0.0  0.442344  0.0       0.129531
Uni5Spec5b  0.14      0.0  0.174375  0.222500  0.463125

F(x4) = 111.03 + 116.14 = 227.17.

Environment 3

x4:
            RM1       RM2   RM3       RM4       RM5
Uni5Spec1   0.428125  0.0   0.435234  0.0       0.136641
Uni5Spec5b  0.146875  0.0   0.164063  0.232812  0.456250

x5:
            RM1       RM2   RM3       RM4       RM5
Uni5Spec1   0.428125  0.0   0.442344  0.0       0.129531
Uni5Spec5b  0.156172  0.03  0.152852  0.212617  0.448359

F(x4) = 111.09 + 116.33 = 227.42.
F(x5) = 111.03 + 116.17 = 227.20.

Environment 4

x5:
            RM1       RM2       RM3       RM4       RM5
Uni5Spec1   0.428125  0.0       0.442344  0.0       0.129531
Uni5Spec5b  0.162812  0.043281  0.142891  0.209297  0.441719

F(x5) = 111.03 + 116.21 = 227.24.


Environment 5

x4:
            RM1       RM2       RM3       RM4       RM5
Uni5Spec1   0.499219  0.108203  0.392578  0.0       0.0
Uni5Spec5b  0.182969  0.133125  0.253437  0.430469  0.0

F(x4) = 111.36 + 117.96 = 229.32.

Environment 6

x4:
            RM1       RM2   RM3       RM4       RM5
Uni5Spec1   0.428125  0.0   0.435234  0.0       0.136641
Uni5Spec5b  0.146875  0.0   0.164063  0.232812  0.456250

x5:
            RM1       RM2   RM3       RM4       RM5
Uni5Spec1   0.428125  0.0   0.442344  0.0       0.129531
Uni5Spec5b  0.156172  0.03  0.152852  0.212617  0.448359

F(x4) = 111.09 + 116.33 = 227.42.
F(x5) = 111.03 + 116.17 = 227.20.


APPENDIX B

Function definitions

Lipschitz

Levy No. 15

f(x) = sin^2(3πx_1) + Σ_{i=1}^{n−1} (x_i − 1)^2 (1 + sin^2(3πx_{i+1})) + (x_n − 1)^2 (1 + sin^2(2πx_n))

Shekel 5

f(x) = − Σ_{i=1}^{5} 1 / ((x − a_i)(x − a_i)^T + c_i)

where

a1 = (4, 4, 4, 4), c1 = 0.1

a2 = (1, 1, 1, 1), c2 = 0.2

a3 = (8, 8, 8, 8), c3 = 0.2

a4 = (6, 6, 6, 6), c4 = 0.4

a5 = (3, 7, 3, 7), c5 = 0.4

Shekel 7

f(x) = − Σ_{i=1}^{7} 1 / ((x − a_i)(x − a_i)^T + c_i)


where

a1 = (4, 4, 4, 4), c1 = 0.1

a2 = (1, 1, 1, 1), c2 = 0.2

a3 = (8, 8, 8, 8), c3 = 0.2

a4 = (6, 6, 6, 6), c4 = 0.4

a5 = (3, 7, 3, 7), c5 = 0.4

a6 = (2, 9, 2, 9), c6 = 0.6

a7 = (5, 5, 3, 3), c7 = 0.6

Shekel 10

f(x) = − Σ_{i=1}^{10} 1 / ((x − a_i)(x − a_i)^T + c_i)

where

a1 = (4.0, 4.0, 4.0, 4.0), c1 = 0.1

a2 = (1.0, 1.0, 1.0, 1.0), c2 = 0.2

a3 = (8.0, 8.0, 8.0, 8.0), c3 = 0.2

a4 = (6.0, 6.0, 6.0, 6.0), c4 = 0.4

a5 = (3.0, 7.0, 3.0, 7.0), c5 = 0.4

a6 = (2.0, 9.0, 2.0, 9.0), c6 = 0.6

a7 = (5.0, 5.0, 3.0, 3.0), c7 = 0.6

a8 = (8.0, 1.0, 8.0, 1.0), c8 = 0.7

a9 = (6.0, 2.0, 6.0, 2.0), c9 = 0.5

a10 = (7.0, 3.6, 7.0, 3.6), c10 = 0.5

Schwefel 1.2

f(x) = Σ_{i=1}^{4} ( Σ_{j=1}^{i} x_j )^2

Powell

f(x) = (x_1 + 10x_2)^2 + 5(x_3 − x_4)^2 + (x_2 − 2x_3)^4 + 10(x_1 − x_4)^4

Levy No. 9

f(x) = sin^2(3πy_1) + Σ_{i=1}^{n−1} (y_i − 1)^2 (1 + sin^2(3πy_{i+1})) + (y_n − 1)^2

where y_i = 1 + (x_i − 1)/4

Levy No. 16

f(x) = sin^2(3πx_1) + Σ_{i=1}^{n−1} (x_i − 1)^2 (1 + sin^2(3πx_{i+1})) + (x_n − 1)^2 (1 + sin^2(2πx_n))

Levy No. 10

f(x) = sin^2(3πy_1) + Σ_{i=1}^{n−1} (y_i − 1)^2 (1 + 10 sin^2(πy_{i+1})) + (y_n − 1)^2

where y_i = 1 + (x_i − 1)/4

Levy No. 17

f(x) = sin^2(3πx_1) + Σ_{i=1}^{n−1} (x_i − 1)^2 (1 + sin^2(3πx_{i+1})) + (x_n − 1)^2 (1 + sin^2(2πx_n))

Baritompa

Each test function description is given with the considered minimum point x*, minimum value f* and maximum value f̄ over the domain in Table 4.1 for dimensions n = 4, 5, 6.

Ackley

f(x) = −20 exp(−0.2 √((1/n) Σ_{i=1}^{n} x_i^2)) − exp((1/n) Σ_{i=1}^{n} cos(2πx_i)) + 20 + e

x* = 0, f* = 0, f̄_4 = 22.2, f̄_5 = 22.2, f̄_6 = 22.2

Dixon & Price

f(x) = (x_1 − 1)^2 + Σ_{i=2}^{D} i (2x_i^2 − x_{i−1})^2

x_i* = 2^{−(2^i − 2)/2^i}, f* = 0, f̄_4 = 397,021, f̄_5 = 617,521, f̄_6 = 88,212

Holzman

f(x) = Σ_{i=1}^{n} i x_i^4

x* = 0, f* = 0, f̄_4 = 100,000, f̄_5 = 150,000, f̄_6 = 210,000


MaxMod

f(x) = max_i |x_i|

x* = 0, f* = 0, f̄_4 = 10, f̄_5 = 10, f̄_6 = 10

Perm

f(x) = Σ_{i=1}^{n} ( Σ_{j=1}^{n} (j^i + β) ((x_j / j)^i − 1) )^2

x_i* = i, f* = 0, f̄_4 = 809,249, f̄_5 = 476,712,082, f̄_6 = 59,926,724,566

Pinter

f(x) = Σ_{i=1}^{n} i x_i^2 + Σ_{i=1}^{n} 20 i sin^2 A + Σ_{i=1}^{n} i log_{10}(1 + i B^2)

where

A = x_{i−1} sin x_i + sin x_{i+1}
B = x_{i−1}^2 − 2x_i + 3x_{i+1} − cos x_i + 1

with x_0 = x_n and x_{n+1} = x_0.
x* = 0, f* = 0, f̄_4 = 500, f̄_5 = 625, f̄_6 = 751

Quintic

f(x) = Σ_{i=1}^{n} |x_i^5 − 3x_i^4 + 4x_i^3 + 2x_i^2 − 10x_i − 4|

x_i* = −1, f* = 0, f̄_4 = 532,816, f̄_5 = 668,520, f̄_6 = 802,224

Rastrigin

f(x) = 10n + Σ_{i=1}^{n} (x_i^2 − 10 cos(2πx_i))

x* = 0, f* = 0, f̄_4 = 153, f̄_5 = 190, f̄_6 = 231

Rosenbrock

f(x) = Σ_{i=1}^{n−1} [100 (x_{i+1} − x_i^2)^2 + (x_i − 1)^2]

x_i* = 1, f* = 0, f̄_4 = 2,722,743, f̄_5 = 3,532,824, f̄_6 = 4,342,905


Schwefel 1.2

f(x) = Σ_{i=1}^{n} ( Σ_{j=1}^{i} x_j )^2

x* = 0, f* = 0, f̄_4 = 3,000, f̄_5 = 5,500, f̄_6 = 9,100

Zakharov

f(x) = Σ_{i=1}^{n} x_i^2 + ( (1/2) Σ_{i=1}^{n} i x_i )^2 + ( (1/2) Σ_{i=1}^{n} i x_i )^4

x* = 0, f* = 0, f̄_4 = 6,252,900, f̄_5 = 31,646,750, f̄_6 = 121,562,250


APPENDIX C

Publications arisen from this thesis

The research work carried out for the present thesis resulted in a number of publications. This appendix lists them sorted by their year of publication (newest first) within each category.

Publications

[A1] J.F.R. Herrera, J.M.G. Salmerón, E.M.T. Hendrix, R. Asenjo, and L.G. Casado. On load balancing strategies for shared-memory parallel branch-and-bound algorithms. The Journal of Supercomputing, submitted.

[A2] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. Pareto optimality and robustness in bi-blending problems. TOP, 22(1):254–273, 2014. DOI 10.1007/s11750-012-0253-9. An erratum to this article can be found at http://dx.doi.org/10.1007/s11750-012-0258-4.

[A3] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. A threaded approach of the quadratic bi-blending algorithm. The Journal of Supercomputing, 64(1):38–48, 2013. DOI 10.1007/s11227-012-0783-9.

Presentations at international conferences

[B1] J.F.R. Herrera, L.G. Casado, and E.M.T. Hendrix. On solving blending problems by a branch and bound algorithm using regular sub-simplices. In 27th European Conference on Operational Research, page 298, Glasgow, UK, July 2015.

[B2] E.M.T. Hendrix, J.F.R. Herrera, L.G. Casado, and I. García. Simplicial branch and bound based on the upper fitting, longest edge bisection. In 13th EUROPT Workshop on Advances in Continuous Optimization, Edinburgh, UK, July 2015.


[B3] J.F.R. Herrera, J.M.G. Salmerón, E.M.T. Hendrix, and L.G. Casado. Search strategies in libraries for parallel branch-and-bound algorithms on shared-memory systems. In J. Vigo Aguiar, editor, Proceedings of the 2015 International Conference on Mathematical Methods in Science and Engineering (CMMSE-2015), pages 645–648, Rota, Spain, July 2015. ISBN: 978-84-617-2230-3.

[B4] R. Haijema, E.M.T. Hendrix, J.F.R. Herrera, and L.G. Casado. Optimal control of traffic lights and the value of arrival time information. In mobil.TUM 2015 International Scientific Conference on Mobility and Transport, Munich, Germany, June 2015.

[B5] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. Heuristics for longest edge selection in simplicial branch and bound. In Computational Science and Its Applications – ICCSA 2015, pages 445–456. Springer International Publishing, 2015.

[B6] J.F.R. Herrera, L.G. Casado, and E.M.T. Hendrix. On longest edge division in simplicial branch and bound. In L.G. Casado, I. García, and E.M.T. Hendrix, editors, Proceedings of the XII Global Optimization Workshop, Mathematical and Applied Global Optimization, MAGO 2014, pages 85–88, Malaga, Spain, September 2014. ISBN: 978-84-16027-57-6.

[B7] J.F.R. Herrera, E.M.T. Hendrix, L.G. Casado, and R. Haijema. Data parallelism in traffic control tables with arrival information. In Luís Lopes, Julius Žilinskas, Alexandru Costan, Roberto G. Cascella, Gabor Kecskemeti, Emmanuel Jeannot, Mario Cannataro, Laura Ricci, Siegfried Benkner, Salvador Petit, Vittorio Scarano, José Gracia, Sascha Hunold, Stephen L. Scott, Stefan Lankes, Christian Lengauer, Jesus Carretero, Jens Breitbart, and Michael Alexander, editors, Euro-Par 2014: Parallel Processing Workshops, volume 8805 of Lecture Notes in Computer Science, pages 60–70. Springer International Publishing, 2014.

[B8] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. On simplicial longest edge bisection in Lipschitz global optimization. In B. Murgante, S. Misra, A.M.A.C. Rocha, C. Torre, J.G. Rocha, M.I. Falcão, D. Taniar, B.O. Apduhan, and O. Gervasi, editors, Computational Science and Its Applications – ICCSA 2014, volume 8580 of Lecture Notes in Computer Science, pages 104–114. Springer International Publishing, 2014. ISBN: 978-3-319-09128-0. DOI 10.1007/978-3-319-09129-7_8.

[B9] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, R. Paulavičius, and J. Žilinskas. Dynamic and hierarchical load-balancing techniques applied to parallel branch-and-bound methods. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Eighth International Conference on, Compiègne, France, October 2013. ISBN: 978-0-7695-5094-7. DOI 10.1109/3PGCIC.2013.85.

[B10] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. Improvements of sequential and parallel bi-blending algorithms. In Book of Abstracts of the 4th International Conference on Continuous Optimization, page 75, Caparica, Lisbon, Portugal, July 2013.

[B11] J.F.R. Herrera, L.G. Casado, R. Paulavičius, J. Žilinskas, and E.M.T. Hendrix. On a hybrid MPI-Pthread approach for simplicial branch-and-bound. In Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2013 IEEE 27th International, pages 1764–1770, Boston, Massachusetts, USA, May 2013. ISBN: 978-0-7695-4979-8. DOI 10.1109/IPDPSW.2013.178.


[B12] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. On generating robust solutions of quadratically constrained bi-blending recipe design. In Proceedings of the OR Peripatetic Post-Graduate Programme (ORP3-2011), pages 111–115, Cádiz, Spain, September 2011. ISBN: 978-84-9828-348-8.

[B13] J.F.R. Herrera, L.G. Casado, E.M.T. Hendrix, and I. García. On designing robust mixture recipes for multiple products simultaneously. In Book of Abstracts of the Optimization 2011 conference, page 119, Caparica, Portugal, July 2011.

[B14] J.F.R. Herrera, L.G. Casado, I. García, and E.M.T. Hendrix. On parallelizing a bi-blend optimization algorithm. In J. Vigo Aguiar, editor, Proceedings of the 2011 International Conference on Mathematical Methods in Science and Engineering (CMMSE-2011), volume II, pages 642–653, Benidorm, Spain, June 2011. ISBN: 978-84-614-6167-7.

Presentations at national conferences

[C1] J.F.R. Herrera, T. Menouer, B. Le Cun, E.M.T. Hendrix, and L.G. Casado. Resolución de problemas de optimización global mediante el framework Bobpp. In Actas de las XXVI Jornadas de Paralelismo, pages 494–499, Cordoba, Spain, September 2015. ISBN: 978-84-16017-52-2.

[C2] J.F.R. Herrera, E.M.T. Hendrix, L.G. Casado, and R. Haijema. Paralelismo de datos en la obtención de tablas de control de tráfico con información de llegada. In A. Gonzalez-Escribano, D.R. Llanos, and B. Sahelices, editors, Actas de las XXV Jornadas de Paralelismo, pages 219–224, Valladolid, Spain, September 2014. ISBN: 978-84-697-0329-3.

[C3] J.F.R. Herrera, L.G. Casado, I. García, and E.M.T. Hendrix. Paralelización del algoritmo de bi-mezcla. In F. Almeida, V. Blanco, C. León, C. Rodríguez, and F. de Sande, editors, Actas de las XXII Jornadas de Paralelismo, pages 21–26, La Laguna, Spain, September 2011. ISBN: 978-84-694-1791-1.


APPENDIX D

Other publications produced during the elaboration of this thesis

The research effort invested during the time span in which this thesis was elaborated produced some additional publications as the result of other research lines not included in the present dissertation. Those lines were portfolio optimization [D1], simplicial DIRECT [D2] and minimum search tree in branch-and-bound [D3]. This appendix lists them sorted by their year of publication (newest first).

Other publications

[D1] E.M.T. Hendrix, J.F.R. Herrera, M. Janssen, and L.G. Casado. On finding optimal portfolios with risky assets. In Book of Abstracts of the 21st International Symposium on Mathematical Programming (ISMP-2012), page 228, Berlin, Germany, August 2012.

[D2] R. Paulavičius, J. Žilinskas, J.F.R. Herrera, and L.G. Casado. A parallel DISIMPL for pile placement optimization in grillage-type foundations. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Eighth International Conference on, Compiègne, France, October 2013. ISBN: 978-0-7695-5094-7. DOI 10.1109/3PGCIC.2013.90.

[D3] J.M.G. Salmerón, J.F.R. Herrera, G. Aparicio, L.G. Casado, I. García, and E.M.T. Hendrix. Estrategias paralelas para obtener el tamaño del árbol mínimo en la división por el lado mayor de un símplice regular. In A. Gonzalez-Escribano, D.R. Llanos, and B. Sahelices, editors, Actas de las XXV Jornadas de Paralelismo, pages 29–35, Valladolid, Spain, September 2014. ISBN: 978-84-697-0329-3.


Bibliography

[1] F. Almeida, D. Gonzalez, and I. Pelaez. Parallel Combinatorial Optimization, chapter Parallel Dynamic Programming, pages 29–52. Wiley, 2006.

[2] P. Amar, M. Baillieul, D. Barth, B. Le Cun, F. Quessette, and S. Vial. Parallel biological in silico simulation. In T. Czachórski, E. Gelenbe, and R. Lent, editors, Information Sciences and Systems 2014, pages 387–394. Springer International Publishing, 2014.

[3] G. Aparicio, L. Casado, B. G-Tóth, E. Hendrix, and I. García. Heuristics to reduce the number of simplices in longest edge bisection refinement of a regular n-simplex. In B. Murgante, S. Misra, A. Rocha, C. Torre, J. Rocha, M. Falcão, D. Taniar, B. Apduhan, and O. Gervasi, editors, Computational Science and Its Applications – ICCSA 2014, volume 8580 of Lecture Notes in Computer Science, pages 115–125. Springer International Publishing, 2014.

[4] G. Aparicio, L. G. Casado, E. M. T. Hendrix, B. G.-Tóth, and I. Garcia. On the minimum number of simplex shapes in longest edge bisection refinement of a regular n-simplex. Informatica, 26(1):17–32, 2015.

[5] G. Aparicio, L. G. Casado, E. M. T. Hendrix, I. Garcia, and B. G. Toth. On computational aspects of a regular n-simplex bisection. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on, pages 513–518, October 2013.

[6] J. Ashayeri, A. G. M. van Eijs, and P. Nederstigt. Blending modelling in a process manufacturing: A case study. European Journal of Operational Research, 72(3):460–468, 1994.

[7] W. Baritompa. Customizing methods for global optimization, a geometric viewpoint. Journal of Global Optimization, 3(2):193–212, 1993.

[8] R. Bellman. A Markovian Decision Process. Journal of Mathematics and Mechanics, 6(5), 1957.

[9] J. L. Berenguel, L. G. Casado, I. García, and E. M. T. Hendrix. On estimating workload in interval branch-and-bound global optimization algorithms. Journal of Global Optimization, 2011.

[10] J. L. Berenguel, L. G. Casado, I. García, and E. M. T. Hendrix. On estimating workload in interval branch-and-bound global optimization algorithms. Journal of Global Optimization, 56(3):821–844, 2013.


[11] J. W. M. Bertrand and W. G. M. M. Rutten. Evaluation of three production planning procedures for the use of recipe flexibility. European Journal of Operational Research, 115(1):179–194, 1999.

[12] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst. hwloc: A generic framework for managing hardware affinities in HPC applications. In Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on, pages 180–186, February 2010.

[13] L. G. Casado, E. M. T. Hendrix, and I. García. Infeasibility spheres for finding robust solutions of blending problems with quadratic constraints. Journal of Global Optimization, 39(4):577–593, 2007.

[14] L. G. Casado, J. A. Martínez, I. García, and E. M. T. Hendrix. Branch-and-bound interval global optimization on shared memory multiprocessors. Optimization Methods and Software, 23(5):689–701, 2008.

[15] F. M. G. Casas. Curso de optimización, programación matemática. Ariel, Spain, 1994.

[16] B. Chapman, G. Jost, and R. v. d. Pas. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.

[17] J. Clausen. Parallel branch and bound – principles and personal experiences. In A. Migdalas, P. Pardalos, and S. Storøy, editors, Parallel Computing in Optimization, volume 7 of Applied Optimization, pages 239–267. Springer US, 1997.

[18] J. Claussen and A. Zilinskas. Subdivision, sampling and initialization strategies for simplicial branch and bound in global optimization. Computers & Mathematics with Applications, 44:943–955, 2002.

[19] T. G. Crainic, B. Le Cun, and C. Roucairol. Parallel Branch-and-Bound Algorithms, pages 1–28. John Wiley & Sons, Inc., 2006.

[20] Z. Cui, Y. Liang, K. Rupnow, and D. Chen. An accurate GPU performance model for effective control flow divergence optimization. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 83–94, May 2012.

[21] J. Diaz, C. Munoz-Caro, and A. Nino. A survey of parallel programming models and tools in the multi and many-core era. Parallel and Distributed Systems, IEEE Transactions on, 23(8):1369–1386, August 2012.

[22] A. Djerrah, B. Le Cun, V.-D. Cung, and C. Roucairol. Bob++: Framework for solving optimization problems with branch-and-bound methods. In High Performance Distributed Computing, 2006 15th IEEE International Symposium on, pages 369–370, 2006.

[23] J. Dongarra. Visit to the National University for Defense Technology Changsha, China, 2013.

[24] Z. Drezner and A. Suzuki. The big triangle small triangle method for the solution of nonconvex facility location problems. Operations Research, 52(1):128–135, 2004.

[25] F. A. Escobar, X. Chang, and C. Valderrama. Suitability analysis of FPGAs for heterogeneous platforms in HPC. Parallel and Distributed Systems, IEEE Transactions on, 2015.


[26] J. F. S. Estrada, L. G. Casado, and I. García. Adaptive parallel interval global optimiza-tion algorithms based on their performance for non-dedicated multicore architectures.In Parallel, Distributed and Network-Based Processing (PDP), 2011 19th EuromicroInternational Conference on, pages 252–256, February 2011.

[27] M. Flynn. Some computer organizations and their effectiveness. Computers, IEEETransactions on, C-21(9):948–960, Sept 1972.

[28] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sa-hay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham,and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPIimplementation. In Proceedings, 11th European PVM/MPI Users’ Group Meeting, pages97–104, Budapest, Hungary, September 2004.

[29] F. Galea and B. Le Cun. Bob++: a framework for exact combinatorial optimization methods on parallel machines. In PGCO'2007 as part of the 2007 High Performance Computing & Simulation (HPCS'07) Conference, pages 779–785, Prague, Czech Republic, June 2007.

[30] B. Gendron and T. G. Crainic. Parallel branch-and-bound algorithms: Survey and synthesis. Operations Research, 42(6):1042–1066, 1994.

[31] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing. Addison-Wesley, 2003.

[32] R. Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pages 54–63, Boston, MA, 1989.

[33] R. Gupta and C. Hill. A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming, 18(3):161–180, 1989.

[34] R. Haijema. Solving large structured Markov Decision Problems for perishable inventory management and traffic control. PhD thesis, Universiteit van Amsterdam, 2008.

[35] R. Haijema and E. M. T. Hendrix. Traffic responsive control of intersections with predicted arrival times: A Markovian approach. Computer-Aided Civil and Infrastructure Engineering, 29(2):123–139, 2014.

[36] R. Haijema and J. van der Wal. An MDP decomposition approach for traffic control at isolated signalized intersections. Probability in the Engineering and Informational Sciences, 22:587–602, 2008.

[37] A. Hannukainen, S. Korotov, and M. Křížek. On numerical regularity of the face-to-face longest-edge bisection algorithm for tetrahedral partitions. Science of Computer Programming, 90:34–41, 2014.

[38] P. Hansen, D. Peeters, and J. Thisse. On the location of an obnoxious facility. Sistemi Urbani, 3:299–317, 1981.

[39] E. Hendrix and J. Pintér. An application of Lipschitzian global optimization to product design. Journal of Global Optimization, 1(4):389–401, 1991.

[40] E. M. T. Hendrix, L. G. Casado, and P. Amaral. Global optimization simplex bisection revisited based on considerations by Reiner Horst. In B. Murgante et al., editors, Computational Science and Its Applications – ICCSA 2012, volume 7335 of Lecture Notes in Computer Science, pages 159–173. Springer, 2012.

[41] E. M. T. Hendrix, L. G. Casado, and I. García. The semi-continuous quadratic mixture design problem: Description and branch-and-bound approach. European Journal of Operational Research, 191(3):803–815, 2008.

[42] E. M. T. Hendrix, C. J. Mecking, and T. H. B. Hendriks. Finding robust solutions for product design problems. European Journal of Operational Research, 92(1):28–36, 1996.

[43] J. F. R. Herrera, L. G. Casado, E. M. T. Hendrix, and I. García. A threaded approach of the quadratic bi-blending algorithm. The Journal of Supercomputing, 64(1):38–48, 2013. DOI 10.1007/s11227-012-0783-9.

[44] J. F. R. Herrera, L. G. Casado, E. M. T. Hendrix, R. Paulavičius, and J. Žilinskas. Dynamic and hierarchical load-balancing techniques applied to parallel branch-and-bound methods. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on, pages 497–502, October 2013.

[45] J. F. R. Herrera, L. G. Casado, R. Paulavičius, J. Žilinskas, and E. M. T. Hendrix. On a hybrid MPI-Pthread approach for simplicial branch-and-bound. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1764–1770, May 2013.

[46] J. F. R. Herrera, E. M. T. Hendrix, L. G. Casado, and R. Haijema. Data parallelism in traffic control tables with arrival information. In L. Lopes, J. Žilinskas, A. Costan, R. Cascella, G. Kecskemeti, E. Jeannot, M. Cannataro, L. Ricci, S. Benkner, S. Petit, V. Scarano, J. Gracia, S. Hunold, S. Scott, S. Lankes, C. Lengauer, J. Carretero, J. Breitbart, and M. Alexander, editors, Euro-Par 2014: Parallel Processing Workshops, volume 8805 of Lecture Notes in Computer Science, pages 60–70. Springer International Publishing, 2014.

[47] F. S. Hillier and G. J. Lieberman. Introduction to Operations Research, 9th Ed. McGraw-Hill Higher Education, 2010.

[48] R. Horst. On generalized bisection of n-simplices. Mathematics of Computation, 66(218):691–698, 1997.

[49] R. Horst and H. Tuy. Global Optimization (Deterministic Approaches). Springer, Berlin, 1990.

[50] T. Ibaraki. Theoretical comparisons of search strategies in branch and bound algorithms. International Journal of Computing and Information Sciences, 5(4):315–344, 1976.

[51] T. Ibaraki. Enumerative approaches to combinatorial optimization, Part I. Annals of Operations Research, 10:3–342, January 1988.

[52] M. Jamil and X. Yang. A literature survey of benchmark functions for global optimization problems. International Journal of Mathematical Modelling and Numerical Optimisation, 4(2):150–194, 2013.

[53] B. J. Kubica. A class of problems that can be solved using interval algorithms. Computing, 94(2):271–280, 2012.

[54] T.-H. Lai and S. Sahni. Anomalies in parallel branch-and-bound algorithms. Communications of the ACM, 27(6):594–602, June 1984.

[55] E. L. Lawler and D. E. Wood. Branch-and-bound methods: a survey. Operations Research, 14(4):699–719, 1966.

[56] G.-J. Li and B. W. Wah. Coping with anomalies in parallel branch-and-bound algorithms. IEEE Transactions on Computers, 35(6):568–573, June 1986.

[57] J. Li, G. Tan, and M. Chen. Automatically tuned dynamic programming with an algorithm-by-blocks. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, pages 452–459, December 2010.

[58] J. D. C. Little, K. G. Murty, D. W. Sweeney, and C. Karel. An algorithm for the traveling salesman problem. Operations Research, 11(6):972–989, 1963.

[59] M. S. van den Broek, J. S. H. van Leeuwaarden, I. J. B. F. Adan, and O. J. Boxma. Bounds and approximations for the fixed-cycle traffic-light queue. Transportation Science, 40(4):484–496, 2006.

[60] J. A. Martínez, L. G. Casado, J. A. Alvarez, and I. García. Interval parallel global optimization with Charm++. In J. Dongarra, K. Madsen, and J. Wasniewski, editors, Applied Parallel Computing. State of the Art in Scientific Computing, volume 3732 of Lecture Notes in Computer Science, pages 161–168. Springer Berlin / Heidelberg, 2006.

[61] C. C. Meewella and D. Q. Mayne. An algorithm for global optimization of Lipschitz continuous functions. Journal of Optimization Theory and Applications, 57:307–322, 1988.

[62] J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Computer Architecture News, 38(3):235–246, June 2010.

[63] T. Menouer and B. Le Cun. Anticipated dynamic load balancing strategy to parallelize constraint programming search. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1771–1777, May 2013.

[64] T. Menouer and B. Le Cun. A parallelization mixing OR-Tools/Gecode solvers on top of the Bobpp framework. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on, pages 242–246, October 2013.

[65] T. Menouer and B. Le Cun. Adaptive N to P portfolio for solving constraint programming problems on top of the parallel Bobpp framework. In Parallel Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, pages 1531–1540, May 2014.

[66] L. G. Mitten. Branch and bound methods: general formulation and properties. Operations Research, 18(1):24–34, 1970.

[67] R. H. Mladineo. An algorithm for finding the global maximum of a multimodal multivariate function. Mathematical Programming, 34:188–200, 1986.

[68] R. Nanjegowda, O. Hernandez, B. Chapman, and H. Jin. Scalability evaluation of barrier algorithms for OpenMP. In M. Müller, B. de Supinski, and B. Chapman, editors, Evolving OpenMP in an Age of Extreme Parallelism, volume 5568 of Lecture Notes in Computer Science, pages 42–52. Springer Berlin Heidelberg, 2009.

[69] A. Neumaier. Complete search in continuous global optimization and constraint satisfaction. Acta Numerica, 13:271–369, 2004.

[70] G. F. Newell. Approximation methods for queues with application to the fixed-cycle traffic light. SIAM Review, 7(2):223–240, 1965.

[71] D. Padua, editor. Encyclopedia of Parallel Computing. Springer US, 2011.

[72] M. Papageorgiou, C. Diakaki, V. Dinopoulou, A. Kotsialos, and Y. Wang. Review of road traffic control strategies. Proceedings of the IEEE, 91(12):2043–2067, 2003.

[73] R. Paulavičius and J. Žilinskas. Improved Lipschitz bounds with the first norm for function values over multidimensional simplex. Mathematical Modelling and Analysis, 13:553–563, 2008.

[74] R. Paulavičius and J. Žilinskas. Global optimization using the branch-and-bound algorithm with a combination of Lipschitz bounds over simplices. Technological and Economic Development of Economy, 15(2):310–325, 2009.

[75] R. Paulavičius and J. Žilinskas. Simplicial Global Optimization. SpringerBriefs in Optimization. Springer New York, 2014.

[76] R. Paulavičius, J. Žilinskas, and A. Grothey. Investigation of selection strategies in branch and bound algorithm with simplicial partitions and combination of Lipschitz bounds. Optimization Letters, 4(2):173–183, 2010.

[77] R. Paulavičius, J. Žilinskas, and A. Grothey. Parallel branch and bound for global optimization with combination of Lipschitz bounds. Optimization Methods and Software, 26(3):487–498, 2011.

[78] J. Pintér. Lipschitzian global optimization: Some prospective applications. In C. A. Floudas and P. M. Pardalos, editors, Recent Advances in Global Optimization, pages 399–432. Princeton University Press, Princeton, NJ, USA, 1992.

[79] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.

[80] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.

[81] R. Reyes, I. López-Rodríguez, J. J. Fumero, and F. de Sande. accULL: an OpenACC implementation with CUDA and OpenCL support. In C. Kaklamanis, T. Papatheodorou, and P. Spirakis, editors, Euro-Par 2012 Parallel Processing, volume 7484 of Lecture Notes in Computer Science, pages 871–882. Springer Berlin Heidelberg, 2012.

[82] R. Sakellariou and J. R. Gurd. Compile-time minimisation of load imbalance in loop nests. In Proceedings of the 11th International Conference on Supercomputing, ICS '97, pages 277–284, New York, NY, USA, 1997. ACM.

[83] J. F. Sanjuan-Estrada, L. G. Casado, and I. García. Adaptive parallel interval branch and bound algorithms based on their performance for multicore architectures. The Journal of Supercomputing, 58(3):376–384, 2011.

[84] M. Scott and J. Mellor-Crummey. Fast, contention-free combining tree barriers for shared-memory multiprocessors. International Journal of Parallel Programming, 22(4):449–481, 1994.

[85] L. Smith and M. Bull. Development of mixed mode MPI/OpenMP applications. Scientific Programming, 9(2-3):83–98, 2001.

[86] P. Stenstrom. Reducing contention in shared-memory multiprocessors. Computer, 21(11):26–37, 1988.

[87] M. J. Todd. The computation of fixed points and applications, volume 24 of Lecture Notes in Economics and Mathematical Systems. Springer-Verlag, 1976.

[88] J. S. H. van Leeuwaarden. Delay analysis for the fixed-cycle traffic-light queue. Transportation Science, 40(2):189–199, 2006.

[89] H. P. Williams. Model Building in Mathematical Programming. Wiley & Sons, Chichester, 1993.

[90] D. Wingate and K. D. Seppi. P3VI: A partitioned, prioritized, parallel value iterator. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, pages 109–, New York, NY, USA, 2004. ACM.

List of Figures

1.1 Global and local optima in a two-dimensional function . . . 2
1.2 Three-dimensional representation of a function with multiple local minima . . . 4
1.3 Pareto front . . . 5
1.4 Stochastic decision tree . . . 11
1.5 Memory hierarchy . . . 14
1.6 Memory map for a BullX-UAL node . . . 19

2.1 2D and 3D simplices removing the minimum dose region . . . 29
2.2 Rejection by domination, Pareto optimality . . . 31
2.3 Example of division according to the raw material costs . . . 33
2.4 Case1 & Case2 with a capacity restriction in RM1 (B1 = 0.86) . . . 40
2.5 Case3 & Case4 with a capacity restriction in RM1 (B1 = 0.86) . . . 42

4.1 Division of a hypercube into six irregular simplices . . . 53
4.2 Longest Edge Bisection, denoting a sub-simplex with three longest edges . . . 54

5.1 Speedup and CPU usage histogram for Bobpp, TBB and Pthread model . . . 71

6.1 Traffic infrastructures . . . 83

7.1 State ordering for instance n. 17 . . . 93
7.2 Message passing in lights (column-wise) distribution for F4C2 . . . 94


List of Tables

2.1 Numerical results of three-dimensional cases . . . 41
2.2 Numerical results of five-dimensional cases . . . 44
2.3 Numerical results of seven-dimensional cases . . . 45
2.4 Numerical results for the iterative-descending method . . . 45

3.1 Computational effort of the B&B phase using two working threads . . . 49
3.2 Speedup obtained in the Combination phase . . . 50

4.1 Test instances for dimension n = 4, 5, 6, and the corresponding Kn values . . . 58
4.2 Test functions using Lipschitz bounds . . . 58
4.3 Comparison between hybrid and depth-first search . . . 59
4.4 Experimental results using LEB1 . . . 60
4.5 Experimental results for n = 4 using K-bounds . . . 61
4.6 Experimental results for n = 5 using K-bounds . . . 62
4.7 Experimental results for n = 6 using K-bounds . . . 62
4.8 Experimental results using Lipschitz bounds . . . 63

5.1 Test-bed sequentially solved using Bobpp and custom-made version . . . 72
5.2 Elapsed time of the Bobpp version varying the number of priority queues . . . 73
5.3 Speedup of Pthread and TBB versions varying the number of threads . . . 74
5.4 MPI-Pthread approach using 16 threads per MPI process . . . 75
5.5 Fully-dynamic approach sharing upper bound and work . . . 76

6.1 Numerical results for the F4C2 infrastructure . . . 88
6.2 Numerical results for the I2F2C2 infrastructure . . . 89

7.1 Speedup of Pthread and MPI-Pthread versions of TCT generation . . . 94

List of Algorithms

2.1 Branch and Bound algorithm for the QBB problem . . . 32
2.2 Combination algorithm . . . 37
2.3 Iterative α-descending B&B algorithm for the QBB problem . . . 38

4.1 Simplicial B&B algorithm, bisection . . . 52

6.1 Value Iteration by Backward Induction . . . 86
