
Boletín de Estadística e Investigación Operativa, Vol. 34, No. 2, July 2018

Contents

Editorial
Antonio Jiménez Martín

Statistics
Quantile regression: estimation and lack-of-fit tests
Mercedes Conde-Amboage, Wenceslao González-Manteiga and César Sánchez-Sellero

Operations Research
An ABC algorithm for solving the post-disaster resources distribution problem, a case study
Henry Lamos, Karin Aguilar, Daniel Martínez, Andrés Barrera and Angie Hernández

Official Statistics
Quality implications of the use of big data in tourism statistics: three exploratory examples
Fernando Cortina García, Jesús Prado Mascuñano, María Izquierdo Valverde and María Velasco Gimeno

History and Teaching
Multivariate continuous probability distributions and partial differential equations: A simple and nice connection
Julia Calatayud Gregori, Juan Carlos Cortés López and Marc Jornet Sanz

Opinions on the Profession
Natural Language Parsing: Progress and Challenges
Carlos Gómez-Rodríguez

© 2018 SEIO

Boletín de Estadística e Investigación Operativa, Vol. 34, No. 2, July 2018, pp. 92-96

Editorial

Antonio Jiménez Martín

Departamento de Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid

[email protected]

Decision making is the study of identifying and choosing alternatives based on the values and preferences of a decision maker or, equivalently, the process of reducing uncertainty about certain alternatives so that a reasonable choice can be made among them (see [9]).

Multicriteria analysis is a decision-support tool that integrates different criteria, according to the opinions of the actors involved, into a single framework of analysis in order to give a comprehensive view. Its principles derive from matrix theory, graph theory, organization theory, measure theory, Operations Research and Economics. It is an activity that helps in making decisions, mainly in terms of choosing, ranking and classifying alternatives.

The origins of multicriteria decision techniques lie in the twentieth century. From the 1950s onwards, studies began to address the multicriteria approach in decision processes, such as those of Koopmans [14], which developed the concept of an efficient vector, or those of Kuhn and Tucker [15], who, in addition to formulating the optimality conditions in nonlinear programming, considered problems with multiple objectives.

In 1955 Charnes, Cooper and Ferguson published [4], which contains the essence of goal programming and later led to the publication of the book by Charnes and Cooper [3]. Also noteworthy are the contributions of Howard and Kimball, in their 1959 book on sequential decision processes, and of Contini and Zionts, who developed a multicriteria negotiation model in 1968. Zionts, in collaboration with Wallenius and Korhonen, later developed, in the late 1970s, several interactive methods for solving multiobjective linear programming problems.

On the other hand, in the mid-1960s Roy and his colleagues developed ELECTRE, a family of multicriteria decision analysis methods based on the fundamental axiom of partial comparability; Keeney and Raiffa published in 1976 a book that was fundamental to the establishment of multiattribute value/utility theory; and Saaty proposed in the 1970s the analytic hierarchy process (AHP), which is characterized by modelling the problem through a hierarchical structure, using pairwise comparisons to incorporate the decision maker's preferences, and using a ratio scale that is valid for complex decision making.

Multicriteria techniques (see [6]) can be classified according to the set of alternatives that the decision maker considers when searching for an optimal solution. If the set of alternatives is finite, the decision method is discrete. On the other hand, when the problem can take an infinite number of values and thus leads to an infinite number of possible alternatives, it is referred to as multiobjective decision making.

Within discrete multicriteria decision making, three main groups or families can be distinguished: first, the methods based on Multi-Attribute Utility Theory (MAUT) (see [7, 12, 21]), characteristic of the American School; second, the so-called outranking methods (see [1, 2, 19]), characteristic of the French-Belgian School; and third, the Analytic Hierarchy Process (AHP) (see [20]), developed in the 1970s by Thomas Saaty.

In multiobjective decision making, we can distinguish between multiobjective optimization methods and satisficing methods (goal programming, see [4, 18]). In turn, multiobjective optimization methods comprise methods for generating the whole efficient set (see [23, 24]) and methods for obtaining a compromise solution (see [22, 24, 25]).

Finally, it is worth noting the more recent use of metaheuristics (see [8, 17]) and fuzzification (see [11]) in solving multiobjective optimization problems.

On the other hand, when decisions are not made individually but involve multiple decision makers, each with their own beliefs, preferences and perceptions of the issue at hand, all of these must be considered jointly in the final decision. In general, the collective nature of this process enriches the analysis and improves the quality of the decision, since each member involved holds relevant information about the problem and, consequently, the chance of overlooking certain events or alternatives is lower than in the individual case. This group decision-making setting (see [10]) has led to the development of aggregation processes (aggregation of preference and probabilistic judgements, aggregation of structured behaviours (Delphi method, see [5]) and unstructured behaviours) and of negotiation processes (see [16]).

The Working Group on Multicriteria Decision Making (GTDM-SEIO) was created in 1999 by the members of the Grupo Español de Decisión Multicriterio (http://multicriterio.es) who belonged to the SEIO. It currently comprises 60 researchers from 17 Spanish universities.

The GTDM-SEIO aims to bring together all SEIO researchers working on multicriteria decision making, with the objectives of promoting communication and research among them; strengthening the profile of multicriteria decision making within Statistics and Operations Research; establishing new relationships with other scientific societies, such as the various EURO Working Groups (MCDA, ECCO, DSS, ME...), the International Society on Multiple Criteria Decision Making (MCDM) or the Grupo Español de Decisión Multicriterio; organizing specialized sessions at SEIO conferences; and making the group a national reference.

It is worth highlighting the participation in the GTDM-SEIO of prestigious research teams working on different lines of research within multicriteria decision making, both continuous (metaheuristics, multiobjective optimization, goal programming...) and discrete (AHP, MAUT, outranking methods...), and both theoretical and applied (environmental evaluation and selection, valuation, finance, logistics, energy, engineering...). These groups are enriched and kept up to date by joint action that brings them into contact with one another, so that each can contribute its perspective on the topics addressed by the others.

The GTDM-SEIO has a close relationship with the Grupo Español de Decisión Multicriterio, which comprises around 247 researchers from almost thirty Spanish universities, belonging to a wide range of fields (Statistics and Operations Research, Applied Economics, Quantitative Methods for Economics and Business, Computer Science and Artificial Intelligence, Project Engineering, Business Organization, Agricultural Economics, etc.).

According to data from the International Society on MCDM and the European Working Group on Multicriteria Decision Analysis, research activity in multicriteria decision making in Spain holds a position of world leadership, both in scientific productivity and in human resources. As a result of this international recognition, some of its members have held positions of responsibility in international working groups and have received international awards for their research.

References

[1] Brans, J.P. (1982), L'ingénierie de la décision: élaboration d'instruments d'aide à la décision. La méthode PROMETHEE, In: L'aide à la décision: Nature, Instruments et Perspectives d'Avenir (Nadeau and Landry, eds.), 183-213. Presses de l'Université Laval, Québec.

[2] Brans, J.P., Vincke, P. (1985), A preference ranking organisation method: The PROMETHEE method for MCDM, Management Science 31, 647-656.

[3] Charnes, A., Cooper, W.W. (1961), Management models and industrial applications of linear programming. John Wiley & Sons, New York.

[4] Charnes, A., Cooper, W.W., Ferguson, R.O. (1955), Optimal Estimation of Executive Compensation by Linear Programming, Management Science 1, 103-194.

[5] Dalkey, N.C. (1969), The Delphi method: an experimental study of group opinion, RM-5888-PR. The Rand Corporation, Santa Monica.

[6] Figueira, J., Greco, S., Ehrgott, M. (Eds.) (2005), Multiple criteria decision analysis: State of the art surveys. Springer, Berlin.

[7] Fishburn, P.C. (1970), Utility theory for decision making. John Wiley & Sons, New York.

[8] Glover, F., Kochenberger, G.A. (2003), Handbook of metaheuristics. International Series in Operations Research & Management Science 57, Springer, Berlin.

[9] Harris, R. (1998), Introduction to decision making, VirtualSalt. http://www.virtualsalt.com/crebook5.htm

[10] Hwang, C.-L., Lin, M.-J. (1987), Group decision making under multiple criteria, Lecture Notes in Economics and Mathematical Systems 281, Springer, Berlin.

[11] Kahraman, C. (Ed.) (2008), Fuzzy multi-criteria decision making. Theory and applications with recent developments. Springer, Berlin.

[12] Keeney, R.L., Raiffa, H. (1993), Decisions with multiple objectives: preferences and value tradeoffs. Cambridge University Press, New York.

[13] Kilgour, M., Eden, C. (Eds.) (2010), Handbook of group decision and negotiation, Advances in Group Decision and Negotiation 4, Springer, Berlin.

[14] Koopmans, T. (1951), Activity analysis of production and allocation. John Wiley & Sons, New York.

[15] Kuhn, H.W., Tucker, A.W. (1951), Nonlinear programming, Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 481-492. University of California Press.

[16] Raiffa, H. (1982), The art and science of negotiation. Harvard University Press, Cambridge.

[17] Ribeiro, C.C., Hansen, P. (Eds.) (2002), Essays and surveys in metaheuristics. Kluwer Academic Publishers, Dordrecht.

[18] Romero, C. (1991), Handbook of Critical Issues in Goal Programming. Pergamon Press, Oxford.

[19] Roy, B. (1968), Classement et choix en présence de points de vue multiples (la méthode ELECTRE), La Revue d'Informatique et de Recherche Opérationnelle 8, 57-75.

[20] Saaty, T.L. (1980), The Analytic Hierarchy Process. McGraw-Hill, New York.

[21] Von Neumann, J., Morgenstern, O. (1947), Theory of games and economic behavior. Princeton University Press, Princeton.

[22] Yu, P.L. (1985), Multiple criteria decision making: Concepts, techniques and extensions. Plenum, New York.

[23] Zadeh, L.A. (1963), Optimality and non-scalar-valued performance criteria. IEEE Transactions on Automatic Control 8, 59-60.

[24] Zeleny, M. (1973), Compromise programming, In: Multiple Criteria Decision Making (Cochrane and Zeleny, eds.). University of South Carolina Press, 262-301.

[25] Zeleny, M. (1974), A concept of compromise solutions and the method of the displaced ideal. Computers and Operations Research 1, 479-496.


Statistics

Quantile regression: estimation and lack-of-fit tests

Mercedes Conde-Amboage, Wenceslao González-Manteiga and César Sánchez-Sellero

Department of Statistics, Math. Analysis and Optimization, Universidade de Santiago de Compostela

[email protected], [email protected], [email protected]

Abstract

Although mean regression achieved its greatest diffusion in the twentieth century, it is surprising to observe that the ideas of quantile regression appeared earlier. While the beginning of least-squares regression can be dated to 1805 with the work of Legendre, in the mid-eighteenth century Boscovich had already fitted data on the ellipticity of the Earth using concepts of quantile regression.

Quantile regression is employed when the aim of the study is centred on the estimation of different positions (quantiles). This kind of regression allows a more detailed description of the behaviour of the response variable, adapts to situations under more general conditions on the error distribution and enjoys robustness properties. For all these reasons, quantile regression is a very useful statistical technique for a wide range of disciplines. This paper presents a review of quantile regression methods.

Keywords: Quantile regression, Estimation, Lack-of-fit tests, Robustness, Sparsity.

AMS Subject classifications: 62J05, 62G08, 62F35, 62F03.

1. Introduction

Given a random variable $X$, for each $0 < \tau < 1$ its τ-th quantile, denoted by $c_\tau$, is defined as a value that satisfies

\[ P_X(X \le c_\tau) \ge \tau \quad \text{and} \quad P_X(X \ge c_\tau) \ge 1 - \tau. \]

The quantile function of a probability distribution is then given by the inverse of the cumulative distribution function. More formally, the quantile function is defined as

\[ F_X^{-1}(\tau) = \inf \{ x \in \mathbb{R} : \tau \le F_X(x) \}, \]

where $\inf\{A\}$ denotes the infimum of a set $A$. The infimum acts as a criterion for choosing a single quantile when the definition in terms of probabilities admits more than one solution.

Quantiles can be computed as the solution of an optimization problem. First, let us define the quantile loss function as the following piecewise linear function:

\[ \rho_\tau(u) = u\,\bigl(\tau - I(u < 0)\bigr) = \begin{cases} u\,\tau & \text{if } u \ge 0, \\ u\,(\tau - 1) & \text{if } u < 0, \end{cases} \]

where I denotes the indicator function of an event. Figure 1 shows the quantile loss function for different values of τ. Note that the quantile loss function is not differentiable at zero, so standard gradient-based numerical algorithms cannot be applied directly. For this reason, most of the theory developed for mean estimation cannot be applied in this context.
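As a minimal illustration of the definition above, the following Python sketch evaluates the quantile loss on a small grid for the three values of τ used in Figure 1. The function name `quantile_loss` is chosen here for illustration and does not come from the paper.

```python
def quantile_loss(u: float, tau: float) -> float:
    """Quantile (check) loss rho_tau(u) = u * (tau - I(u < 0))."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    indicator = 1.0 if u < 0 else 0.0
    return u * (tau - indicator)


# Evaluate the loss on a small grid for three quantile levels.
for tau in (0.25, 0.50, 0.75):
    values = [quantile_loss(u, tau) for u in (-1.0, -0.5, 0.0, 0.5, 1.0)]
    print(f"tau = {tau}: {values}")
```

For τ = 0.5 the loss is symmetric (half the absolute value), while for other values of τ positive and negative deviations are penalized with different slopes.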

Figure 1: Representation of the quantile loss function for three different values of the τ-th quantile of interest: τ = 0.25 (Part a), τ = 0.50 (Part b) and τ = 0.75 (Part c).

Thereupon, for each $\tau \in (0, 1)$, the τ-th quantile $c_\tau$ can be written as

\[ c_\tau = \arg\min_{x} E\bigl[\rho_\tau(X - x)\bigr]. \]

In practice, the cumulative distribution function $F$ is replaced by the empirical distribution function $F_n$. So, given a random sample $\{X_1, \dots, X_n\}$ of the variable $X$, the sample quantiles can be computed as

\[ \hat{c}_\tau = \arg\min_{c} \int \rho_\tau(x - c)\, dF_n(x) = \arg\min_{c} \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(X_i - c) \tag{1.1} \]

for each τ ∈ (0, 1).
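Because the objective in (1.1) is convex and piecewise linear in c, with kinks at the observations, a minimizer can always be found among the sample values. The sketch below relies on that fact and uses a brute-force search, which is only reasonable for small samples. It returns the smallest minimizer, which agrees with the infimum convention of the quantile function. The helper names are illustrative, not taken from the paper.

```python
def quantile_loss(u, tau):
    """Quantile loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sample_quantile(sample, tau):
    """Smallest sample value minimizing the empirical objective in (1.1)."""
    if not sample:
        raise ValueError("sample must not be empty")

    def objective(c):
        return sum(quantile_loss(x - c, tau) for x in sample) / len(sample)

    candidates = sorted(set(sample))
    best = min(objective(c) for c in candidates)
    # The minimum may be attained on a whole interval; a small tolerance
    # lets ties resolve to the smallest minimizer despite rounding.
    return next(c for c in candidates if objective(c) <= best + 1e-12)


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(sample_quantile(data, 0.5), sample_quantile(data, 0.25))
```

With an even sample size and τ = 0.5, every point between the two central order statistics minimizes the objective; the function reports the lower one.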

The problem of finding the τ-th sample quantile can be reformulated as a linear programming problem. A more complete explanation of this optimization problem can be found in Section 1.1.2 of [12]. In practice, there exist several methods for computing sample quantiles, and a clear review of the options available in the R language is given in [24].

The asymptotic distribution of the sample quantile $\hat{c}_\tau$ can be derived as a consequence of Lindeberg's central limit theorem. This result is stated in Theorem 1.1, and its proof is detailed in several classical works on statistical inference; see for instance [6].

Theorem 1.1. Let $X$ be a random variable whose cumulative distribution function $F_X$ is absolutely continuous in a neighbourhood of the τ-th quantile of interest, $c_\tau$, with $f_X(c_\tau) > 0$. Then the asymptotic distribution of the sample quantile, $\hat{c}_\tau$, is given by

\[ \sqrt{n}\,(\hat{c}_\tau - c_\tau) \xrightarrow{d} N(0, \omega^2), \]

where $\omega^2 = \tau(1 - \tau)/f_X^2(c_\tau)$, $N(0, \omega^2)$ denotes the Gaussian distribution with zero mean and variance $\omega^2$, and $\xrightarrow{d}$ denotes convergence in distribution.

According to the asymptotic distribution of the sample quantile, the inverse of the density evaluated at the quantile, known as the sparsity, plays a crucial role in this context. A complete description of the sparsity function is presented in Section 4.
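To make the role of the sparsity concrete, the following sketch evaluates the asymptotic variance ω² = τ(1 − τ)/f²(c_τ) of Theorem 1.1 under the assumption that X is standard Gaussian. The distribution and the function names are choices made here for illustration only.

```python
from statistics import NormalDist


def sparsity(tau, dist=NormalDist()):
    """Inverse of the density evaluated at the tau-th quantile of `dist`."""
    return 1.0 / dist.pdf(dist.inv_cdf(tau))


def asymptotic_variance(tau, dist=NormalDist()):
    """omega^2 = tau * (1 - tau) / f(c_tau)^2 from Theorem 1.1."""
    return tau * (1.0 - tau) * sparsity(tau, dist) ** 2


for tau in (0.1, 0.5, 0.9):
    print(tau, round(asymptotic_variance(tau), 4))
```

In this Gaussian example the variance is smallest at the median, where the density is highest, and grows towards the tails, where observations are sparser.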

It is well known that quantile methods are more robust than classical least squares estimation. To show this, we focus on the influence function, introduced by [20]. The influence function describes the effect of an anomalous sample point on a given estimator. More formally, the influence function is defined by

\[ IF(y, \gamma, F) = \lim_{t \to 0} \frac{\gamma(F_t) - \gamma(F)}{t}, \]

where $\gamma(F)$ represents an estimator that depends on a distribution $F$, and $F_t = (1 - t)F + t\,\delta_y$, where $\delta_y$ denotes the distribution function that assigns mass 1 to the contaminating point $y$.

So, the influence function associated with the mean estimator (denoted by $\mu$) is given by

\[ IF(y, \mu, F) = \lim_{t \to 0} \frac{\mu(F_t) - \mu(F)}{t} = y - \mu(F), \]

while the influence function of the median estimator (denoted by $c_{0.5}$) is given by

\[ IF(y, c_{0.5}, F) = \lim_{t \to 0} \frac{c_{0.5}(F_t) - c_{0.5}(F)}{t} = \frac{0.5\,\operatorname{sgn}(y - c_{0.5}(F))}{f(c_{0.5}(F))}, \]

where sgn denotes the sign function.

There is a fundamental difference between the two influence functions. While the influence function of the mean grows linearly with y and is therefore unbounded, the influence of contamination at y on the median is bounded by a constant proportional to the sparsity at the median. Figure 2 compares the influence functions of the mean and median estimators for a standard Gaussian distribution F. Note the fragility of the mean and the robustness of the median in withstanding contamination by outlying observations. Much of what has been said extends immediately to quantiles for any τ, and from them to quantile regression.

Figure 2: Influence function associated with mean and median estimators, where F is a standard Gaussian distribution.
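The two influence functions compared above can be evaluated directly. The sketch below assumes a standard Gaussian F, as in Figure 2; the function names are illustrative.

```python
from statistics import NormalDist


def influence_mean(y, dist=NormalDist()):
    """Influence function of the mean: y - mu(F)."""
    return y - dist.mean


def influence_median(y, dist=NormalDist()):
    """Influence function of the median: 0.5 * sgn(y - c) / f(c)."""
    median = dist.inv_cdf(0.5)
    sign = (y > median) - (y < median)  # evaluates to -1, 0 or 1
    return 0.5 * sign / dist.pdf(median)


for y in (-3.0, 0.0, 3.0, 1000.0):
    print(y, influence_mean(y), round(influence_median(y), 4))
```

Moving the contaminating point from 3 to 1000 multiplies the influence on the mean accordingly, whereas the influence on the median does not change.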

Taking into account the good properties of sample quantiles, we extend these ideas to a regression context from a parametric (see Section 2) and a nonparametric (see Section 3) perspective. Section 4 presents different sparsity estimators, because of the fundamental role of the sparsity in the quantile regression context. Section 5 gives an introduction to lack-of-fit tests for quantile regression. Finally, Section 6 presents some conclusions.

2. Parametric quantile regression

Now, our main goal is to extend the theory developed in the previous section to the regression context. For simplicity, let us consider the following linear regression model:

\[ Y_i = \theta_\tau' P_i + \varepsilon_i, \tag{2.1} \]

where $P_i = (1, X_i)$ and $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ is a random sample of the response variable $Y \in \mathbb{R}$ and the explanatory variable $X \in \mathbb{R}^d$. Moreover, the errors $\varepsilon_i$ satisfy $P(\varepsilon_i \le 0 \mid X = X_i) = \tau$, that is, their conditional τ-th quantile is zero. Note that this is analogous to assuming that $E(\varepsilon_i \mid X = X_i) = 0$ in the classical least squares context.

If the conditional quantile function is defined by $q_\tau(x) = \theta_\tau'(1, x)$, then in view of (1.1) it is reasonable to consider the estimator $\hat{\theta}_\tau$ obtained as the solution of the following optimization problem:

\[ \hat{\theta}_\tau = \arg\min_{\theta \in \mathbb{R}^{d+1}} \sum_{i=1}^{n} \rho_\tau(Y_i - \theta' P_i). \tag{2.2} \]

This idea was introduced by [26], and subsequently [14] proved the consistency of the quantile regression estimator.

Following the ideas described in Section 1, the estimator $\hat{\theta}_\tau$ can be obtained as the solution of the following linear optimization problem:

\[ \min_{(\theta, u, v) \in \mathbb{R}^{d+1} \times \mathbb{R}^{2n}_{+}} \bigl\{ \tau\, 1_n' u + (1 - \tau)\, 1_n' v : \mathbf{X}\theta + u - v = Y \bigr\}, \tag{2.3} \]

where $\mathbf{X}$ denotes the regression design matrix, an $n \times (d + 1)$ matrix whose $j$-th row is given by $(1, X_j)'$, and $1_n$ represents an n-dimensional vector of ones. The residual vector $Y - \mathbf{X}\theta$ has been split into its positive and negative parts ($u$ and $v$, respectively).
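A basic optimal solution of a linear program such as (2.3) leaves at least d + 1 residuals equal to zero, so with a single covariate (d = 1) an optimal line can be found among the lines passing through pairs of observations. The sketch below uses that property in a brute-force search over all pairs. It is meant only to illustrate the structure of the problem on tiny data sets; it is not one of the simplex-type algorithms cited in this section, and all names are illustrative.

```python
from itertools import combinations


def quantile_loss(u, tau):
    """Quantile loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def fit_line_quantile(xs, ys, tau):
    """Return (intercept, slope, loss) minimizing (2.2) for one covariate.

    Every line through two observations with distinct x values is tried,
    and the one with the smallest total quantile loss is kept.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid candidate
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(quantile_loss(y - intercept - slope * x, tau)
                   for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two observations with distinct x")
    return best[1], best[2], best[0]


# Four points on the line y = 1 + 2x plus one vertical outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 100.0]
print(fit_line_quantile(xs, ys, 0.5))
```

In this toy data set the median regression line ignores the outlying response at x = 4 and recovers the line through the remaining four points.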

Formulating the computation of the quantile regression parameter as a linear optimization problem is crucial because it gives rise to different methods for computing $\hat{\theta}_\tau$. Along these lines, [3] proposed a modified version of the simplex method to solve the optimization problem associated with τ = 0.5, in which case the quantile loss function is proportional to the absolute value. It is important to emphasize that the proposal of [3] substantially reduces the computational time needed to compute the estimator for τ = 0.5 compared with the original simplex algorithm. Later, [27] extended this development to every quantile 0 < τ < 1.

Since quantile regression estimators do not have an explicit expression, it is necessary to resort to asymptotic expressions such as Bahadur's representation. Defining $\psi_\tau(r) = \tau I(r > 0) + (\tau - 1) I(r < 0)$, [2] established that

\[ \sqrt{n}\,(\hat{\theta}_\tau - \theta_\tau) = D_1^{-1}\, n^{-1/2} \sum_{i=1}^{n} P_i\, \psi_\tau(Y_i - \theta_\tau' P_i) + O_p\bigl(n^{-1/4} \sqrt{\log n}\bigr), \]

under certain regularity conditions.

Unlike the least squares estimator, the distribution of the quantile estimator is not generally known, even under error normality. [25] showed the following result on the asymptotic distribution of quantile regression estimators.

Theorem 2.1. Let us consider a linear model as given in (2.1). Under the following conditions:

Condition A1. The conditional distribution functions $F_i$ (of $Y_i$ given $X_i$) are absolutely continuous, with continuous density functions $f_i$ uniformly bounded away from 0 and ∞ at the conditional quantiles $c_i(\tau)$.

Condition A2. There exist positive definite matrices $D_0$ and $D_1(\tau)$ such that

1. $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} P_i P_i' = D_0$,

2. $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f_i(c_i(\tau))\, P_i P_i' = D_1(\tau)$,

3. $\max_{i=1,\dots,n} \|X_i\| / \sqrt{n} \to 0$,

it follows that

\[ \sqrt{n}\,(\hat{\theta}_\tau - \theta_\tau) \xrightarrow{d} N\bigl(0,\ \tau(1 - \tau)\, D_1(\tau)^{-1} D_0\, D_1(\tau)^{-1}\bigr). \]

Again, in view of Theorem 2.1, it is clear that the sparsity function plays an important role. Furthermore, in a regression context, quantile methods still enjoy robustness properties. [11] (page 106) showed that the influence function associated with the least squares estimator (denoted by $\theta_{LS}$) is given by

\[ IF((x, y), \theta_{LS}, F) = E(XX')^{-1} (1, x)\, \bigl(y - \theta_{LS}(F)'(1, x)\bigr), \]

where $F$ represents the distribution function of the random vector $(X, Y)$ and the pair $(x, y)$ denotes a new observation. In this case, the influence function can be split into two factors,

\[ IP(x, \theta_{LS}, F_X) = E(XX')^{-1} (1, x), \]

\[ IR(r, \theta_{LS}, F_\varepsilon) = r = y - \theta_{LS}(F)'(1, x), \]

where $F_X$ represents the marginal distribution of the explanatory variable, $F_\varepsilon$ denotes the error distribution and $r = y - \theta_{LS}(F)'(1, x)$ represents the residual associated with the pair $(x, y)$.

In this sense, the factor IP represents the influence of the new observation

x. This is closely related to the well-known leverage problem in the regression

context. In addition, the factor IR contains the influence of the residual, that

is, the effect of a deviation of the response variable y.

Considering now the quantile regression estimator (see equation (2.2)), the influence function can be split into the following two parts:

\[ IP(x, \theta_\tau, F_X) = E(XX')^{-1} (1, x), \]

\[ IR(r, \theta_\tau, F_\varepsilon) = \operatorname{sgn}(r) = \operatorname{sgn}\bigl(y - \theta_\tau(F)'(1, x)\bigr). \]

Thus, the influence due to the new observation $x$ matches that of the least squares estimator, while the influence due to the residual coincides with the influence of the quantile estimator without covariates.

It can then be established that quantile regression can correct robustness problems due to vertical deviations (that is, related to the response variable), but not those caused by horizontal deviations (that is, related to the explanatory variables). To control both factors of the influence function, it would be necessary to introduce generalized M-estimators, which were studied by [31]. Moreover, other kinds of robust estimators have been considered, such as least median of squares regression, proposed by [34], or regression depth, proposed by [35].

We have focused on linear quantile regression, but all the ideas presented in this section can be extended to a nonlinear context. Let us consider the following regression scenario:

\[ Y_i = q_\tau(X_i, \theta_\tau) + \varepsilon_i, \]

where the function $q_\tau$ is known apart from the parameter $\theta_\tau$ and $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ represents a random sample of the variables $(X, Y) \in \mathbb{R}^{d+1}$. Moreover, the conditional τ-quantile of the errors is zero. In this context, we can consider the following estimator:

\[ \hat{\theta}_\tau = \arg\min_{\theta \in \mathbb{R}^q} \sum_{i=1}^{n} \rho_\tau\bigl(Y_i - q_\tau(X_i, \theta)\bigr). \tag{2.4} \]

Section 4.5 of [25] presents the asymptotic behaviour of estimator (2.4). This result is an extension of Theorem 2.1. Moreover, in some situations more flexible approaches are needed, which calls for the nonparametric techniques introduced in Section 3.


3. Nonparametric quantile regression

All the methodology developed in the previous section can be extended to a nonparametric context. Along these lines, [7] and [8] can be considered seminal works. In this section we focus on local linear smoothing techniques.

Let us consider the regression scenario

\[ Y = q_\tau(X) + \varepsilon, \]

where the conditional τ-quantile of the error given the covariate is zero. Given a random sample of independent observations $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ of the pair $(X, Y) \in \mathbb{R}^2$, a nonparametric estimator of the conditional quantile can be defined as $\hat{q}_{\tau, h_\tau}(x) = \hat{a}$, where $\hat{a}$ and $\hat{b}$ are the minimizers of

\[ \sum_{i=1}^{n} \rho_\tau\bigl(Y_i - a - b(X_i - x)\bigr)\, K\!\left(\frac{X_i - x}{h_\tau}\right), \]

where $K$ is a kernel function (usually a symmetric density) and $h_\tau$ represents a bandwidth parameter. This is the local linear estimator of the quantile regression function.
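The local linear estimator can be sketched in the same brute-force spirit as a weighted linear program: with positive kernel weights, a minimizer of the weighted quantile loss can again be found among the lines through pairs of observations. The sketch below assumes a Gaussian kernel and a user-supplied bandwidth. Both are choices made here for illustration, and the search over all pairs is only practical for very small samples.

```python
import math
from itertools import combinations


def quantile_loss(u, tau):
    """Quantile loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def gaussian_kernel(u):
    """Standard Gaussian density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear_quantile(x0, xs, ys, tau, h):
    """Local linear estimate of q_tau(x0): the fitted intercept a."""
    weights = [gaussian_kernel((x - x0) / h) for x in xs]
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * (xs[i] - x0)  # value of the candidate line at x0
        loss = sum(w * quantile_loss(y - a - b * (x - x0), tau)
                   for w, x, y in zip(weights, xs, ys))
        if best is None or loss < best[0]:
            best = (loss, a)
    if best is None:
        raise ValueError("need at least two observations with distinct x")
    return best[1]


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0, 3.5, 50.0, 6.5, 8.0]  # y = 2 + 3x with one outlier at x = 1
print(local_linear_quantile(1.0, xs, ys, tau=0.5, h=1.0))
```

The bandwidth h controls how quickly the weights decay away from x0, which is why its selection is discussed next.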

As with any smoothing method, the bandwidth $h_\tau$ has a strong influence on the resulting estimate. Several authors have addressed the problem of bandwidth selection; see [43], [1], [44] or [18].

One of the main approaches to bandwidth selection is the plug-in technique, which consists of minimizing the dominant terms of the mean integrated squared error (MISE) of the estimator. [17] established the asymptotic MISE for local linear quantile regression when $n \to \infty$, $h_\tau = h_\tau(n) \to 0$ and $n h_\tau \to \infty$, which is given by

\[ \mathrm{MISE}(\hat{q}_{\tau, h_\tau}) \cong \frac{1}{4} h_\tau^4\, \mu_2(K)^2 \int q_\tau^{(2)}(x)^2\, g(x)\, dx + \frac{R(K)\, \tau(1 - \tau)}{n h_\tau} \int \frac{1}{f(q_\tau(x) \mid X = x)^2}\, dx, \tag{3.1} \]

where $g$ is the density of $X$, $f(q_\tau(x) \mid X = x)$ is the conditional density of $Y$ at $q_\tau(x)$ given $X = x$, $q_\tau^{(i)}(x) = \partial^i q_\tau(x) / \partial x^i$, $\mu_i(K) = \int u^i K(u)\, du$ and $R(K) = \int K^2(u)\, du$.

Moreover, in view of (3.1), an asymptotically optimal bandwidth can be derived as

\[ h_{AMISE, \tau} = \left[ \frac{R(K)\, \tau(1 - \tau) \int f(q_\tau(x) \mid X = x)^{-2}\, dx}{n\, \mu_2(K)^2 \int q_\tau^{(2)}(x)^2\, g(x)\, dx} \right]^{1/5}. \tag{3.2} \]


Note that $\mu_2(K)$ and $R(K)$ are obtained from the kernel function, while the two integrals in (3.2) are unknown and have to be estimated. Expression (3.2) is quite similar to the plug-in rule for mean regression, but again the sparsity function plays an important role. Because of these similarities with mean regression, [43] proposed using the bandwidth selector of [36] with some simple transformations based on the assumptions of homoscedasticity (so that the curvature is the same for any τ as in mean regression) and error normality (which allows the sparsity to be estimated from the conditional variance). As a result, the plug-in rule of Yu and Jones (1998) is obtained:

\[ h_{\tau, YJ} = \left[ \frac{\tau(1 - \tau)}{\phi(\Phi^{-1}(\tau))^2} \right]^{1/5} h_{RSW}, \tag{3.3} \]

where $\phi$ and $\Phi$ denote the standard Gaussian density and distribution functions, and $h_{RSW}$ is selected by the plug-in rule proposed by [36].
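The rescaling in (3.3) is straightforward to compute once a mean-regression bandwidth is available. In the sketch below, `h_mean` stands for the bandwidth delivered by the selector of [36], which is not implemented here; it is simply passed in as a number. The function name is illustrative.

```python
from statistics import NormalDist


def yu_jones_bandwidth(tau, h_mean):
    """Rescale a mean-regression bandwidth as in (3.3)."""
    std_normal = NormalDist()
    density_at_quantile = std_normal.pdf(std_normal.inv_cdf(tau))
    ratio = tau * (1.0 - tau) / density_at_quantile ** 2
    return h_mean * ratio ** 0.2  # fifth root of the ratio


for tau in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(tau, round(yu_jones_bandwidth(tau, h_mean=1.0), 4))
```

The factor is symmetric around τ = 0.5 and increases towards extreme quantiles, so more smoothing is applied where the data are sparser.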

On the other hand, [1] suggested a modification of the classical cross-validation function that consists of replacing the squared loss criterion by the quantile loss function. Bearing this idea in mind, a cross-validation procedure can be applied to select the bandwidth parameter associated with a kernel quantile regression, as follows:

\[ h_{\tau, CV} = \arg\min_{h} CV(h) = \arg\min_{h} \sum_{i=1}^{n} \rho_\tau\bigl(Y_i - \hat{q}^{-i}_{\tau, h}(X_i)\bigr), \]

where $\hat{q}^{-i}_{\tau, h}(X_i)$ is the estimator of the τ-th quantile function obtained from the sample without the i-th individual, that is, the classical leave-one-out estimator, evaluated with bandwidth $h$.

More recently, [9] provided a plug-in bandwidth for local linear quantile regression based on expression (3.2) without imposing restrictions on the conditional variability or the error distribution. Instead, nonparametric estimates of the curvature at the given quantile τ are used, as well as nonparametric estimates of the sparsity. Moreover, they prove the convergence of their plug-in estimator to the optimal bandwidth, with the same convergence rate as in the classical mean regression context.

The aforementioned methods can be extended to the case of a multidimensional covariate. For instance, [44] extends the ideas of [43] to nonparametric additive models. Again, the goal is to reduce the problem to a mean regression context under assumptions of homoscedasticity and error normality and then use the selector presented by [32].

Finally, this section has focused on kernel smoothing techniques, although spline methods have been widely studied by several authors, such as [29] or [28]. For instance, [29] proposed estimating the function $q_\tau$ by solving the following optimization problem:

\[ \min \left[ \sum_{i=1}^{n} \rho_\tau\bigl(Y_i - q_\tau(X_i)\bigr) + \lambda\, V(\nabla q_\tau) \right], \tag{3.4} \]

where $V(\nabla q_\tau)$ denotes the total variation of the derivative of $q_\tau$ and λ represents the well-known smoothing parameter in this context. Moreover, [29] showed that the solution to (3.4) is a linear spline with knots at the points $X_i$, $i = 1, \dots, n$. Hence, a quantile smoothing spline model can be fitted using $\ell_1$-type linear programming techniques. They also proposed adapting the information criterion of [37] for the choice of the smoothing parameter λ involved in problem (3.4).

4. The sparsity function

In view of the asymptotic behaviour of the univariate, parametric and nonparametric quantile regression estimators, it is necessary to estimate the inverse of the density function evaluated at the quantile of interest. In the regression setup, this function plays a role analogous to that of the standard deviation of the errors in least squares estimation of the mean regression model.

It is perfectly natural that the precision of quantile estimates should depend

on the inverse of the density because it reflects the density of observations near

the quantile of interest. If the data are very sparse at the quantile of interest,

this quantile will be difficult to estimate. On the other hand, when the sparsity

is low and the density is high, the quantile is more precisely estimated.

We start by studying the sparsity function associated with a univariate random variable, without considering covariates or a regression scenario. Let $Y$ be a random variable with distribution function $F_Y$ and density function $f_Y$. [40] gave the name sparsity function to the inverse of the density function evaluated at the quantile, that is,

$$s(\tau) = \frac{1}{f_Y\bigl(F_Y^{-1}(\tau)\bigr)}.$$

Let us observe that the sparsity function is simply the derivative of the quantile function, that is,

$$\frac{\partial}{\partial t} F_Y^{-1}(t) = \frac{1}{f_Y\bigl(F_Y^{-1}(t)\bigr)} = s(t).$$
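As a quick check of this identity, consider an exponential distribution with rate $\lambda > 0$. This distribution is chosen here only because its quantile function has a closed form; it is not discussed in the original text.

$$F_Y(y) = 1 - e^{-\lambda y}, \qquad F_Y^{-1}(t) = -\frac{\log(1 - t)}{\lambda}, \qquad f_Y\bigl(F_Y^{-1}(t)\bigr) = \lambda e^{\log(1 - t)} = \lambda (1 - t),$$

$$s(t) = \frac{1}{\lambda (1 - t)} = \frac{\partial}{\partial t}\left( -\frac{\log(1 - t)}{\lambda} \right).$$

The sparsity grows without bound as $t \to 1$, reflecting the fact that observations become scarce in the right tail.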

Given a random sample $Y_1, \dots, Y_n$ of the variable $Y$, [38] proposed to estimate the sparsity by a simple difference quotient of the empirical quantile function, that is,

$$\hat s(\tau) = \frac{F_n^{-1}(\tau + h) - F_n^{-1}(\tau - h)}{2h} = \frac{Y_{[n(\tau + h)]} - Y_{[n(\tau - h)]}}{2h}, \tag{4.1}$$

where $F_n^{-1}$ is the empirical quantile function, $h$ is a bandwidth that tends to zero as the sample size tends to infinity, and $Y_{[z]}$ denotes the order statistics. Moreover, $[n(\tau \pm h)]$ are the orders neighbouring $\tau$, where $[a]$ denotes the integer part of $a$. Later, [4] showed that the value of the smoothing parameter that minimizes the asymptotic mean squared error of (4.1) is of order $n^{-1/5}$.
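The difference quotient in (4.1) can be sketched in a few lines. The function name, the clipping of the order-statistic indices to the range 1 to n, and the choice of the constant 1 in front of $n^{-1/5}$ are assumptions of this illustration, not part of the proposal of [38] or [4].

```python
import math
import random


def siddiqui_sparsity(sample, tau, h):
    """Difference-quotient estimate of the sparsity s(tau), as in (4.1).

    Uses the order statistics Y_[n(tau+h)] and Y_[n(tau-h)], where [a] is
    the integer part of a. Indices are clipped to 1..n, a practical
    safeguard that is an assumption of this sketch.
    """
    y = sorted(sample)
    n = len(y)
    upper = min(max(int(n * (tau + h)), 1), n)
    lower = min(max(int(n * (tau - h)), 1), n)
    # Order statistics are 1-based in the text; Python lists are 0-based.
    return (y[upper - 1] - y[lower - 1]) / (2.0 * h)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 20000
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    h = n ** (-1 / 5)  # order n^(-1/5); the constant 1 is arbitrary
    estimate = siddiqui_sparsity(sample, 0.5, h)
    truth = math.sqrt(2.0 * math.pi)  # 1 / phi(0) for the standard normal
    print(f"estimate = {estimate:.3f}, true sparsity = {truth:.3f}")
```

For standard normal data the true sparsity at the median is $\sqrt{2\pi}$, so the printed estimate can be compared directly with that value.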

[5] proposed a bandwidth selector for computing the nonparametric estimator of the sparsity. In addition, the author proved that the bandwidth

$$h_B = \sqrt[5]{\frac{4.5\, s(\tau)^2}{s^{(2)}(\tau)^2}}\; n^{-1/5}$$

is optimal from the standpoint of minimizing the mean squared error, where $s^{(2)}(\tau) = \frac{\partial^2}{\partial \tau^2} s(\tau)$.
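In practice $s(\tau)$ and $s^{(2)}(\tau)$ are unknown. One common shortcut, assumed here and not stated in the text above, is to evaluate their ratio under a standard normal reference distribution, for which $s(\tau)/s^{(2)}(\tau) = \phi(x)^2/(2x^2 + 1)$ with $x = \Phi^{-1}(\tau)$. The sketch below computes the resulting bandwidth; the function name is illustrative.

```python
from statistics import NormalDist


def bofinger_bandwidth_normal(n, tau):
    """Bofinger-type bandwidth h_B with a standard normal reference.

    Assumption of this sketch: s(tau) / s''(tau) is replaced by its value
    under N(0, 1), namely phi(x)^2 / (2 x^2 + 1) with x = Phi^{-1}(tau).
    """
    std_normal = NormalDist()
    x = std_normal.inv_cdf(tau)
    phi = std_normal.pdf(x)
    ratio_sq = (phi ** 2 / (2.0 * x ** 2 + 1.0)) ** 2  # s^2 / (s'')^2
    return (4.5 * ratio_sq) ** 0.2 * n ** (-0.2)


if __name__ == "__main__":
    for tau in (0.1, 0.5, 0.9):
        print(tau, round(bofinger_bandwidth_normal(1000, tau), 4))
```

Under this reference distribution the bandwidth is symmetric around the median and shrinks towards the tails, where the ratio $\phi(x)^2/(2x^2 + 1)$ is smaller.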

On the other hand, [19] examined the effect of the choice of smoothing parameter on the empirical level of tests, or the coverage of confidence intervals, based on Studentized quantiles. They showed that, in order to minimize the error in level or coverage, the bandwidth should be of smaller order than that required by mean squared error theory, such as the proposal of [5]. Bearing this idea in mind, [19] proposed the following smoothing parameter:

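$$h_{HS} = z_{\alpha/2}^{2/3}\, \sqrt[3]{\frac{1.5\, S_{d,n}}{|V_{h,n}|}}\; n^{-1/3},$$

where

$$S_{d,n} = \frac{n}{2d}\left(Y_{[t+d]} - Y_{[t-d]}\right),$$

$$V_{h,n} = 0.5 \left(\frac{n}{h}\right)^{3} \left(Y_{[r+2h]} - 2Y_{[r+h]} + 2Y_{[r-h]} - Y_{[r-2h]}\right),$$

$t = [n\tau] + 1$, $d = 0.5\, n^{4/5}$, $r = [0.5n] + 1$, $h = 0.25\, n^{8/9}$ and $z_{\alpha/2}$ satisfies $\Phi(z_{\alpha/2}) = 1 - \alpha/2$ with $\alpha = 0.05$, where $\Phi$ represents the standard Gaussian distribution function.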

Now, we are going to move to a regression scenario. Let us consider (X1, Y1),

· · · , (Xn, Yn) a random sample of two variables (X,Y ) ∈ Rd+1 drawn from a

linear quantile regression model such as (2.1). In this situation, [22] proposed


to estimate the density of the response variable Y given X = Xi as follows

$$\hat f_i = \frac{2 h_{HS}}{(\hat\theta_{\tau^+} - \hat\theta_{\tau^-})' P_i},$$

where hHS represents a smoothing parameter associated with sparsity estimation

for Y (without regression) as that given by [19] and θτ+ and θτ− represent the

estimated coefficients of the linear model for neighbouring quantiles

$$\tau^+ = \frac{[n\tau] + n h_{HS} + 1}{n} \quad \text{and} \quad \tau^- = \frac{[n\tau] - n h_{HS} + 1}{n}.$$

In finite samples, [22] proposed the following modified estimator to guard against possible crossing of the estimated quantiles:

$$\hat f_i = \max\left\{0,\; \frac{2 h_{HS}}{(\hat\theta_{\tau^+} - \hat\theta_{\tau^-})' P_i - \delta}\right\},$$

where $\delta$ is a small positive constant included in order to avoid a zero denominator.

The proposal of [22] assumes a global linear model and is intended for inference about its coefficients. To this end, the sparsity was estimated by $1/\hat f_i$ using information from neighbouring quantiles. This procedure will work properly only when the relation between X and Y can be fitted by a linear model for different values of τ.

The sparsity function in a general regression context has not been thoroughly analysed in the literature. [9] presented the first nonparametric sparsity estimator for the regression context. Since the sparsity is the derivative of the quantile regression function, $q_\tau(x)$, with respect to $\tau$, they propose the difference-quotient estimate

$$\hat s_{\tau, d_s, h_s}(x) = \frac{\hat q_{\tau + d_s, h_s}(x) - \hat q_{\tau - d_s, h_s}(x)}{2 d_s}, \tag{4.2}$$

where $\hat q_{\tau + d_s, h_s}$ and $\hat q_{\tau - d_s, h_s}$ are local linear quantile regression estimates at the quantile orders $(\tau + d_s)$ and $(\tau - d_s)$, respectively, and $h_s$ denotes their bandwidth.

Note that two pilot bandwidths, $d_s$ and $h_s$, are needed to use estimator (4.2). The bandwidth $d_s$ acts along the Y-axis and plays a role similar to that of the bandwidth $d_j$ in the rule of thumb. The bandwidth $h_s$ is needed to compute the nonparametric estimates of the regression functions. To select these smoothing parameters, one can use the plug-in technique, which consists of minimizing the dominant terms of the mean integrated squared error (MISE) of the estimator given in (4.2). [9] derived the mean squared error of this sparsity estimator in order to obtain optimal bandwidths $d_s$ and $h_s$.

5. Lack-of-fit tests for quantile regression

The lack-of-fit (or, in opposite terms, goodness-of-fit) of a statistical model describes how well it fits a set of observations. At the beginning of the twentieth century, Pearson introduced the term goodness-of-fit, whose main goal is to measure the discrepancy between observed values and the values expected under a specific model. In this section we present a brief introduction to lack-of-fit tests for quantile regression models.

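Let us consider a regression model associated with a quantile of interest $\tau \in (0, 1)$,

$$Y = q_\tau(X) + \varepsilon,$$

where $\varepsilon$ is the unknown model error, which satisfies $P(\varepsilon \le 0 \mid X) = \tau$. In this new scenario, the main goal will be to carry out the following lack-of-fit test:

$$\begin{cases} H_0: q_\tau \in Q_\theta = \{q_\tau(\cdot, \theta) : \theta \in \Theta \subset \mathbb{R}^q\} & \text{(null hypothesis)} \\ H_a: q_\tau \notin Q_\theta & \text{(alternative hypothesis)} \end{cases}$$

where the null hypothesis is equivalent to

$$H_0: E\left[I\bigl(Y \le q_\tau(X, \theta_\tau)\bigr) \mid X\right] = \tau,$$

for some $\theta_\tau \in \Theta \subset \mathbb{R}^q$.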

Then, given {(X1, Y1), · · · , (Xn, Yn)} a random sample of the variables (X,Y )

∈ Rd+1, we are going to review different goodness-of-fit tests in the quantile re-

gression context available from the literature.

Lack-of-fit tests based on smoothing ideas

Regarding the lack-of-fit tests for quantile regression based on smoothing

ideas, we should highlight the work developed by [46] that extends the well-

known test proposed by [45] to the quantile regression setup. In this case, the

test statistic is given by

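$$T_n^{Z} = \frac{n h^{d/2}}{\hat\sigma}\, \frac{1}{n(n-1)} \sum_{i \ne j} \frac{1}{h^d} K\!\left(\frac{X_i - X_j}{h}\right) \left[I\bigl(Y_i \le q_\tau(X_i, \hat\theta_\tau)\bigr) - \tau\right] \left[I\bigl(Y_j \le q_\tau(X_j, \hat\theta_\tau)\bigr) - \tau\right], \tag{5.1}$$

where $K$ is the kernel function, $h$ is the smoothing parameter and

$$\hat\sigma^2 = 2\tau^2(1-\tau)^2\, \frac{1}{n(n-1)} \sum_{i \ne j} \frac{1}{h^d} K^2\!\left(\frac{X_i - X_j}{h}\right).$$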

Under the null hypothesis, the statistic (5.1) converges in distribution to a Gaussian law. Note, however, that the test inherits the well-known problem of selecting the smoothing parameter h.
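The statistic (5.1) can be computed directly from its definition. The sketch below does so for a scalar covariate (d = 1) with a Gaussian kernel. Both choices, the function names and the use of a constant fitted median in the demonstration are assumptions made for illustration; the code takes the indicators $I(Y_i \le q_\tau(X_i, \hat\theta_\tau))$ as input and does not fit the quantile regression model itself.

```python
import math
import random


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def zheng_statistic(x, below, tau, h):
    """Statistic (5.1) for a scalar covariate (d = 1).

    x     : covariate values X_1, ..., X_n
    below : 0/1 indicators I(Y_i <= fitted tau-quantile at X_i)
    """
    n = len(x)
    e = [b - tau for b in below]
    num = 0.0  # sum over i != j of K(.) / h * e_i * e_j
    ksq = 0.0  # sum over i != j of K(.)^2 / h
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            k = gaussian_kernel((x[i] - x[j]) / h)
            num += k / h * e[i] * e[j]
            ksq += k * k / h
    pairs = n * (n - 1)
    sigma = math.sqrt(2.0 * tau ** 2 * (1.0 - tau) ** 2 * ksq / pairs)
    return n * math.sqrt(h) * (num / pairs) / sigma


if __name__ == "__main__":
    rng = random.Random(42)
    n = 200
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Misspecified case: the true median is quadratic in x, but a constant
    # median (the sample median of y) is used as the fitted model.
    ys = [x * x + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
    med = sorted(ys)[n // 2]
    below = [1 if y <= med else 0 for y in ys]
    print("misspecified model:", zheng_statistic(xs, below, 0.5, 0.2))
```

Large positive values point towards rejection of the parametric model, since neighbouring observations then tend to fall on the same side of the fitted quantile. The double loop costs O(n²) operations, which is acceptable for moderate sample sizes.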

Following the idea of [46], [13] proposed a lack-of-fit test for additive quantile models based on smoothing ideas. In this context, the null hypothesis to be tested is

$$H_0: q_\tau(X) = q_\tau\bigl(X^{(1)}, \dots, X^{(d)}\bigr) = \sum_{i=1}^{d} q_{\tau,i}\bigl(X^{(i)}\bigr) + c(\tau),$$

where $X = (X^{(1)}, \dots, X^{(d)}) \in \mathbb{R}^d$ denotes the explanatory variable.

Given a random sample of the variables $(X, Y) \in \mathbb{R}^{d+1}$, [13] proposed the following test statistic:

$$T_n^{DGN} = \frac{1}{n(n-1)h^d} \sum_{i=1}^{n} \sum_{j \ne i} K\!\left(\frac{X_i - X_j}{h}\right) \hat R_i \hat R_j,$$

where $K$ represents the kernel function, $h$ is the smoothing parameter and

$$\hat R_i = I\bigl(Y_i \le \hat q_\tau^{-i}(X_i)\bigr) - \tau,$$

where $\hat q_\tau^{-i}(X_i)$ denotes an additive estimate of the quantile regression function computed without the $i$-th observation. Although asymptotic convergence to a Gaussian distribution has been established, a bootstrap procedure is recommended in order to calibrate this test.

Lack-of-fit tests based on empirical regression processes

Extending the work developed by [39] to the quantile regression setting, [21] proposed an omnibus lack-of-fit test for parametric quantile regression based on a cumulative sum process of the gradient vector. That is, [21] based their test on the process

$$R_n^{HZ}(t) = n^{-1/2} \sum_{i=1}^{n} \psi_\tau\bigl(Y_i - q_\tau(X_i, \hat\theta_\tau)\bigr)\, q_\tau^{(1)}(X_i, \hat\theta_\tau)\, I(X_i \le t), \tag{5.2}$$

where $\psi_\tau(r) = \tau I(r > 0) + (\tau - 1) I(r < 0)$, $q_\tau^{(1)}(x, \theta) = \frac{\partial}{\partial\theta} q_\tau(x, \theta)$, and $\hat\theta_\tau$ is an estimator of $\theta_\tau$. The test statistic proposed by [21] is then defined as

$$T_n^{HZ} = \text{largest eigenvalue of } n^{-1} \sum_{i=1}^{n} R_n^{HZ}(X_i)\, R_n^{HZ}(X_i)'.$$

[21] proved that the empirical process (5.2) converges to a Gaussian process with mean 0 and covariance function

$$W(t_1, t_2) = \tau(1-\tau) \Bigl( E\bigl[q_\tau^{(1)}(X, \theta_\tau)\, q_\tau^{(1)}(X, \theta_\tau)'\, I(X \le \min(t_1, t_2))\bigr] - S(t_1) S^{-1} S(t_2) \Bigr),$$

where

$$S = E\bigl[q_\tau^{(1)}(X, \theta_\tau)\, q_\tau^{(1)}(X, \theta_\tau)'\bigr],$$

$$S(t) = E\bigl[q_\tau^{(1)}(X, \theta_\tau)\, q_\tau^{(1)}(X, \theta_\tau)'\, I(X \le t)\bigr].$$

Given that simulating the Gaussian process is not easy, [21] proposed a multiplier bootstrap in order to calibrate their test.

Lack-of-fit tests designed for avoiding the curse of dimensionality

It is well known that a high (or even moderate) dimension of the covariate may affect the performance of specification tests. In this line, [42] used a test of the type proposed by He and Zhu [21], defining ranks over the covariates in order to test a linear quantile regression model. He considered the following empirical process:

$$R_n^{W}(t) = n^{-1/2} \sum_{i=1}^{n} \psi_\tau(r_i)\, P_i\, I(F_i \le t),$$

where $r_i = Y_i - \hat\theta_\tau' P_i$ represents the residuals and $F_i = \max_j U_{ij}$, where $U_{ij}$ represents the ranks of the $n$ values of the $j$-th column of the design matrix, represented by $X$,¹ for each $j = 2, \dots, d+1$. Consequently, the test statistic will be

$$T_n^{W} = \text{largest eigenvalue of } \int R_n^{W}(t)\, [R_n^{W}(t)]'\, dF_{n,W}(t),$$

where $F_{n,W}$ is the empirical distribution function of the variables $F_i$.

The proposal of [42] has the virtue of simplicity but does not provide an omnibus test, i.e., it is not consistent against all alternatives. To solve this problem, [10] presented an omnibus lack-of-fit test for quantile regression models that is suitable even with high-dimensional covariates. This test is based on the cumulative sum of residuals with respect to unidimensional linear projections of the covariates, following the ideas of [15] for the mean regression context. Their test statistic is defined as

$$T_n^{CSG} = \text{largest eigenvalue of } \int_{\Pi} R_n^{CSG}(\beta, u)\, [R_n^{CSG}(\beta, u)]'\, F_{n,\beta}(du)\, d\beta,$$

where

$$R_n^{CSG}(\beta, u) = n^{-1/2} \sum_{i=1}^{n} \psi_\tau\bigl(Y_i - q_\tau(X_i, \hat\theta_\tau)\bigr)\, q_\tau^{(1)}(X_i, \hat\theta_\tau)\, I(\beta' X_i \le u),$$

$\Pi = S^d \times [-\infty, +\infty]$, $S^d$ is the unit sphere on $\mathbb{R}^d$, and $F_{n,\beta}$ is the empirical distribution of the projected covariates $\beta' X_1, \dots, \beta' X_n$.

¹ The design matrix is an $n \times (d+1)$ matrix whose $j$-th row is given by $(1, X_j)'$, where $\{X_1, \dots, X_n\}$ is a random sample of the explanatory variable $X$.

On the other hand, [30] adapted the ideas of [46] to a multivariate scenario. The main difference between the two tests is that the test of [30] only involves unidimensional kernel smoothing, so that the rate at which it detects local alternatives does not depend on the dimension of the covariate. This lack-of-fit test is based on the following test statistic:

$$T_n^{MLP} = \frac{n h^{1/2}}{\hat\sigma}\, \frac{1}{n(n-1)} \sum_{i \ne j} \frac{1}{h} K\!\left(\frac{W_i - W_j}{h}\right) \psi(Z_i - Z_j) \left[I\bigl(Y_i \le q_\tau(X_i, \hat\theta_\tau)\bigr) - \tau\right] \left[I\bigl(Y_j \le q_\tau(X_j, \hat\theta_\tau)\bigr) - \tau\right],$$

where

$$\hat\sigma^2 = \frac{2\tau^2(1-\tau)^2}{n(n-1)} \sum_{i \ne j} \frac{1}{h} K\!\left(\frac{W_i - W_j}{h}\right)^{2} \psi(Z_i - Z_j)^2,$$

$K$ and $\psi$ are bounded, even, integrable functions with (almost everywhere) positive Fourier transforms, and $h$ represents the univariate smoothing parameter. Note that they assumed that the covariate can be written as $X = (W, Z) \in \mathbb{R}^d$, where $W$ is a unidimensional continuous random variable while $Z$ may include both continuous and discrete variables.

We have mentioned some examples, but other specification tests for quantile regression models can be found in the literature: [23] tested whether the conditional median function is linear against a nonparametric alternative with unknown smoothness; [41] considered an empirical likelihood method to estimate the parameters of quantile regression models and to construct confidence regions; [33] considered two empirical likelihood-based estimation, inference and specification testing methods for quantile regression models; and [16] introduced a nonparametric test for the correct specification of a linear conditional quantile function over a continuum of quantile levels.


6. Conclusions

Although mean regression is still a traditional benchmark in regression stud-

ies, the quantile approach is receiving increasing attention, because it allows a

more complete description of the conditional distribution of the response given

the covariate, and it is more robust to deviations from error normality. That is,

while classical regression gives only information on the conditional expectation,

quantile regression extends the viewpoint to the whole conditional distribution of the response variable.

This work has presented an introduction to quantile regression methods. Parametric and nonparametric methods have been introduced and the main advantages of these procedures have been mentioned. Finally, some lack-of-fit tests for quantile regression have been reviewed.

Acknowledgements. The authors gratefully acknowledge the support of

Projects MTM2013–41383–P (Spanish Ministry of Economy, Industry and Com-

petitiveness) and MTM2016–76969–P (Spanish State Research Agency, AEI),

both co–funded by the European Regional Development Fund (ERDF). Support

from the IAP network StUDyS, from Belgian Science Policy, is also acknowl-

edged.

References

[1] Abberger, K. (1998). Cross-validation in nonparametric quantile regres-

sion. Allgemeines Statistisches Archiv, 82, 149-161.

[2] Bahadur, R. R. (1966). A note on quantiles in large samples. The Annals

of Mathematical Statistics, 37, 577-580.

[3] Barrodale, I. and Roberts, F. D. K. (1973). An improved algorithm for

discrete L1 linear approximation. SIAM Journal on Numerical Analysis,

10, 839-848.

[4] Bloch, D. A. and Gastwirth, J. L. (1968). On a simple estimate of the reciprocal of the density function. The Annals of Mathematical Statistics, 39, 1083-1085.

[5] Bofinger, E. (1975). Estimation of a density function using order statis-

tics. Australian Journal of Statistics, 17, 1-7.

[6] Chatterjee, A. (2011). Asymptotic properties of sample quantiles from a finite population. Annals of the Institute of Statistical Mathematics, 63, 157-179.


[7] Chaudhuri, P. (1991a). Nonparametric estimates of regression quantiles

and their local Bahadur representation. The Annals of Statistics, 19,

760-777.

[8] Chaudhuri, P. (1991b). Global nonparametric estimation of conditional

quantile functions and their derivatives. Journal of Multivariate Analysis,

39, 246-269.

[9] Conde-Amboage, M. and Sánchez-Sellero, C. (2018). A plug-in bandwidth selector for nonparametric quantile regression. TEST. https://doi.org/10.1007/s11749-018-0582-6.

[10] Conde-Amboage, M., Sánchez-Sellero, C. and González-Manteiga, W. (2015). A lack-of-fit test for quantile regression models with high-dimensional covariates. Computational Statistics & Data Analysis, 88, 128-138.

[11] Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regres-

sion. Chapman and Hall.

[12] Davino, C., Furno, M. and Vistocco, D. (2014). Quantile regression: the-

ory and applications. John Wiley & Sons.

[13] Dette, H., Guhlich, M., and Neumeyer, N. (2015). Testing for additivity

in nonparametric quantile regression. Annals of the Institute of Statistical

Mathematics, 67, 437-477.

[14] El Bantli, F. and Hallin, M. (1999). L1-estimation in linear models with

heterogeneous white noise. Statistics & Probability Letters, 45, 305-315.

[15] Escanciano, J.C. (2006). A consistent diagnostic test for regression mod-

els using projections. Econometric Theory, 22, 1030-1051.

[16] Escanciano, J.C. and Goh, S.C. (2014). Specification analysis of linear

quantile models. Journal of Econometrics, 178, 495-507.

[17] Fan, J., Hu, T. C. and Truong, Y. K. (1994). Robust nonparametric

function estimation. Scandinavian Journal of Statistics, 21, 433-446.

[18] El Ghouch, A. and Genton, M. G. (2009). Local polynomial quantile regression with parametric features. Journal of the American Statistical Association, 104, 1416-1429.

[19] Hall, P. and Sheather, S. J. (1988). On the distribution of a studentized

quantile. Journal of the Royal Statistical Society. Series B (Methodolog-

ical), 50, 381-391.



[20] Hampel, F. R. (1974). The influence curve and its role in robust estima-

tion. Journal of the American Statistical Association, 69, 383-393.

[21] He, X. and Zhu, L.-X. (2003). A lack-of-fit test for quantile regression.

Journal of the American Statistical Association, 98, 1013-1022.

[22] Hendricks, W. and Koenker, R. (1992). Hierarchical spline models for

conditional quantiles and the demand for electricity. Journal of the Amer-

ican Statistical Association, 87, 58-68.

[23] Horowitz, J.L. and Spokoiny, V.G. (2002). An adaptive, rate-optimal

test of linearity for median regression models. Journal of the American

Statistical Association, 97, 822-835.

[24] Hyndman, R. J. and Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50, 361-365.

[25] Koenker, R. (2005). Quantile Regression. Cambridge: Cambridge Uni-

versity Press.

[26] Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica,

46, 33-50.

[27] Koenker, R. and D’Orey, V. (1987). Computing regression quantiles.

Journal of the Royal Statistical Society. Series C (Applied Statistics),

36, 383-393.

[28] Koenker, R. and Mizera, I. (2004). Penalized triograms: total variation

regularization for bivariate smoothing. Journal of the Royal Statistical

Society. Series B (Statistical Methodology), 66, 145-163.

[29] Koenker, R., Ng, P. and Portnoy, S. (1994). Quantile smoothing splines.

Biometrika, 81, 673-680.

[30] Maistre, S., Lavergne, P. and Patilea, V. (2017). Powerful nonparametric checks for quantile regression. Journal of Statistical Planning and Inference, 180, 13-29.

[31] Maronna, R.A. and Yohai, V.J. (1981). Asymptotic behavior of general

M-estimates for regression and scale with random carriers. Probability

Theory and Related Fields, 58, 7-20.

[32] Opsomer, J. D. and Ruppert, D. (1998). A fully automated bandwidth

selection method for fitting additive models. Journal of the American



[33] Otsu, T. (2008). Conditional empirical likelihood estimation and infer-

ence for quantile regression models. Journal of Econometrics, 142, 508-

538.

[34] Rousseeuw, P. J. (1984). Least median of squares regression. Journal of

the American Statistical Association, 79, 871-880.

[35] Rousseeuw, P. J. and Hubert, M. (1999). Regression depth. Journal of

the American Statistical Association, 94, 388-402.

[36] Ruppert, D., Sheather, S.J. and Wand, M.P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American


[37] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of

Statistics, 6, 461-464.

[38] Siddiqui, M. M. (1960). Distribution of quantiles in samples from a bivari-

ate population. Journal of Research of the National Bureau of Standards,

64B, 145-150.

[39] Stute, W. (1997). Nonparametric model checks for regression. The Annals

of Statistics, 25, 613-641.

[40] Tukey, J. W. (1965). Which part of the sample contains the information? Proceedings of the National Academy of Sciences, 53, 127-134.

[41] Whang, Y.-J. (2006). Smoothed empirical likelihood methods for quantile

regression models. Econometric Theory, 22, 173-205.

[42] Wilcox, R. R. (2008). Quantile regression: A simplified approach to a

goodness-of-fit test. Journal of Data Science, 6, 547-556.

[43] Yu, K. and Jones, M. C. (1998). Local linear quantile regression. Journal of the American Statistical Association, 93, 228-237.

[44] Yu, K. and Lu, Z. (2004). Local linear additive quantile regression. Scan-

dinavian Journal of Statistics, 31, 333-346.

[45] Zheng, J. X. (1996). A consistent test of functional form via nonpara-

metric estimation techniques. Journal of Econometrics, 75, 263-289.

[46] Zheng, J. X. (1998). A consistent nonparametric test of parametric

regression models under conditional quantile restrictions. Econometric

Theory, 14, 123-138.


Operations Research

An ABC algorithm for solving the post-disaster resources

distribution problem, a case study

Henry Lamos, Karin Aguilar, Daniel Martínez, Andrés Barrera and Angie Hernández

Facultad de Ingenierías Físico-mecánicas

Universidad Industrial de Santander




Abstract

This paper addresses a Capacitated Vehicle Routing Problem (CVRP) for the distribution of humanitarian resources after a seismic disaster in the city of Bucaramanga, Colombia. The main objective of the model is to find distribution routes that meet the demand of temporary shelters. To solve the proposed problem, we used an Artificial Bee Colony (ABC) algorithm, modified with evolutionary operators, that minimizes the total response time. The ABC algorithm was validated on 10 test instances, using the Go programming language. The main contribution of the work is the construction of a computational geographic information tool that supports decision making in a disaster event.

Keywords: Capacitated Vehicle Routing Problem, Artificial Bee Colony

Algorithm, Humanitarian Resources, Disaster Management

AMS Subject classifications: 90C59, 90C08, 90B06

1. Introduction

In recent years, humanity has witnessed a large number of natural disasters, which have caused significant human, environmental and economic losses. According to the Centre for Research on the Epidemiology of Disasters (CRED), 2,239 disasters occurred between 2010 and 2015, leaving a total of 442,440 people dead and around US$941 billion in damage. The United States Federal Emergency Management Agency (FEMA) defines a disaster as the occurrence of a natural catastrophe, technological accident or human-caused event that results in severe property damage, deaths and/or multiple injuries. In this context, disaster management is understood as the set of processes designed to be implemented before, during and after disasters in order to prevent or mitigate their effects. Disaster management activities are framed in a cyclic process of four stages: mitigation, preparedness, response and recovery. The first two cover activities carried out before the disruptive event that reduce its expected effects, whereas the last two take place after the disaster has occurred and aim, respectively, to meet the basic needs of the population in the short term and to achieve its complete rehabilitation in the long term [14].

How to respond efficiently to a disaster so that its effects on the population and on resources are minimized has become a critical issue in disaster management, attracting the interest of academics and practitioners. One of the great challenges in response activities is the mobilization of resources [42], which can be viewed as a vehicle routing problem (VRP) in the areas affected by the disaster, where goods and services are delivered from a distribution point to the beneficiaries. The VRP is NP-hard and was proposed by [8] for a fuel distribution problem. It is one of the most popular models in humanitarian logistics, where the focus is mainly on the speed of distribution and the satisfaction of demand rather than on total operating costs [3]. A wide variety of literature on the topic has been published so far. Literature reviews and surveys focused on the distribution of humanitarian aid have been presented by [22, 3, 33], who recognize the potential of the area and identify limitations, trends and challenges. Different variants of the VRP have been applied to aid distribution problems in humanitarian contexts, including multi-commodity [34, 35, 15, 24], multi-period [45, 44, 39, 46] and multi-depot [40, 41] approaches, as well as combinations of these [26, 35, 4, 1, 36, 20]. In many of these studies the problem is formulated as an integer programming model and solved either by exact methods based on Branch and Bound, Branch and Cut and Branch and Price relaxations [9, 10] or by complex heuristics [23, 28, 21, 17, 46, 2, 6, 29]; the latter are the most suitable for finding good solutions in a reasonable amount of time for real-size problems.

Because of the dynamic environment found in disaster management activities, flexible and robust tools that enable agile decision making need to be developed. However, research on aid distribution still has shortcomings as a tool for the decision maker. On the one hand, although in recent years there have been attempts to integrate information systems into logistics models, professional software lacks sophisticated mathematical models, and therefore optimal results, even though it offers user-friendly graphical interfaces and fast solutions [26]. On the other hand, academic research develops robust models whose solutions are difficult for the decision maker to understand. Some research in disaster management has tried to integrate mathematical models with a geographic information system, real-time data and an easy-to-use interface [18, 32, 43, 16], with the aim of combining existing technologies with optimization tools and facing the practical challenges of humanitarian operations. Vitoriano et al. [41] propose a multicriteria optimization model that is the core of a DSS (Decision Support System) under development, aimed at the organizations in charge of distributing humanitarian aid. Later, Rodríguez et al. [30] describe and apply a methodology for building SEDD (Expert System for Disaster Diagnosis), a prototype of a two-level, data-based DSS that provides a damage assessment for multiple disaster scenarios in order to support humanitarian NGOs involved in the response to natural disasters. Fikar et al. [11] present a DSS that makes it possible to plan last-mile shipments of relief items and to test the impact of different fleet configurations on the coordination of the shipments. Similarly to what is proposed in our research, Gatica et al. [12] present a web application that allocates super-depots and establishes the vehicle routing needed to cover the distribution centres, considering various probabilities for the populations to be covered; that work integrates optimization tools with data provided by the Google Maps platform. Finally, Zhao et al. [47] address a geospatial multiobjective optimization problem, for which they design and develop a tool called UERFLsOptimizer that optimizes the location of rescue facilities by means of a multiobjective evolutionary algorithm and facilitates the processing of geographic data together with the optimization model.

The main contribution of this study is the design and construction of a web tool, built on the Google Maps geographic information system and the Go programming language, that facilitates the collection of information on the potential shelters (customer nodes) that would be opened after a seismic disaster, the generation of the origin-destination matrix, the implementation of the ABC algorithm and the presentation of the results of the resource distribution problem on the map of the city under study. In addition, this work presents a model for scheduling the distribution of humanitarian resources from a distribution centre to temporary shelters with the aim of minimizing the response time. A heuristic solution approach is proposed, based on an Artificial Bee Colony (ABC) algorithm with an evolutionary component. Examples from the literature were used to validate the proposed algorithm. The rest of the paper is organized as follows: Section 2 describes the problem to be solved; Section 3 presents the architecture of the tool; Section 4 describes the transport network considered for the case study and its geocoding; Section 5 presents the results of the tool for the case study; finally, Section 6 presents the conclusions and recommendations for future work.

2. Problem description

Distributing resources after a disaster consists of guaranteeing the optimal flow of goods and services in order to reduce the vulnerability of the victims to the weather, insecurity, physical injuries and diseases to which they are exposed [14]. Resource distribution is a complex operation for many reasons, the main one being that the needs of the victims and the vital resources within the emergency cannot be known with certainty and in real time.

In seismic disaster situations the type of resources to be distributed is independent of the demand [25]; that is, some of the resources demanded will probably not be supplied, because the supply of humanitarian aid is not determined by the demand for it but by the donations made by national bodies and foreign countries. Some of the resources demanded are supplied to the affected population only once. Others, however, are supplied with a certain frequency, such as medicines, food, drinking water, materials for rebuilding the area and personal protective equipment, among others [39]. The complexity of the allocation model grows as the number of demand points and types of supplies increases; for this reason, some authors simplify the model by reducing the variety of supply types to kits per family. This simplification is used by the International Red Cross and other support organizations.

Routes and vehicle characteristics are determined by the type of goods and services to be transported. Analysing these two variables jointly is one way to solve the resource distribution problem [13]. The performance and efficiency with which victims are served depend not only on vehicle capacity and route length but also on other factors, such as uncertainty about the state of the roads [5]. The vehicle routing problem (VRP) is the model most often used to represent the conditions of humanitarian resource distribution. It consists of determining the optimal routes for a fleet of vehicles, based at one or more depots, to serve a set of customers. This problem plays a fundamental role in logistics management [27]. In this study the resource distribution problem is modelled as a variant of the VRP known as the CVRP (Capacitated Vehicle Routing Problem). In a CVRP, in addition to the standard VRP conditions, vehicle capacity is limited and so is the total demand assigned to each route. The basic characteristics of this problem are: customer demands are deterministic and known in advance; an order cannot be served by more than one vehicle; the fleet is homogeneous; and there is a single depot. In this work, the humanitarian resource distribution problem is formulated as the CVRP presented in [38], with the objective function replaced by the total distribution time.
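The basic CVRP conditions listed above can be checked mechanically. The following is a minimal sketch, not the formulation of [38]: the function name `is_feasible`, the data layout and the demand values are assumptions made for illustration.

```python
from typing import Dict, List

def is_feasible(routes: List[List[int]], demand: Dict[int, int], capacity: int) -> bool:
    """Check the basic CVRP conditions for a set of routes (depot excluded)."""
    visited = [c for route in routes for c in route]
    # Every customer is served exactly once, by a single vehicle.
    if sorted(visited) != sorted(demand):
        return False
    # The total demand on each route must not exceed the vehicle capacity.
    return all(sum(demand[c] for c in route) <= capacity for route in routes)

# Hypothetical instance: four customers, homogeneous fleet with capacity 10.
demand = {1: 4, 2: 6, 3: 5, 4: 3}
print(is_feasible([[1, 2], [3, 4]], demand, capacity=10))  # True
print(is_feasible([[1, 2, 3], [4]], demand, capacity=10))  # False: first route carries 15
```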

Figure 1: BEE-UIS architecture

3. BEE-UIS architecture

The BEE-UIS platform solves the humanitarian aid distribution problem. It is divided into three modules: decoder, requirements and ABC. Figure 1 shows the architecture of the platform. The platform requires the geolocation of the shelters and of the distribution centre considered in the problem, supplied as a KML (Keyhole Markup Language) file. After receiving the geolocations, the platform decodes them in the decoder module so that the user can enter the detailed requirements; the origin-destination matrix is obtained from the Google Maps API, and the solution is then presented to the user through the graphical interface.
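As an illustration of what a decoding step of this kind might involve, the sketch below reads placemark names and coordinates from a KML string with the Python standard library. It is not the BEE-UIS decoder: the function name `read_placemarks`, the sample document and its coordinates are hypothetical. KML stores each point as `longitude,latitude[,altitude]`.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

def read_placemarks(kml_text):
    """Return (name, latitude, longitude) for every point placemark."""
    root = ET.fromstring(kml_text)
    points = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        name = pm.findtext("kml:name", default="", namespaces=KML_NS)
        coords = pm.findtext(".//kml:Point/kml:coordinates", default="", namespaces=KML_NS)
        if not coords.strip():
            continue  # skip placemarks that are not points
        lon, lat = (float(v) for v in coords.strip().split(",")[:2])
        points.append((name.strip(), lat, lon))
    return points

# Hypothetical sample: one depot and one shelter.
SAMPLE = """<kml xmlns="http://www.opengis.net/kml/2.2"><Document>
<Placemark><name>Depot</name><Point><coordinates>-73.12,7.12,0</coordinates></Point></Placemark>
<Placemark><name>Shelter 1</name><Point><coordinates>-73.11,7.13</coordinates></Point></Placemark>
</Document></kml>"""

print(read_placemarks(SAMPLE))
```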

3.1. Artificial bee colony (ABC) algorithm

The solution method is based on the Artificial Bee Colony (ABC) algorithm. The ABC algorithm was proposed by Karaboga et al. [19], who define an artificial hive made up of a communication zone, or dance area, and three types of bees (employed, onlooker and scout bees). The ABC algorithm is iterative and starts by generating food sources, which represent solutions to the problem. Employed bees are assigned to these sources and evaluate the quality of their nectar (the fitness function). In each iteration the employed bees use neighbourhood operators to look for food sources (solutions) close to the one they are assigned to, and evaluate the nectar quality of the new source. If the nectar quality is better at the new source, that is, if the fitness function has improved, the bee migrates to it. Once the employed bees finish exploiting their food sources, they share the information with the onlooker bees. Using the information found by the employed bees, the onlooker bees explore nearby food sources with local search operators, evaluate the nectar quality of those sources and finally select the food sources of best quality. This procedure is repeated until the stopping criterion (kmax) is met.

Figure 2: Pseudocode of the ABC algorithm, adapted from [37]

In this research we propose a modification of the original ABC algorithm that incorporates operators typical of evolutionary algorithms. A strategy of basic arc-exchange moves is used to exploit the generated solutions and to extend the search space of the agents (bees) towards new regions of the solution space. Figure 2 shows the pseudocode of the proposed algorithm.
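Since the pseudocode figure is not reproduced here, the general shape of an ABC loop can be sketched as follows. This is a simplified, generic skeleton written for illustration, not the authors' modified algorithm: it has an employed phase and a roulette-driven onlooker phase, omits the scout phase, and is run on a toy permutation problem. All names (`abc_minimize`, `swap`, `cost`) and parameter values are assumptions.

```python
import random

def abc_minimize(cost, initial, neighbor, n_sources=5, k_max=200, seed=0):
    rng = random.Random(seed)
    # Food sources: here, perturbed copies of one initial solution.
    sources = [initial] + [neighbor(initial, rng) for _ in range(n_sources - 1)]
    best = min(sources, key=cost)
    for _ in range(k_max):
        # Employed bees: try one neighbour per source, keep it if it is better.
        for i, s in enumerate(sources):
            cand = neighbor(s, rng)
            if cost(cand) < cost(s):
                sources[i] = cand
        # Onlooker bees: pick sources by roulette wheel (lower cost, higher weight).
        weights = [1.0 / (1.0 + cost(s)) for s in sources]
        for _ in range(n_sources):
            i = rng.choices(range(n_sources), weights=weights)[0]
            cand = neighbor(sources[i], rng)
            if cost(cand) < cost(sources[i]):
                sources[i] = cand
        best = min(sources + [best], key=cost)
    return best

def swap(sol, rng):
    """Neighbourhood move: exchange two positions of a permutation."""
    i, j = rng.sample(range(len(sol)), 2)
    out = list(sol)
    out[i], out[j] = out[j], out[i]
    return out

def cost(sol):
    """Toy objective: total distance walked along points on a line."""
    return sum(abs(a - b) for a, b in zip(sol, sol[1:]))

start = [3, 0, 4, 1, 2]
result = abc_minimize(cost, start, swap)
print(result, cost(result))
```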

In the ABC algorithm, solutions to the problem are represented by food sources that the bees explore and exploit. For the CVRP studied here, a solution is represented as an array of n + m + 1 positions, where n is the number of customers and m is the number of routes. Each 0 in the array marks the start and end of a route, that is, the departure from and return to the depot, and a number between 1 and n denotes a visit to a customer on the corresponding route. Figure 3 shows an example of this representation. The initial solution was generated with the nearest neighbour heuristic presented in [31]. This heuristic builds a solution by using the proximity of locations to link a set of customers distributed in space.
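The array representation described above can be decoded into routes as in the following sketch. The example array is hypothetical (it is not the one in Figure 3), and the function name `split_routes` is an assumption.

```python
def split_routes(solution):
    """Split a 0-delimited solution array into one customer list per route."""
    routes, current = [], []
    for node in solution[1:]:      # skip the leading depot marker
        if node == 0:              # a 0 closes the current route
            routes.append(current)
            current = []
        else:
            current.append(node)
    return routes

# n = 5 customers and m = 2 routes give 5 + 2 + 1 = 8 positions.
solution = [0, 3, 1, 0, 2, 5, 4, 0]
print(split_routes(solution))  # [[3, 1], [2, 5, 4]]
```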

Figure 3: Example of the solution representation

The heuristic builds the routes sequentially, starting from the depot and choosing the node closest to the current node as the next node to add to the route. Node proximity is inspected iteratively: at each step the neighbourhood of the current node is examined to choose the node to insert into the route, while checking that the capacity constraint is satisfied. The process ends when every node has been assigned to a route.
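A capacity-aware nearest neighbour construction along these lines might look like the sketch below. It assumes Euclidean coordinates and picks the nearest unvisited customer that still fits in the vehicle; how [31] or the authors handle a nearest node that does not fit is not stated, so that rule is an assumption, as are the names and the sample data.

```python
import math

def nearest_neighbor_routes(coords, demand, capacity):
    """coords[0] is the depot; returns routes as lists of customer indices."""
    unvisited = set(range(1, len(coords)))
    routes = []
    while unvisited:
        route, load, current = [], 0, 0   # every route starts at the depot
        while True:
            # Candidates are unvisited customers that still fit in the vehicle.
            fits = [c for c in unvisited if load + demand[c] <= capacity]
            if not fits:
                break
            nxt = min(fits, key=lambda c: math.dist(coords[current], coords[c]))
            route.append(nxt)
            load += demand[nxt]
            unvisited.remove(nxt)
            current = nxt
        if not route:
            raise ValueError("a customer demand exceeds the vehicle capacity")
        routes.append(route)
    return routes

# Hypothetical instance: depot at the origin, four customers, capacity 8.
coords = [(0, 0), (1, 0), (2, 0), (0, 3), (0, 4)]
demand = [0, 4, 4, 4, 4]
print(nearest_neighbor_routes(coords, demand, capacity=8))  # [[1, 2], [3, 4]]
```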

To explore new food sources (solutions), a neighbourhood operator based on the evolutionary crossover operator is used. In this operator, the best sub-route of the current solution according to a travel-time criterion is selected and inherited by the new solution. The remaining sub-routes of the current solution are modified by randomly exchanging customers between sub-routes (swap operator) and are then inherited by the new solution. Figure 4 shows an example of this procedure.
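One possible reading of this exploration operator is sketched below. It assumes that the "best" sub-route is the one with the shortest travel time and that a single random swap is applied between two of the remaining sub-routes; neither detail is specified in the text. The function `explore` and the stand-in `route_time` are hypothetical.

```python
import random

def explore(routes, route_time, rng):
    """Keep the best sub-route unchanged and swap two customers between two others."""
    best = min(range(len(routes)), key=lambda r: route_time(routes[r]))
    new = [list(r) for r in routes]          # copy, so the current solution is untouched
    others = [r for r in range(len(new)) if r != best and new[r]]
    if len(others) >= 2:
        a, b = rng.sample(others, 2)
        i, j = rng.randrange(len(new[a])), rng.randrange(len(new[b]))
        new[a][i], new[b][j] = new[b][j], new[a][i]
    return new

# Hypothetical data: the sum of customer labels stands in for travel time.
current = [[1, 2], [3, 4], [5, 6]]
candidate = explore(current, route_time=sum, rng=random.Random(1))
print(candidate)
```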

To exploit the current solutions, local search operators follow 2-opt and 3-opt approaches, and the choice between them depends on the length of the generated sub-route. Not all possible new routes are evaluated, because the heuristic must satisfy the constraints and conditions of the problem; the solutions to be exploited by the onlooker bees are therefore chosen by roulette wheel selection.
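To make the 2-opt idea concrete, the sketch below applies segment reversals to a single route until no reversal shortens it. It is a textbook 2-opt, not the authors' implementation; the travel-time matrix is a hypothetical example with the depot (node 0) and three customers placed on a line.

```python
def two_opt(route, time):
    """route excludes the depot (node 0); time[a][b] is the travel time from a to b."""
    def total(r):
        path = [0] + r + [0]
        return sum(time[a][b] for a, b in zip(path, path[1:]))
    best = list(route)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reverse the segment between positions i and j.
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if total(cand) < total(best):
                    best, improved = cand, True
    return best

# Hypothetical times: node k sits at position k on a line.
time = [[abs(a - b) for b in range(4)] for a in range(4)]
print(two_opt([2, 1, 3], time))  # [1, 2, 3]
```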

Figure 4: Example of the exploration operators

At each step, the food sources are evaluated with the fitness function defined in (3.1). This function evaluates the cumulative distribution time and penalizes violations of the vehicle capacity constraints. Here X_ijk is a binary variable that indicates whether the k-th vehicle is assigned to arc (i, j), T_ijk is the travel time of the k-th vehicle on arc (i, j), λ is the cost of violating the capacity constraint, d_i is the demand of customer i, Y_ik is a binary variable that indicates whether customer i is assigned to vehicle k, and C is the capacity of each vehicle k. A maximum number of iterations, given as an input parameter to the algorithm, is used as the stopping criterion.

$$\sum_{i=0}^{n} \sum_{j=0}^{n} \sum_{k=1}^{K} T_{ijk} \cdot X_{ijk} + \lambda \cdot \sum_{i=0}^{n} \sum_{k=1}^{K} \max\{d_i \cdot Y_{ik} - C,\ 0\} \qquad (3.1)$$
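A penalized objective in the spirit of (3.1) can be sketched as follows. Note the assumption: the penalty here is applied to the total load of each route above the capacity C, which is one reading of the capacity penalty; the function name, the time matrix and the penalty value are hypothetical.

```python
def fitness(routes, time, demand, capacity, penalty):
    """Total travel time plus a penalty for load above capacity on each route."""
    total_time, overload = 0.0, 0.0
    for route in routes:
        path = [0] + route + [0]                      # leave from and return to the depot
        total_time += sum(time[a][b] for a, b in zip(path, path[1:]))
        load = sum(demand[c] for c in route)
        overload += max(load - capacity, 0)           # zero when the route is feasible
    return total_time + penalty * overload

# Hypothetical instance: depot 0 and two customers.
time = [[0, 2, 3], [2, 0, 1], [3, 1, 0]]
demand = [0, 5, 7]
print(fitness([[1, 2]], time, demand, capacity=10, penalty=100))    # 206.0
print(fitness([[1], [2]], time, demand, capacity=10, penalty=100))  # 10.0
```

With a large enough penalty, the infeasible single-route solution scores worse than the feasible two-route one even though its travel time is shorter.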

To evaluate the performance of the ABC algorithm and check the consistency of its results, the following benchmark problems were used: A-n37-K6, A-n32-K5, B-n31-K5, B-n43-K6, P-n16-K8, P-n22-K2, E-n22-K4, E-n23-K3 and E-n33-K4, obtained from http://neo.lcc.uma.es/vrp/vrp-instances/capacitated-vrp-instances/. The selected problems are symmetric and use Euclidean distances. The ABC algorithm was implemented in version 1.5 of Go, the open-source programming language developed by Google (Go 1.0 was released in 2012). The tests were run on a computer with a 64-bit Core i7 processor at 3.1 GHz and 8 GB of RAM. Each problem was solved using 200, 500, 1000 and 2000 iterations. In each case, the solution with the smallest gap with respect to the best solution reported in the literature for that problem is chosen. The percentage gaps found in each case are shown in Figure 5. All the gaps are below 20 %.
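The percentage gap is presumably the usual relative difference with respect to the best known value; a minimal sketch under that assumption, with hypothetical numbers rather than results from the paper:

```python
def gap_percent(found, best_known):
    """Relative difference, in percent, between a found cost and the best known cost."""
    return 100.0 * (found - best_known) / best_known

# Hypothetical values: a solution costing 110 against a best known cost of 100.
print(gap_percent(110, 100))  # 10.0
```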

Figure 5: Percentage difference with respect to the optimal solution

4. Case study

Next, the distribution of resources after a seismic disaster in the city of Bucaramanga is studied using the web platform created by the authors. The case study required not only setting the model parameters but also collecting the information needed to present the results of the algorithm. We propose a scenario in which the full capacity of the temporary shelters is used over a weekly distribution period.

4.1. Description of the case

This case study focuses on the city of Bucaramanga (Colombia), located in the north-east of the country. Bucaramanga lies 50 km from one of the most seismically active zones in the world, the cluster of intermediate-depth seismic foci known as the “Bucaramanga seismic nest” [7].

The distribution network consists of 72 customer nodes, which correspond to the candidate shelters. The 72 shelters comprise 47 public educational institutions, 17 green areas (parks and woodland) and 8 sports areas (arenas, stadiums and courts). These 72 nodes were geolocated on the map of Bucaramanga with Google Maps (see Figure 6). A profile was drawn up for each node. It records the comuna and neighbourhood where the node is located, the type of shelter (educational, green area or sports), its total area in square metres, the percentage of common areas or tree cover (depending on the type of shelter, this percentage is deducted from the total area because it is reserved for the basic services the shelter needs to operate properly) and, finally, its capacity as the number of people it can house (see Figure 7). To estimate the demand for supplies d_i at each shelter in the network, the design capacity is estimated, that is, the maximum number of people the shelter could house (δ) given the type of facility. Table 1 shows the formulas used to quantify shelter demand, where A is the total area of the shelter and ε is a parameter giving the percentage of the area allocated to the common areas (kitchen, bathrooms, dining rooms, corridors) needed to run the shelter. For shelters classed as green areas, this parameter includes the wooded area. Finally, β is the minimum area allotted to one person in a shelter (3.5 m²) according to the International Red Cross. For educational institutions, demand is calculated from two factors: the number of students enrolled in the institution (µ), provided by the city's education department, and the area allotted to each student in a classroom (α), approximately 1.65 m² according to the Ministry of National Education.

Figure 6: Map of the aid distribution network

Figure 7: Shelter profile

Sports facilities        Green areas              Educational facilities
δ = (A − A·ε) / β        δ = (A − A·ε) / β        δ = (µ·α) / β

Table 1: Formulas for calculating shelter capacity
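The formulas in Table 1 translate directly into code. In the sketch below, β = 3.5 m² and α = 1.65 m² are the values given in the text; the areas, the value of ε and the enrolment figure are hypothetical, and whether the authors round the result down to a whole number of people is not stated.

```python
BETA = 3.5    # minimum area per person in a shelter, in m2
ALPHA = 1.65  # classroom area per student, in m2

def capacity_open_area(area, eps, beta=BETA):
    """Sports facilities and green areas: usable area divided by area per person."""
    return (area - area * eps) / beta

def capacity_school(students, alpha=ALPHA, beta=BETA):
    """Educational facilities: classroom area implied by enrolment, per person."""
    return students * alpha / beta

# Hypothetical shelter of 7000 m2 with half of it reserved for common areas.
print(capacity_open_area(7000, 0.5))   # 1000.0
# Hypothetical school with 700 enrolled students.
print(round(capacity_school(700)))     # 330
```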

Type of facility    Number    Total sheltered
Educational         47        29,086
Green areas         17        9,718
Sports areas        8         4,785
Total               72        43,589

Table 2: Total number of people sheltered by type of shelter


The total estimated demand by type of facility is shown in Table 2. The distribution of shelters across the 17 comunas of the city is shown in Table 3.

Comuna  Name                Educational inst.  Parks  Sports centres
17      Mutis               3                  0      1
16      Lagos de Cacique    2                  0      1
15      Centro              2                  5      0
14      Morrorico           1                  0      0
13      Oriental            5                  4      2
12      Cabecera del Llano  1                  2      0
11      Sur Giron           1                  0      0
10      Provenza            3                  1      0
9       Pedregosa           1                  0      0
8       Suroccidente        0                  0      0
7       La Ciudadela        4                  1      1
6       La Concordia        7                  0      0
5       Garcia Rovira       3                  2      1
4       Occidental          4                  0      0
3       San Francisco       6                  2      0
2       Nororiental         1                  0      0
1       Norte               3                  0      2
Total                       47                 17     8

Table 3: Distribution of shelter types by comuna

The city currently has a single distribution centre (CEDI). The resources to be distributed are aid kits packed in cardboard boxes measuring 34.2 cm x 28.4 cm x 24 cm. Each kit supplies the basic goods needed to feed a family of 5 for one week, which is the planning horizon of the problem. The CEDI can store approximately 9,300 kits. The fleet used to distribute the aid consists of trucks with a capacity of 608 kits; given this capacity, each vehicle can meet the demand of approximately 3,040 people. Demand is assumed to be always fully served, so we calculate the number of vehicles needed to distribute the 8,718 kits that meet total demand (43,589 people in temporary shelters). Given the number of kits carried by each vehicle, approximately 15 vehicles are required to distribute the resources. This work assumes that the roads are usable, that is, passable after the disaster.
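The fleet-sizing arithmetic in this paragraph can be reproduced in a few lines. The figures are those given in the text; rounding up to whole kits and whole vehicles is an assumption that matches the reported values.

```python
import math

PEOPLE = 43589          # total people in temporary shelters
PEOPLE_PER_KIT = 5      # one kit feeds a family of 5 for a week
KITS_PER_TRUCK = 608    # truck capacity
CEDI_STORAGE = 9300     # approximate storage capacity of the distribution centre

kits = math.ceil(PEOPLE / PEOPLE_PER_KIT)             # 8718 kits
vehicles = math.ceil(kits / KITS_PER_TRUCK)           # 15 vehicles
people_per_vehicle = KITS_PER_TRUCK * PEOPLE_PER_KIT  # 3040 people per truck

print(kits, vehicles, people_per_vehicle, kits <= CEDI_STORAGE)
```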


5. Computational results and analysis

A web tool named “BEE-UIS”, based on the Google Maps geographic information system, was designed to present the solution of the problem. The tool has four icons: a location icon, which geolocates the shelters and the distribution centre; a demand icon, for entering the demand of each shelter in the network; an icon for entering the characteristics of the transport fleet (number and capacity of vehicles); and an icon that runs the ABC algorithm to generate the solution to the distribution problem, which is drawn on the city map. A list lets the user select the route to display on screen, which makes the solution easier to read. Figure 8 shows the interface of the tool, which runs the ABC algorithm and projects onto the city map the routes that solve the post-seismic-disaster resource distribution problem for Bucaramanga.

Figure 8: Interface of the BEE-UIS tool

Table 4 presents the best solution found for the case study. Fifteen routes are generated; route 1 visits 9 shelters and is the route that can serve its assigned demand fastest. Figure 9 shows the path the truck must follow on route 1, and Table 5 lists the sequence of shelters visited together with their demands. In general, the routes visit shelters that are relatively close to one another, as a result of the approach used in the nearest neighbour heuristic.


Route           Shelters assigned  Time [minutes]  Demand [people]
1               9                  23.20           2,993
2               7                  45.36           2,907
3               7                  37.31           2,769
4               7                  49.18           2,841
5               4                  28.56           2,919
6               5                  40.50           2,948
7               4                  61.16           2,921
8               3                  24.63           2,635
9               4                  33.40           2,951
10              5                  36.58           3,011
11              6                  59.68           2,973
12              3                  44.78           2,906
13              4                  51.95           3,005
14              2                  45.25           2,986
15              2                  51.95           2,824
Total           72                 633.49          43,589
Mean            4.8                42.23           2,905.93
Std. deviation  2.04               11.72           102.34
CV              42.6 %             27.7 %          3.5 %

Table 4: Summary of results for the Bucaramanga scenario
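The summary rows of Table 4 can be recomputed from the per-route values. The sketch below does so for the time column, using the sample standard deviation, which is the convention that appears to reproduce the reported 11.72; the variable names are illustrative.

```python
import statistics

# Route times in minutes, copied from Table 4 (routes 1 to 15).
times = [23.20, 45.36, 37.31, 49.18, 28.56, 40.50, 61.16, 24.63,
         33.40, 36.58, 59.68, 44.78, 51.95, 45.25, 51.95]

total = sum(times)
mean = statistics.mean(times)
sd = statistics.stdev(times)   # sample standard deviation (n - 1 in the denominator)
cv = 100.0 * sd / mean         # coefficient of variation, in percent

print(f"total={total:.2f} mean={mean:.2f} sd={sd:.2f} cv={cv:.1f}%")
```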

Figure 9: Route 1 in the solution generated by BEE-UIS


N   Shelters visited           Demand
0   Distribution centre        0
1   Colegio Sagrado Corazon    120
2   Parque Cristo Rey          876
3   Instituto Comuneros        336
4   Colegio Americano          94
5   Parque San Francisco       711
6   Colegio San Francisco      101
7   Parque Antonia Santos      617
8   Colegio Psicopedagogico    60
9   Colegio Francisco Virrey   78
0   Distribution centre        0

Table 5: Sequence of visits on route 1

The shelters can be served within the 72 hours regarded as the standard response time in disaster relief, since the longest aid distribution route takes approximately 61.16 minutes. In addition, the coefficient of variation of the demand served per route is low (3.5 %), which indicates that the solution makes good use of vehicle capacity, even though the variation in the number of shelters visited and in distribution times exceeds 20 %.

6. Conclusions and future work

The most important contribution of this research is a web tool, based on the Google Maps geographic information system and written in the Go programming language, that implements the ABC algorithm to solve a resource distribution problem modelled as a CVRP. Although the ABC algorithm gives acceptable results when validated on instances from the literature, the strategies implemented to improve the solutions are not efficient enough. Nevertheless, the algorithm finds feasible, good-quality solutions according to the metric established for the proposed case study. The resulting distribution times fall within the standard time for disaster response.

Future work should consider uncertainty in the problem data, since information during a disaster is imprecise, for example the availability of arcs. Including vulnerability parameters for the road network can therefore yield more realistic solutions in distribution models. With regard to uncertainty, future problems can be treated as an SVRP (Stochastic Vehicle Routing Problem) to handle uncertain demand, among other parameters that are usually assumed to be known during a disaster. Other VRP variants can also be considered, for example the CVRPTW (Capacitated Vehicle Routing Problem with Time Windows), useful for evacuating civilians by transport in the disaster response phase, or the PVRP (Periodic Vehicle Routing Problem). These would cover different scenarios found in the supply chain during natural disasters and come closer to the real conditions of the problem (different priority levels, route availability). New models can also take into account the different types of land, air and sea vehicles often used in disaster relief. Moreover, some of the aid supplied during disasters consists of perishable products, which call for special considerations in a distribution model, such as expiry dates or special transport. Finally, we recommend using GIS (Geographic Information Systems) together with combinatorial optimization problems to build decision-support tools for disaster management, such as Decision Support Systems (DSS).

Acknowledgements

This research was carried out within project No. 1806, “Un framework para la gestión de sistemas de atención de emergencias de catástrofes naturales mediante el problema de localización-ruteo con ventanas de tiempo blandas LPRT” (a framework for managing natural-catastrophe emergency response systems through the location-routing problem with soft time windows, LPRT), which was funded by the Universidad Industrial de Santander, whose support is gratefully acknowledged.

References

[1] Adivar B. and Mert A. (2010). International disaster relief planning with

fuzzy credibility. Fuzzy Optimization and Decision Making,9, 413-433.

[2] Afsar H., Prins C. and Santos A. (2014). Exact and heuristic algorithms

for solving the generalized vehicle routing problem with flexible fleet size.

International Transactions in Operational Research,21, 153-175.

[3] Anaya-Arenas A., Renaud J. and Ruiz A. (2014). Relief distribution net-

works: a systematic review. Annals of Operations Research,223, 53-79.

[4] Balcik B., Beamon B. and Smilowitz K. (2008). Last mile distribution in

humanitarian relief. Journal of Intelligent Transportation Systems,12, 51-

63.

[5] Barbarosoglu G. and Arda Y. (2004). A two-stage stochastic programming

framework for transportation planning in disaster response. Journal of the

operational research society,55, 43-53.


[6] Chang F., Wu J., Lee C. and Shen H. (2014). Greedy-search-based multi-

objective genetic algorithm for emergency logistics scheduling. Expert Sys-

tems with Applications,41, 2947-2956.

[7] Coral-Gomez C. (1990). La Convergencia de Placas en el Noroccidente Suramericano y el Origen del Nido de Bucaramanga. Rev. Acad. Colombiana de Ciencias Exactas, Físicas y Naturales, 17, 521-529.

[8] Dantzig G. and Ramser J. (1959). The truck dispatching problem. Manage-

ment science,6, 80-91.

[9] De Aragao M. and Uchoa E. (2003). Integer program reformulation for ro-

bust branch-and-cut-and-price algorithms. Mathematical Program in Rio: a

Conference in Honour of Nelson Maculan, 56-61.

[10] Dell’Amico M., Righini G. and Salani M. (2006). A branch-and-price ap-

proach to the vehicle routing problem with simultaneous distribution and

collection. Transportation Science,40, 235-247.

[11] Fikar, C., Gronalt, M., and Hirsch, P. (2016). A decision support system for

coordinated disaster relief distribution. Expert Systems with Applications,57,

104-116.

[12] Gatica, G., Contreras-Bolton, C., Venegas, N., Opazo, O., et al. (2017). Una aplicación web, para asignación y ruteo de vehículos en caso de desastres. Iteckne, 14(1), 62-69.

[13] Hamedi M., Haghani A. and Yang S. (2012). Reliable transportation of

humanitarian supplies in disaster response: model and heuristic. Procedia-

Social and Behavioral Sciences,54, 1205-1219.

[14] Holguın-Veras J., Taniguchi E., Ferreira F., et al. (2012). The Tohoku

Disasters: Preliminary Findings Concerning The Post Disaster Humanita-

rian Logistics Response. Annual meeting of Transportation Research Board,

Transportation Research Board, Washington, DC (USA).

[15] Hu Z. (2011). A container multimodal transportation scheduling approach

based on immune affinity model for emergency relief. Expert Systems with

Applications,38, 2632-2639.

[16] Huang A., Ma A., Schmidt S., Xu N., Zhang B., et al. (2013). Integration

of Real Time Data in Urban Search and Rescue. Center For The Commer-

cialization Of Innovative Transportation Technology, Transportation Center,

Northwestern University.


[17] Huang M., Smilowitz K., and Balcik B. (2013). A continuous approximation

approach for assessment routing in disaster relief. Transportation Research

Part B: Methodological,50, 20-41.

[18] Jotshi A., Gong Q., and Batta R. (2009). Dispatching and routing of emer-

gency vehicles in disaster mitigation using data fusion. Socio-Economic Plan-

ning Sciences,43, 1-24.

[19] Karaboga D., Akay B., and Ozturk C. (2007). Artificial bee colony (ABC)

optimization algorithm for training feed-forward neural networks. Interna-

tional Conference on Modeling Decisions for Artificial Intelligence, 318-329.

[20] Lin Y.-H., Batta R., Rogerson P. A., et al. (2012). Location of temporary

depots to facilitate relief operations after an earthquake. Socio-Economic

Planning Sciences,46, 112-123.

[21] Liu M. and Zhao L. (2012). An integrated and dynamic optimisation model

for the multi-level emergency logistics network in anti-bioterrorism system.

International Journal of Systems Science,43, 1464-1478.

[22] Luis E., Dolinskaya I. and Smilowitz K. (2012). Disaster relief routing:

Integrating research and practice. Socio-economic planning sciences,46, 88-

97.

[23] Nagy G. and Salhi S. (2005). Heuristic algorithms for single and multiple

depot vehicle routing problems with pickups and deliveries. European journal

of operational research,162, 126-141.

[24] Najafi M., Eshghi K. and De Leeuw S. (2014). A dynamic dispatching and

routing model to plan/re-plan logistics activities in response to an earthqua-

ke. OR spectrum,36, 323-356.

[25] Najafi M., Eshghi K. and Dullaert W. (2013). A multi-objective robust

optimization model for logistics planning in the earthquake response phase.

Transportation Research Part E: Logistics and Transportation Review,49,

217-249.

[26] Ozdamar L. and Ertem M. (2015). Models, solutions and enabling tech-

nologies in humanitarian logistics. European Journal of Operational Re-

search,244, 55-65.

[27] Panapinun K. and Charnsethikul P. (2005). Vehicle and scheduling pro-

blems: A case study of food distribution in greater Bangkok. Bangkok: Ka-

setsart University, Bangkok (Thailand).

[28] Pisinger D. and Ropke S. (2010). Large neighborhood search. Handbook of

metaheuristics, Springer, Boston.


[29] Rivera J., Afsar H. and Prins C. (2015). A multistart iterated local search

for the multitrip cumulative capacitated vehicle routing problem. Compu-

tational Optimization and Applications,61, 159-187.

[30] Rodriguez, J., Vitoriano, B., and Montero, J. (2012). A general metho-

dology for data based rule building and its application to natural disaster

management. Computers & Operations Research,39(4), 863-873.

[31] Rosenkrantz D., Stearns R., Lewis I., et al. (1977). An analysis of several

heuristics for the traveling salesman problem. SIAM journal on computing,6,

563-581.

[32] Saadatseresht M., Mansourian A. and Taleai M. (2009). Evacuation plan-

ning using multiobjective evolutionary optimization approach. European

Journal of Operational Research,198, 305-314.

[33] Safeer M., Anbuudayasankar, S., Balkumar K., et al. (2014). Analyzing

Transportation and Distribution in Emergency Humanitarian Logistics. Pro-

cedia Engineering,97, 2248-2258.

[34] Shen Z., Dessouky M. and Ordonez F. (2009). A two–stage vehicle routing

model for large–scale bioterrorism emergencies. Networks,54, 255-269.

[35] Sheu J. (2007). An emergency logistics distribution approach for quick res-

ponse to urgent relief demand in disasters. Transportation Research Part E:

Logistics and Transportation Review,43, 687-709.

[36] Sheu J. (2010). Dynamic relief-demand management for emergency logis-

tics operations under large-scale disasters. Transportation Research Part E:


[37] Szeto, W., Wu, Y., and Ho S. (2011). An artificial bee colony algorithm for

the capacitated vehicle routing problem. European Journal of Operational

Research,215(1), 126-135.

[38] Toth, P., and Vigo, D. (2002). The vehicle routing problem. Society for

Industrial and Applied Mathematics.

[39] Tzeng G., Cheng H. and Huang T. (2007). Multi-objective optimal plan-

ning for designing relief delivery systems. Transportation Research Part E:


[40] Vitoriano B., Ortuno T. and Tirado G. (2009). HADS, a goal program-

ming–based humanitarian aid distribution system. Journal of Multi–Criteria

Decision Analysis,16, 55-64.


[41] Vitoriano B., Ortuno M., Tirado G. et al. (2011). A multi-criteria optimi-

zation model for humanitarian aid distribution. Journal of Global Optimiza-

tion,51, 189-208.

[42] Wallace W. A. and De Balogh F. (1985). Decision support systems for

disaster management. Public Administration Review, 134-146.

[43] Widener M. and Horner M. (2011). A hierarchical approach to modeling hu-

rricane disaster relief goods distribution. Journal of Transport Geography,19,

821-828.

[44] Wohlgemuth, S., Oloruntoba R. and Clausen U. (2012). Dynamic vehicle

routing with anticipation in disaster relief. Socio-Economic Planning Scien-

ces,46, 261-271.

[45] Yuan Y. and Wang D. (2009). Path selection model and algorithm for

emergency logistics management. Computers & Industrial Engineering,56,

1081-1094.

[46] Zhang X., Zhang Z., Zhang Y.,et al. (2013). Route selection for emergency

logistics management: A bio-inspired algorithm. Safety science,54, 87-91.

[47] Zhao, M., and Liu, X. (2018). Development of decision support tool for op-

timizing urban emergency rescue facility locations to improve humanitarian

logistics management. Safety science,102, 110-117.

About the authors

Henry Lamos received a PhD in Mathematical Physics in 1997 from Lomonosov Moscow State University, Russia; an MSc in Computer Science in 1990 from the Universidad Industrial de Santander, Colombia; an MSc in Mathematics in 1982 from the Friendship University, Moscow, Russia; and a degree in Mathematics in 1981 from the same university. He is a research professor in the Grupo de Investigación en Optimización y Organización de Sistemas Productivos y Logísticos (OPALO). UIS-UNAB belong to the Asociación Colombiana de Investigación de Operaciones (ASOCIO). He is a full professor at the Universidad Industrial de Santander, Colombia, attached to its Escuela de Estudios Industriales y Empresariales.

Karin Aguilar graduated in Industrial Engineering in 2013 from the Universidad Industrial de Santander, Colombia, and completed an MSc in Industrial Engineering at the same university in 2017. She worked as a research professional for the Grupo de Investigación en Optimización y Organización de Sistemas Productivos y Logísticos (OPALO) during 2014 and 2015. She is currently a lecturer affiliated with the OPALO group, attached to the Escuela de Estudios Industriales y Empresariales of the Universidad Industrial de Santander, Colombia.

Daniel Martínez graduated in Industrial Engineering in 2014 from the Universidad Industrial de Santander, Colombia, and completed an MSc in Industrial Engineering at the same university in 2017. He is currently a lecturer affiliated with the Grupo de Investigación en Optimización y Organización de Sistemas Productivos y Logísticos (OPALO), attached to the Escuela de Estudios Industriales y Empresariales of the Universidad Industrial de Santander, Colombia.

Andrés Barrera graduated in Industrial Engineering in 2016 from the Universidad Industrial de Santander, Colombia. He is currently a specialist in Marketing Management at the Universidad del Rosario, Colombia.

Angie Hernández graduated in Industrial Engineering in 2016 from the Universidad Industrial de Santander, Colombia. She currently works as a support professional for the internal control model at the municipal government of San José de Miranda (Santander, Colombia).


Official Statistics

Quality implications of the use of big data in tourism

statistics: three exploratory examples

Fernando Cortina García, María Izquierdo Valverde, Jesús Prado Mascuñano and María Velasco Gimeno

National Statistical Institute (INE)


B [email protected], B [email protected]

Abstract

Tourism statistics is one of the subject areas currently being considered in the ESS (European Statistical System) as a potential field for developing the use of big data, in order to improve the relevance, timeliness and punctuality of the products offered under the quality standards of official statistics. In Spain, data from traffic loops and traffic control cameras are already being used to estimate inbound tourists. The paper presents three pilot studies on the use of big data and the integration of multiple sources: credit cards, mobile phones and web scraping (to collect prices of package tours and of their components).

Keywords: big data, traffic loops, mobile phones, credit cards, package

tours, net valuation, web scraping.

1. Introduction

Tourism is one of the most dynamic industries in many economies. According to the World Tourism Organization (WTO)¹, it represents 10% of world GDP and one in ten jobs, and international tourist arrivals increased by 7% to 1.3 billion in 2017. Estimates of tourism flows come mainly from border-crossing and accommodation statistics, and from household surveys of the resident population. The interregional component of this phenomenon makes comparability an essential feature of data reliability, and comparability has been developed within the framework of WTO and Eurostat manuals, guides and recommendations. Thus these primary sources of information, which have been providing data for many years, have a strong methodological basis and meet high quality standards.

¹ Source: International Tourism Arrivals infographics, WTO. http://media.unwto.org/content/infographics (extracted 18th July 2018).


However, in a world where mobility has risen to its highest levels within a few years and border controls have disappeared between neighbouring countries, as in the Schengen Area in Europe, border-crossing surveys are becoming more costly and difficult to conduct, and many countries are looking for alternative and complementary information.

In this context, data generated not from purely statistical sources but from

events intimately linked to the tourism phenomenon appear as a source of information

that can improve the relevance, timeliness and punctuality of the

products offered under the quality standards of official statistics. Examples of

these new data sources are registers from traffic loops and traffic control cameras

capturing flows of vehicles, records of mobile phones travelling from one place

to another, activity of credit cards during a trip, among others.

Eurostat identified the potential of these sources for tourism statistics and

launched in 2012 a project on the use of mobile data. Access to this information

was identified as one important barrier to make its use feasible (Eurostat, 2014).

Since then Task Forces on Big Data have been launched both at European and at

national level to coordinate different projects and initiatives. Spain participates

in an ESSnet pilot project on the use of mobile positioning data for official

statistics whose first objective is obtaining access to the data. In the meantime,

some preliminary analyses have been carried out within the Spanish System of

Tourism Statistics, which will be presented in the next sections.

2. Traffic loops and traffic control cameras

The first experience of INE using big data in tourism statistics is related to the task of building the frame of people crossing the borders by road. Due to the Schengen Agreement, there is no control over people who cross the border from France or Portugal to Spain (and vice versa).

The register of traffic loops provides the total number of vehicles that cross the border at each road crossing, in both directions (entering and leaving Spain), by hour and classified according to their length (short, medium and long). This information is supplemented by the traffic control cameras installed at the border (both registers are managed by the Directorate-General for Traffic). The camera register is a complete database of number plates of vehicles entering the country. Combining both sets of big data we can estimate the monthly number of foreign vehicles entering Spain, broken down by vehicle nationality.

The next step towards estimating the number of persons entering the country is transforming the Vehicles Frame into a Travelers Frame. To this end, sample data on vehicles by type of vehicle, nationality (of number plate) and number of occupants per vehicle are collected. From this information, an occupancy rate of vehicles by type of vehicle and nationality is calculated. Combining both sets of data yields the Travelers Frame mentioned above.
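The step from vehicles to travelers can be sketched as follows. This is only a minimal illustration, assuming that vehicle counts and sampled occupancy rates are both keyed by (vehicle type, nationality); the function name, the keys and all figures are hypothetical and do not come from INE's production system.

```python
from collections import defaultdict

def travelers_frame(vehicle_counts, occupancy_rates):
    """Estimate travelers per nationality from vehicle counts and occupancy rates.

    vehicle_counts:  {(vehicle_type, nationality): number of vehicles}
    occupancy_rates: {(vehicle_type, nationality): average occupants per vehicle}
    Both structures are assumptions made for this sketch.
    """
    travelers = defaultdict(float)
    for (vehicle_type, nationality), n_vehicles in vehicle_counts.items():
        rate = occupancy_rates[(vehicle_type, nationality)]
        travelers[nationality] += n_vehicles * rate
    return dict(travelers)

# Made-up illustrative figures, not real data.
vehicle_counts = {("short", "FR"): 1000, ("long", "FR"): 100, ("short", "PT"): 500}
occupancy_rates = {("short", "FR"): 2.5, ("long", "FR"): 1.2, ("short", "PT"): 2.0}
frame = travelers_frame(vehicle_counts, occupancy_rates)
```

In practice the occupancy rates would carry sampling error, which this sketch ignores.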


This is the general scheme followed to obtain the basic information needed to estimate the number of foreign visitors (tourists and same-day visitors) who come to Spain every month by road. In this case we do not face problems related to different definitions in the register of traffic loops and that of traffic control cameras: the unit counted in both registers is the vehicle crossing the border. However, the coverage of the two data sets is sometimes not exactly the same, owing to technical problems that are being solved.

Tracking an anonymized number plate through the camera registers database

will allow new studies about same-day visitors.

3. Mobile positioning data

Mobile phones connect to cell towers with a defined geographical coverage.

Mobile phones connected to the network generate events that are recorded in a

database associated to the cell phone ID. Tracking an ID in the events database

gives information about mobility of the cell phone.

These events can be classified in two categories: active and passive:

• Active events: those generated when the subscriber makes or receives a

phone call, sends or receives a text message, or switches the device on or

off.

• Passive events: those generated when the telephone is not active. Location

of inactive cell phones is known through ‘location updates’ sent by the

network. Passive events can be generated randomly (when the telephone

changes from one LAC, a group of cell towers controlled by the same base

controller, to another) or periodically, every four hours.

The combination of both kinds of events is especially relevant in the case of international tourists, because it allows a much wider population to be analysed. At the time of this project, the system employed to obtain positioning data used both active and passive events generated in the 2G and 3G layers.

For a first approximation to the use of mobile phone positioning data, an ad hoc extraction from the events database of one of the most important mobile network operators (MNOs) in Spain was defined for analysis. The objective was to measure the number of tourists, both residents and non-residents, and their average stay, broken down by region of destination (NUTS 2) and region/country of origin. We compared the results obtained with those derived from official statistics. In the case of residents, data were provided only for August 2014. For non-residents, data for August 2013 are also available, making comparisons over time possible.

The first step was to identify tourists within the whole events database. To define them, the internationally accepted definitions were adapted to the possibilities of the database.


Tourism is defined in the regulation 692/2011 on European tourism statis-

tics as the activity of visitors taking a trip to a main destination outside their

usual environment, for less than a year, for any main purpose, other than to

be employed by a resident entity in the place visited. Usual environment is the

geographical area, not necessarily a contiguous one, within which an individual

conducts his regular life routines.

Tourism includes trips with overnight stays and same-day visits. This study focuses only on trips with overnight stays. An overnight stay is defined in this project as a stay of more than 24 hours in a region (NUTS 2) of destination. In the case of residents, if the region of destination is also the one of origin, the stay must take place in a municipality (LAU 2) different from the one of residence.

In practice, the standard definition has been adapted to consider as a tourist

every mobile phone staying at least two consecutive days in a region of destina-

tion, the stay comprising an eight-hour period between 22:00 hours of the day

of arrival and 08:00 of the next day. For resident mobile phones, residence is as-

signed empirically, taking into account the different places where the cell phone

has made an overnight stay (between 0:00 and 8:00) in the last six months.
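The operational rule above can be sketched in code. The following is one possible reading, not the operator's actual implementation: a stay qualifies if it reaches the calendar day after arrival and overlaps the window from 22:00 on the arrival day to 08:00 on the next day for at least eight hours. The function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

def is_tourist_stay(arrival, departure, min_night_hours=8):
    """Check whether a stay in a destination region counts as a tourist stay.

    Assumed reading of the rule: the stay spans at least two consecutive
    calendar days and covers `min_night_hours` hours inside the window
    22:00 (arrival day) to 08:00 (next day).
    """
    # The stay must reach at least the calendar day after arrival.
    if departure.date() <= arrival.date():
        return False
    arrival_day = datetime.combine(arrival.date(), datetime.min.time())
    night_start = arrival_day + timedelta(hours=22)
    night_end = arrival_day + timedelta(days=1, hours=8)
    # Overlap between the stay and the night window (may be negative).
    overlap = min(departure, night_end) - max(arrival, night_start)
    return overlap >= timedelta(hours=min_night_hours)

long_stay = is_tourist_stay(datetime(2014, 8, 1, 18, 0), datetime(2014, 8, 2, 10, 0))
short_night = is_tourist_stay(datetime(2014, 8, 1, 23, 30), datetime(2014, 8, 2, 6, 0))
day_visit = is_tourist_stay(datetime(2014, 8, 1, 9, 0), datetime(2014, 8, 1, 20, 0))
```

The empirical assignment of residence (overnight locations over the last six months) is a separate step and is not covered by this sketch.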

In the case of residents, when we compare mobile positioning data (MPD) with survey data (FAMILITUR), Table 1 shows a quite similar distribution of trips among the regions of destination. The main differences are found in Madrid, where the percentage of tourists identified from MPD is three times larger than in the survey. Madrid being a city with a large metropolitan area, this could indicate the need for a more accurate definition of usual environment. In fact, taking into account the place of residence, Figure 1 below shows that data from MPD present a higher proportion of intraregional trips than the survey in all the regions represented, with Madrid showing the largest difference.

Large differences are found for the variable average stay (Fig. 2). In aggregate terms, mobile data show an average of 13.5 nights for residents' trips to a destination in Spain, while the survey estimate is 8.7 nights. Once again, the definitions of tourist and usual environment seem to be the cause of these discrepancies.

Comparing MPD for non-resident mobiles with survey data (FRONTUR-EGATUR), the distributions of tourists by country of origin present slight differences (Table 2). The most important countries are the United Kingdom, France and Germany. In both sources Germany ranks third, but the UK and France swap positions.

Analysing the average stay, the differences are significant (Table 3): the survey estimate is always much higher, just the opposite of what is found for residents. For non-residents, usual environment is not expected to be a general problem, although border areas and residents with would require a separate analysis. These low averages from MPD could be explained by the fact that different legs of the same trip for the survey are considered as different trips in MPD.

Figure 1: Distribution of trips by destination (residents). August 2014.

Figure 2: Share of interregional trips by destination (residents). August 2014.

Figure 3: Average stay by region of destination (residents). August 2014.

Figure 4: Distribution of international tourists by country of origin. August 2014.

Figure 5: Average stay by country of origin. August 2014.

4. Data from credit cards

Another source of information being explored is data recorded by the electronic payment system of one of the most important banks in Spain. In the case of residents, we analyse records of all payments made by the bank's clients at any point-of-sale (POS) terminal and of ATM withdrawals made with a card issued by the bank. Only cash payments and those made with a card from any other entity are out of the scope of the study. For non-residents, the available information comes from payments or withdrawals at POS terminals or ATMs in the BBVA network, so we have a more partial view of their activity in Spain.

As in the previous case, aggregated results with a high level of detail were

provided following INE’s indications to obtain the information. Direct work

with the database was carried out by the bank.

Residence of the card holder is available based on the information provided by the client. POS terminals are geolocated. In the study, every payment in a municipality different from the declared residence of the card holder has been considered tourism expenditure. In addition, each POS terminal is associated with an economic activity, so that expenditures can be broken down into different categories: travel agencies, food and beverages, accommodation, shopping, recreation, transport, cash withdrawals and other expenditures. Monthly data are available from January 2013 to December 2014 for both residents and non-residents.
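The classification rule just described can be illustrated with a short sketch. The record layout (`card_id`, `pos_municipality`, `category`, `amount`) and all figures are hypothetical; the bank's actual database schema is not described in the text.

```python
from collections import defaultdict

def tourism_expenditure(payments, residence_by_card):
    """Sum, by expenditure category, the payments made outside the
    declared municipality of residence of the card holder.

    payments:          list of dicts (assumed layout, see lead-in)
    residence_by_card: {card_id: declared municipality of residence}
    """
    totals = defaultdict(float)
    for payment in payments:
        home = residence_by_card[payment["card_id"]]
        # Payments in the home municipality are not tourism expenditure.
        if payment["pos_municipality"] != home:
            totals[payment["category"]] += payment["amount"]
    return dict(totals)

# Made-up illustrative records, not real data.
residence_by_card = {"A": "Madrid", "B": "Sevilla"}
payments = [
    {"card_id": "A", "pos_municipality": "Madrid", "category": "shopping", "amount": 50.0},
    {"card_id": "A", "pos_municipality": "Valencia", "category": "accommodation", "amount": 120.0},
    {"card_id": "A", "pos_municipality": "Valencia", "category": "food and beverages", "amount": 30.0},
    {"card_id": "B", "pos_municipality": "Madrid", "category": "accommodation", "amount": 80.0},
]
totals = tourism_expenditure(payments, residence_by_card)
```

A rule this simple classifies every purchase in a neighbouring municipality as tourism, which is one way the usual-environment problem discussed for mobile data can reappear here.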

In the case of residents, when comparing average expenditures per trip, Figure 3 shows higher values for the credit card data (CCD) series. Coverage of expenditure is not the same in both sources: official statistics measure expenditures made during the trip and those made for it before it takes place, while CCD should reflect only expenditures made at the destination, that is, during the trip. Besides, CCD include neither cash payments nor those made with other cards. Consequently, CCD average expenditures were expected to be lower than survey results. Such a difference must be due to methodological causes. Further analysis by type of expenditure might provide a clue to these discrepancies, and an empirical determination of the place of residence of the card holder should also provide better results.

Figure 6: Average expenditure per trip/card (residents). August 2014.

Another aspect we observe in Figure 3 is that seasonality seems to be less pronounced in the CCD series. One reason underlying this different pattern could be the fact that in official statistics a trip is assigned to the month in which it ends, as are the expenditures made during or for the trip, while credit card records may be assigned to the actual date on which the payment takes place.

5. Net valuation of package tours: quality limitations of

multisource methods and big data approach

One of the specific aspects of the Tourism Satellite Account is the treatment

of package tours. Unlike the central framework of the National Accounts, the

package tour should be unpacked and should not be treated as a product itself

but as the sum of its components. Each of the components of a package tour,

including the value of the service offered by the tour operator and travel agency,

is considered to be purchased directly by visitors.

The estimation of each of the components of the package tours is complex

for several reasons:

• Information on the costs or prices is generally very sensitive and difficult

to provide by informants.

• In the case of inbound tourism, tour operators are usually non-resident companies from which it is very difficult to obtain information (they are not required by their national laws to respond to questionnaires from foreign institutions).

• In general, tour operators negotiate a set price with suppliers for various products, and it is often very difficult to differentiate the price of each; for example, hotels are paid an amount that includes accommodation, breakfast, sometimes internet, etc.

• It is difficult to estimate the percentage of the tourist package that belongs to the home economy and the percentage to be counted in the destination economy.

Considering all of the above limitations, the National Statistics Institute of Spain is developing a pilot exercise using web scraping techniques to obtain price information for the products offered by tour operators on the web, whether individually (transport, accommodation, etc.) or together (tourist packages), in order to obtain a cost structure of the components of the package and to improve the quality standards of the estimates. To do so, a couple of tour operators operating online have been selected, and their travel offer to one or more destinations (Canary Islands) will be analysed for a fixed period (one week) and other similar characteristics. Hopefully the enormous possibilities offered by the Internet and the use of big data will provide information for the so-called "unbundling" of tourist packages.
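Once prices have been collected, a cost structure could be derived along the following lines. This is a sketch under strong assumptions that are not stated in the text: that separately advertised component prices are acceptable proxies for their value inside the package, and that the residual is attributed to the tour operator and travel agency service. All names and figures are hypothetical.

```python
def package_cost_structure(package_price, component_prices):
    """Share of each component in the package price.

    Assumption of this sketch: the difference between the package price
    and the sum of component prices is attributed to the service of the
    tour operator and travel agency.
    """
    components_total = sum(component_prices.values())
    structure = {
        name: price / package_price for name, price in component_prices.items()
    }
    structure["tour_operator_margin"] = (package_price - components_total) / package_price
    return structure

# Made-up illustrative prices for a one-week package, not scraped data.
structure = package_cost_structure(
    1000.0, {"transport": 300.0, "accommodation": 500.0, "meals": 100.0}
)
```

If packages are sold below the sum of their separately priced components, the residual share becomes negative, which would signal that the proxy assumption does not hold for that offer.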

6. Conclusions

The new sources of information are really promising as they can provide accurate and timely information about the tourism phenomenon, allowing a more detailed geographical analysis than that currently permitted by official statistics.

Nevertheless, important and coordinated efforts have to be made by statistical authorities and data providers to obtain results with the quality standards already achieved by current official statistics.

The examples presented in this paper try to show that an in-depth conceptualization exercise should be made, first to identify the phenomenon to be measured and second to ensure comparability over time and between countries. To test the most adequate definitions and parameters, assess the impact of changes and monitor the consistency of the decisions finally adopted, official statisticians need full control of the original databases: how they are processed, every assumption made, etc., with the highest level of detail. Of course, they must be aware of any incident occurring in the systems and its possible implications. In summary, if direct access to the databases is not possible, detailed metadata should be delivered with the information requested, and fluent communication with data providers is essential during the whole process, but especially in these initial phases.


References

[1] Ahas R., Armoogum J., Esko S., Ilves M., Karus R., Madre JL., Nurmi O., Potier F., Schumucker D., Sonntag Y., Tiru M. (2014). Feasibility Study on the Use of Mobile Positioning Data for Tourism Statistics, Consolidated Report, Publications Office of the European Union. EUROSTAT.

[2] Regulation (EU) No 692/2011 of the European Parliament and of the Coun-

cil of 6 July 2011 concerning European statistics on tourism.

[3] United Nations, World Tourism Organization, Eurostat, OECD (2010),

Tourism Satellite Account: Recommended Methodological Framework 2008,

United Nations publication.

About the authors

Fernando Cortina García is a senior statistician at the National Statistics Institute (INE) of Spain. Public official of the Senior Corps of Statisticians since

1991, he has a broad experience in official statistics. He is currently Deputy

Director of Unit, in S.G. of Tourism, Science and Technology Statistics. Span-

ish representative in international forums and organizations related to tourism,

as UNWTO. Participation in training seminars for statisticians, focusing on

tourism statistics. He has a University degree in Economics, specializing in

Public Administration.

María Izquierdo Valverde García is a senior statistician at the National

Statistics Institute (INE) of Spain. Public official of the Senior Corps of Statis-

ticians since 2004, she has a broad experience in official statistics, mainly in

household surveys. She is currently in charge of the measurement of domestic

and outbound tourism. Spanish representative in international forums as Euro-

stat and UNWTO. Participation in training seminars for statisticians, focusing

on tourism statistics. She has a University degree in Mathematics, specializing

in Statistics and Operational Research.

Jesús Prado Mascuñano is a Head of Unit at the National Statistics Institute (INE) of Spain. He is currently working at the S.G. for Tourism, Science

and Technology Statistics. He has developed his activity in the tourist admin-

istration and the INE. He has a University degree in Economics, specializing in

Public Administration.

María Velasco Gimeno is a senior statistician at the National Statistics Institute (INE) of Spain. Public official of the Senior Corps of Statisticians since

2001, she has a broad experience in official statistics. She is currently Head


of Unit, in charge of Tourist Expenses Survey (S.G. of Tourism, Science and

Technology Statistics). Spanish representative in international forums and or-

ganizations related to tourism, as UNWTO. Participation in training seminars

for statisticians, focusing on tourism statistics. She has a University degree in

Mathematics.


Historia y Enseñanza

Multivariate continuous probability distributions and

partial differential equations: A simple and nice

connection

Julia Calatayud Gregori, Juan Carlos Cortés López

and Marc Jornet Sanz

Departamento de Matemática Aplicada

Instituto Universitario de Matemática Multidisciplinar

Universitat Politècnica de València


B [email protected]

Abstract

We propose a simple method to introduce some relevant multivariate

continuous distributions by establishing and solving specific partial differ-

ential equations satisfied by their corresponding joint survival functions.

The approach is based upon elementary ideas belonging to hazard function

theory.

Keywords: multivariate probability density function, partial differential

equations, hazard function theory.

AMS Subject classifications: 60E05, 62N99, 35A09.

1. Motivation

In his famous paper [5], Pearson introduced his celebrated family of frequency curves, or probability density functions (pdfs). The members of this family arise as solutions of the following differential equation, known as Pearson's differential equation,

\[
\rho'(x) = \frac{q(x)}{p(x)}\,\rho(x), \quad \text{where} \quad
\begin{cases}
q(x) = x + a_0, \\
p(x) = b_0 + b_1 x + b_2 x^2.
\end{cases}
\tag{1.1}
\]

Depending on the degree, $\partial(p(x))$, of the polynomial $p(x)$ in the denominator, that is, on whether it is constant ($b_1 = b_2 = 0$), linear ($b_2 = 0$) or quadratic, and in the last case on whether the discriminant $D = b_1^2 - 4 b_0 b_2$ is positive, negative or zero, Pearson's differential equation has, essentially, five types of solution:

• If $\partial(p(x)) = 0$: it can be shown that $\rho(x)$ is the pdf of a Gaussian distribution.

• If $\partial(p(x)) = 1$: it can be shown that $\rho(x)$ belongs to the family of Gamma distributions.

• If $\partial(p(x)) = 2$ and $D = 0$: the pdf $\rho(x)$ has the form
\[ \rho(x) = C x^{-\alpha} \exp(-\beta/x), \]
where $C$ is an appropriate normalization constant.

• If $\partial(p(x)) = 2$ and $D < 0$: the pdf $\rho(x)$ can be expressed as
\[ \rho(x) = C (1 + x^2)^{-\alpha} \exp(-\beta \arctan(x)), \]
where $C$ is an appropriate normalization constant. In particular, the rescaled Student's t distribution belongs to this family.

• If $\partial(p(x)) = 2$ and $D > 0$: in this case, the pdf $\rho(x)$ can be written in the form
\[ \rho(x) = C x^{\alpha-1} (1 - x)^{\beta-1}, \]
where $C$ is an appropriate normalization constant. Clearly, the beta distribution belongs to this family.

The results above are easy to verify. As an illustrative example, we check the first case (the remaining cases can be checked in a similar way). Indeed, for

\[
\rho(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R},
\]

we have

\[
\frac{\rho'(x;\mu,\sigma^2)}{\rho(x;\mu,\sigma^2)} = \frac{q(x)}{p(x)}, \quad \text{where} \quad
\begin{cases}
q(x) = x - \mu, \\
p(x) = -\sigma^2,
\end{cases}
\]

that is, Pearson's differential equation (1.1) holds with $b_0 = -\sigma^2$, $b_1 = b_2 = 0$ and $a_0 = -\mu$. A detailed study of Pearson's equation (1.1) can be found in [3].
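The Gaussian case can also be checked numerically. The sketch below compares a central finite-difference approximation of $\rho'(x)/\rho(x)$ with $q(x)/p(x)$ for $a_0 = -\mu$, $b_0 = -\sigma^2$, $b_1 = b_2 = 0$; the parameter values, the evaluation points and the helper names are arbitrary choices made for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def log_derivative(pdf, x, h=1e-5):
    """Central finite-difference approximation of rho'(x) / rho(x)."""
    return (pdf(x + h) - pdf(x - h)) / (2 * h * pdf(x))

mu, sigma2 = 1.0, 4.0
a0, b0, b1, b2 = -mu, -sigma2, 0.0, 0.0

def pearson_rhs(x):
    """Right-hand side q(x) / p(x) of Pearson's equation (1.1)."""
    return (x + a0) / (b0 + b1 * x + b2 * x ** 2)

# Largest discrepancy between both sides over a few test points.
max_gap = max(
    abs(log_derivative(lambda t: gaussian_pdf(t, mu, sigma2), x) - pearson_rhs(x))
    for x in [-3.0, -1.0, 0.0, 0.5, 2.0, 5.0]
)
```

A small `max_gap` is consistent with (1.1); it is a numerical sanity check, not a proof.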

Pdfs are usually presented as statistical patterns that model certain types of phenomena (the Beta distribution to model the random behaviour of percentage variables or proportions; the Gaussian distribution to model heights and weights of individuals; the Exponential and Gamma distributions to describe waiting times; etc.). From a teaching standpoint, Pearson's equation makes it possible to introduce some important families of continuous statistical distributions in connection with the field of Ordinary Differential Equations. We consider this aspect particularly important from an educational point of view, and this is why in this paper we propose introducing some multivariate pdfs in connection with the solution of partial differential equations. The ideas presented here are inspired by the earlier article by Cortés et al. [1], where some pdfs are derived as solutions of certain ordinary differential equations. Specifically, that work introduced the Uniform, Exponential, Weibull and Pareto distributions from properties analogous to conditions P1 and P2, detailed below, whereas here, as we shall see, we work in the context of hazard theory, making use of the survival function.

Let $\vec{X} = (X_1, \ldots, X_n)$ be a random vector and denote by

\[ S(\vec{x}) = \mathbb{P}(X_1 > x_1, \ldots, X_n > x_n) \]

its survival function, where $\vec{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$. This function determines the probability law of the random vector $\vec{X}$ and, in dimension $n = 1$, it is related to the distribution function, $F$, through $F(\vec{x}) = 1 - S(\vec{x})$. Its use is common in biostatistics, owing to the interest in studying the survival time of an individual; see, for example, [6, Chapter 1].

In this paper, we shall see examples of absolutely continuous random vectors $\vec{X}$ for which, starting from the properties

• P1: $\mathbb{P}(X_1 \ge x_1^0, \ldots, X_n \ge x_n^0) = 1$,

• P2: $\mathbb{P}(x_i \le X_i \le x_i + dx_i \mid X_1 \ge x_1, \ldots, X_n \ge x_n) = g_i(\vec{x}, dx_i)$, $i = 1, \ldots, n$,

we determine the survival function explicitly by solving a partial differential equation.

In the setting of absolutely continuous vectors, the survival function $S(\vec{x})$ is closely linked to the pdf $f(\vec{x})$ through the relations

\[ S(\vec{x}) = \int_{x_1}^{\infty} \cdots \int_{x_n}^{\infty} f(y_1, \ldots, y_n)\, dy_n \cdots dy_1 \]

and

\[ \frac{\partial^n S}{\partial x_1 \cdots \partial x_n}(\vec{x}) = (-1)^n f(\vec{x}). \]

Unlike in [1], we now work with the survival function instead of the distribution function, for convenience in the subsequent development. Note, moreover, that because of P2 the survival function appears natural in this context: if $X_1, \ldots, X_n$ represented survival times, $g_i(\vec{x}, dx_i)$ would give the probability of instantaneous death (when $dx_i \approx 0$) of the $i$-th individual, conditional on all individuals having survived times $x_1, \ldots, x_n$, respectively (see the concept of hazard function in [2]). By the definition of conditional probability,

\[
g_i(\vec{x}, dx_i) = \frac{\mathbb{P}(X_1 \ge x_1, \ldots, X_{i-1} \ge x_{i-1},\ x_i \le X_i \le x_i + dx_i,\ X_{i+1} \ge x_{i+1}, \ldots, X_n \ge x_n)}{\mathbb{P}(X_1 \ge x_1, \ldots, X_n \ge x_n)}.
\]

The denominator of this fraction equals $S(\vec{x})$, by definition. The numerator, in turn, equals $S(\vec{x}) - S(\vec{x} + dx_i \vec{e}_i)$, where $\vec{e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ is the $i$-th vector of the canonical basis. We thus arrive at the relation between the functions $g_i$ and $S$:

\[
g_i(\vec{x}, dx_i) = -\frac{S(\vec{x} + dx_i \vec{e}_i) - S(\vec{x})}{S(\vec{x})}, \quad i = 1, \ldots, n. \tag{1.2}
\]

Next we shall see that, knowing P1 and P2 and assuming that the limit in the above equation exists as $dx_i \to 0$, we can determine the survival function $S(\vec{x})$. The results are shown for some multivariate continuous distributions, without attempting to exhaust all possibilities, since the reasoning presented is systematic, which is an advantage from an educational point of view. On the other hand, the approach has a limitation: not all multivariate continuous distributions can be obtained in this way. In any case, this is also a limitation of Pearson's differential equation and of the work presented in [1].

2. Multivariate exponential distribution

In P1 take $\vec{x}^0 = (x_1^0, \ldots, x_n^0) = (0, \ldots, 0)$ and in P2 take $g_i(\vec{x}, dx_i) = \lambda_i\, dx_i$, $i = 1, \ldots, n$, where $\lambda_1, \ldots, \lambda_n$ are positive constants. In lifetime terminology, this means that the relative risk of instantaneous death of the $i$-th individual is a constant $\lambda_i$ throughout life, since $g_i(\vec{x}, dx_i)$ does not depend on the age $\vec{x}$. This is closely related to the memoryless property of the exponential distribution (there is no memory of how long the $i$-th individual has lived). In fact, we shall see that the components of $\vec{X}$ are independent, with exponential distribution $\mathrm{Exp}(\lambda_i)$ for $i = 1, \ldots, n$.

Indeed, starting from equation (1.2),

\[
\lambda_i\, dx_i = -\frac{S(\vec{x} + dx_i \vec{e}_i) - S(\vec{x})}{S(\vec{x})}
\;\Rightarrow\;
\lambda_i S(\vec{x}) = -\frac{S(\vec{x} + dx_i \vec{e}_i) - S(\vec{x})}{dx_i}.
\]

Taking limits as $dx_i \to 0$ and using the definition of partial derivative, we arrive at the partial differential equations

\[ \lambda_i S(\vec{x}) = -\partial_i S(\vec{x}), \quad i = 1, \ldots, n, \tag{2.1} \]

where $\partial_i S(\vec{x}) = \frac{\partial S(x_1, \ldots, x_i, \ldots, x_n)}{\partial x_i}$ denotes the partial derivative of the survival function $S(\vec{x})$ with respect to the $i$-th variable, $x_i$.

For $i = 1$ in (2.1), we obtain

\[ \lambda_1 S(\vec{x}) = -\partial_1 S(\vec{x}), \]

and we solve it using the method of separation of variables for ordinary differential equations, as a function of the variable $x_1$ and keeping the remaining variables fixed:

\[ S(\vec{x}) = c_1(x_2, \ldots, x_n)\, e^{-\lambda_1 x_1}, \tag{2.2} \]

where $c_1(x_2, \ldots, x_n)$ stands for any function independent of $x_1$ or, in other words, a constant with respect to $x_1$. Taking $i = 2$ in (2.1),

\[ \lambda_2 S(\vec{x}) = -\partial_2 S(\vec{x}), \]

and substituting our solution (2.2) for $S(\vec{x})$ and simplifying, we obtain

\[ \lambda_2 c_1(x_2, \ldots, x_n) = -\partial_2 c_1(x_2, \ldots, x_n). \tag{2.3} \]

Solving (2.3) by the method of separation of variables for ordinary differential equations, with variable $x_2$ and the rest fixed, we arrive at

\[ c_1(x_2, \ldots, x_n) = c_2(x_3, \ldots, x_n)\, e^{-\lambda_2 x_2}. \]

Substituting this expression into (2.2), we obtain

\[ S(\vec{x}) = c_2(x_3, \ldots, x_n)\, e^{-\lambda_1 x_1 - \lambda_2 x_2}. \]

Taking (2.1) for $i = 3$ and proceeding analogously up to $i = n$, we conclude that

\[ S(\vec{x}) = c\, e^{-\sum_{i=1}^{n} \lambda_i x_i}, \]

for some constant $c$, which is determined using P1: $1 = S(0, \ldots, 0) = c$. This fully determines the survival function

\[ S(\vec{x}) = e^{-\sum_{i=1}^{n} \lambda_i x_i}, \]

which corresponds to a random vector $\vec{X}$ whose components $X_1, \ldots, X_n$ are independent and exponentially distributed, $X_i \sim \mathrm{Exp}(\lambda_i)$, for $i = 1, \ldots, n$.
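As a numerical illustration of relation (1.2) for this example, the sketch below evaluates $g_i(\vec{x}, dx_i)/dx_i$ from the survival function $S(\vec{x}) = e^{-\sum_i \lambda_i x_i}$ and compares it with $\lambda_i$. The rates, the evaluation point, the increment and the helper names are arbitrary choices made for illustration.

```python
import math

def survival_exp(x, lam):
    """Survival function of independent exponentials with rates lam."""
    return math.exp(-sum(rate * xi for rate, xi in zip(lam, x)))

def hazard_increment(S, x, i, dx):
    """g_i(x, dx_i) from (1.2): -(S(x + dx_i e_i) - S(x)) / S(x)."""
    shifted = list(x)
    shifted[i] += dx
    return -(S(shifted) - S(x)) / S(x)

lam = [0.5, 2.0, 1.5]
x = [0.3, 1.0, 0.7]
dx = 1e-6

S = lambda point: survival_exp(point, lam)
# For small dx, g_i / dx_i should be close to the constant rate lambda_i.
ratios = [hazard_increment(S, x, i, dx) / dx for i in range(len(lam))]
```

Changing `x` leaves the ratios essentially unchanged, which reflects the memoryless behaviour mentioned above.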


3. Multivariate uniform distribution

In P1 take $\vec{x}^0 = (x_1^0, \ldots, x_n^0) = (a_1, \ldots, a_n)$ and

\[ g_i(\vec{x}, dx_i) = \frac{dx_i}{b_i - x_i}, \quad x_i \in (a_i, b_i), \quad i = 1, \ldots, n, \]

where $a_i < b_i$, $1 \le i \le n$, are real numbers. We shall see that $\vec{X}$ is a uniform random vector on the multidimensional rectangle $[a_1, b_1] \times \cdots \times [a_n, b_n]$.

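From (1.2),

\[
\frac{1}{b_i - x_i}\, dx_i = -\frac{S(\vec{x} + dx_i \vec{e}_i) - S(\vec{x})}{S(\vec{x})}
\;\Rightarrow\;
\frac{1}{b_i - x_i}\, S(\vec{x}) = -\frac{S(\vec{x} + dx_i \vec{e}_i) - S(\vec{x})}{dx_i}.
\]

Taking limits as $dx_i \to 0$, we obtain the partial differential equations

\[ \frac{1}{b_i - x_i}\, S(\vec{x}) = -\partial_i S(\vec{x}), \quad i = 1, \ldots, n. \tag{3.1} \]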

We proceed as in the previous example. We start with $i = 1$,

\[ \frac{1}{b_1 - x_1}\, S(\vec{x}) = -\partial_1 S(\vec{x}), \]

from which, using separation of variables in $x_1$ with $x_2, \ldots, x_n$ fixed, we arrive at

\[ S(\vec{x}) = c_1(x_2, \ldots, x_n)(b_1 - x_1), \tag{3.2} \]

where $c_1(x_2, \ldots, x_n)$ is a function that is constant with respect to $x_1$. We now take (3.1) with $i = 2$ and substitute the function obtained in (3.2):

\begin{align*}
\frac{1}{b_2 - x_2}\, S(\vec{x}) = -\partial_2 S(\vec{x})
&\Rightarrow \frac{1}{b_2 - x_2}\, c_1(x_2, \ldots, x_n) = -\partial_2 c_1(x_2, \ldots, x_n) \\
&\Rightarrow c_1(x_2, \ldots, x_n) = c_2(x_3, \ldots, x_n)(b_2 - x_2) \\
&\Rightarrow S(\vec{x}) = c_2(x_3, \ldots, x_n)(b_1 - x_1)(b_2 - x_2),
\end{align*}

where $c_2(x_3, \ldots, x_n)$ is a function independent of $x_1$ and $x_2$. Proceeding in this way up to $i = n$, we deduce that $S(\vec{x})$ must be

\[ S(\vec{x}) = c\,(b_1 - x_1) \cdots (b_n - x_n), \]

for some constant $c$ to be determined. Imposing the initial condition $1 = S(a_1, \ldots, a_n) = c\,(b_1 - a_1) \cdots (b_n - a_n)$, the value of $c$ is determined:

\[ c = \frac{1}{(b_1 - a_1) \cdots (b_n - a_n)}. \]


The survival function follows,

\[ S(\vec{x}) = \frac{(b_1 - x_1) \cdots (b_n - x_n)}{(b_1 - a_1) \cdots (b_n - a_n)}, \]

which corresponds to that of a uniform random vector on the $n$-dimensional rectangle $[a_1, b_1] \times \cdots \times [a_n, b_n]$.
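The closed-form survival function of the uniform vector can be compared with a simulation. The following sketch draws points uniformly on a rectangle and estimates $\mathbb{P}(X_1 > x_1, \ldots, X_n > x_n)$ empirically; the rectangle, the evaluation point, the sample size and the seed are arbitrary choices made for illustration.

```python
import random

def survival_uniform(x, a, b):
    """Closed-form survival function on the rectangle [a1,b1] x ... x [an,bn]."""
    prod = 1.0
    for xi, ai, bi in zip(x, a, b):
        prod *= (bi - xi) / (bi - ai)
    return prod

def empirical_survival(x, a, b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(X_1 > x_1, ..., X_n > x_n)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.uniform(ai, bi) for ai, bi in zip(a, b)]
        if all(si > xi for si, xi in zip(sample, x)):
            hits += 1
    return hits / n_samples

a, b = [0.0, 1.0], [2.0, 4.0]
x = [0.5, 2.0]
exact = survival_uniform(x, a, b)
estimate = empirical_survival(x, a, b)
```

The two values should agree up to Monte Carlo error, which shrinks as `n_samples` grows.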

4. Gumbel bivariate exponential distribution

Take $n = 2$, the initial condition in P1 as $(x_1^0, x_2^0) = (0, 0)$ and property P2 as

\[
\begin{cases}
g_1(x_1, x_2, dx_1) = (1 + \theta x_2)\, dx_1, \\
g_2(x_1, x_2, dx_2) = (1 + \theta x_1)\, dx_2,
\end{cases}
\]

where $\theta > 0$. We shall see that the survival function is given by

\[
S(x_1, x_2) =
\begin{cases}
e^{-(x_1 + x_2 + \theta x_1 x_2)}, & x_1, x_2 > 0, \\
e^{-x_1}, & x_1 > 0,\ x_2 \le 0, \\
e^{-x_2}, & x_2 > 0,\ x_1 \le 0, \\
1, & x_1, x_2 \le 0.
\end{cases}
\tag{4.1}
\]

This is one of the many existing extensions of the exponential distribution to the two-dimensional case in which the marginals are not independent. By definition, the term bivariate exponential refers to bivariate distributions whose marginals are exponentially distributed. This is the case of (4.1), since

\[ S(x_1) = \lim_{x_2 \to -\infty} S(x_1, x_2) = e^{-x_1}. \]

See [2, page 350].

In terms of statistical applications, the Gumbel bivariate exponential distribution (as well as variations of it, see [2, Chapter 47]) can be used to model pairs of non-independent random phenomena $X_1$ and $X_2$ when each of them follows an exponential distribution. As an example we cite [4], where the warranty cost of motorcycles is estimated using the Gumbel bivariate exponential distribution.

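From (1.2),

\[
\begin{cases}
(1 + \theta x_2)\, dx_1 = -\dfrac{S(x_1 + dx_1, x_2) - S(x_1, x_2)}{S(x_1, x_2)}, \\[2ex]
(1 + \theta x_1)\, dx_2 = -\dfrac{S(x_1, x_2 + dx_2) - S(x_1, x_2)}{S(x_1, x_2)},
\end{cases}
\]

\[
\Rightarrow
\begin{cases}
(1 + \theta x_2)\, S(x_1, x_2) = -\dfrac{S(x_1 + dx_1, x_2) - S(x_1, x_2)}{dx_1}, \\[2ex]
(1 + \theta x_1)\, S(x_1, x_2) = -\dfrac{S(x_1, x_2 + dx_2) - S(x_1, x_2)}{dx_2},
\end{cases}
\]

\[
\Rightarrow
\begin{cases}
(1 + \theta x_2)\, S(x_1, x_2) = -\partial_1 S(x_1, x_2), \\
(1 + \theta x_1)\, S(x_1, x_2) = -\partial_2 S(x_1, x_2).
\end{cases}
\]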

From the first equation, using separation of variables in $x_1$ with $x_2$ fixed, we obtain $S(x_1, x_2) = c(x_2)\, e^{-x_1 - \theta x_1 x_2}$, where $c(x_2)$ is a function of $x_2$. Substituting this expression into the second equation and simplifying, $c'(x_2) = -c(x_2)$. Solving by separation of variables, $c(x_2) = c\, e^{-x_2}$. Thus the survival function is given by $S(x_1, x_2) = c\, e^{-x_1 - x_2 - \theta x_1 x_2}$. Taking into account P1, which states that $1 = S(0, 0) = c$, we conclude that

\[ S(x_1, x_2) = e^{-x_1 - x_2 - \theta x_1 x_2}, \]

as desired.
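Both partial differential equations can be checked numerically on the positive quadrant. The sketch below evaluates the residuals $(1 + \theta x_2) S + \partial_1 S$ and $(1 + \theta x_1) S + \partial_2 S$ using central finite differences; the value of $\theta$, the evaluation point and the helper names are arbitrary choices made for illustration.

```python
import math

def survival_gumbel(x1, x2, theta):
    """Survival function (4.1) on the positive quadrant x1, x2 > 0."""
    return math.exp(-(x1 + x2 + theta * x1 * x2))

def partial(f, x1, x2, axis, h=1e-6):
    """Central finite-difference approximation of a first partial derivative."""
    if axis == 1:
        return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    return (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

theta = 0.5
S = lambda u, v: survival_gumbel(u, v, theta)
x1, x2 = 0.8, 1.3

# Both residuals should vanish up to discretization and rounding error.
residual_1 = (1 + theta * x2) * S(x1, x2) + partial(S, x1, x2, axis=1)
residual_2 = (1 + theta * x1) * S(x1, x2) + partial(S, x1, x2, axis=2)
```

Setting `theta = 0` recovers the independent case of Section 2 with unit rates.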

5. Bivariate Weibull distribution

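In P1, we take $(x_1^0, x_2^0) = (0, 0)$, and in P2,

\[
\begin{cases}
g_1(x_1, x_2, dx_1) = \gamma\,(\lambda_1 x_1^{\beta} + \lambda_2 x_2^{\beta})^{\gamma - 1} \lambda_1 \beta x_1^{\beta - 1}\, dx_1, \\
g_2(x_1, x_2, dx_2) = \gamma\,(\lambda_1 x_1^{\beta} + \lambda_2 x_2^{\beta})^{\gamma - 1} \lambda_2 \beta x_2^{\beta - 1}\, dx_2,
\end{cases}
\]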

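where $\beta, \gamma, \lambda_1, \lambda_2 > 0$. As we have done so far, we start from (1.2), and the first equation leads to

\[ \gamma\,(\lambda_1 x_1^{\beta} + \lambda_2 x_2^{\beta})^{\gamma - 1} \lambda_1 \beta x_1^{\beta - 1}\, S(x_1, x_2) = -\partial_1 S(x_1, x_2). \]

By separation of variables,

\[ S(x_1, x_2) = c(x_2)\, e^{-(\lambda_1 x_1^{\beta} + \lambda_2 x_2^{\beta})^{\gamma}}. \]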

Despues de algunos calculos, es sencillo comprobar que la segunda ecuacion se

reduce a c′(x2) = 0, con lo cual c(x2) = c es constante. Utilizando la condicion

inicial como hemos hecho antes se deduce que c = 1, y por tanto

S(x1, x2) = e−(λ1xβ1 +λ2x

β2 )γ .

This is the survival function of a random vector with a bivariate Weibull distribution, see [2, page 408]. As an application, we cite [7], in which wind speed is modelled by means of a multivariate Weibull distribution.
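The calculation that reduces the second equation to $c'(x_2) = 0$ can be spelled out as follows. This is a sketch of one possible route, where the shorthand $u = \lambda_1 x_1^{\beta} + \lambda_2 x_2^{\beta}$ is introduced here only for readability and does not appear in the original text. Differentiating $S(x_1, x_2) = c(x_2)\,e^{-u^{\gamma}}$ with respect to $x_2$,

$$\partial_2 S(x_1, x_2) = c'(x_2)\,e^{-u^{\gamma}} - c(x_2)\,\gamma\,u^{\gamma - 1}\,\lambda_2 \beta\, x_2^{\beta - 1}\,e^{-u^{\gamma}}.$$

The second equation of the system reads $\gamma\,u^{\gamma - 1}\,\lambda_2 \beta\, x_2^{\beta - 1}\,S(x_1, x_2) = -\partial_2 S(x_1, x_2)$. Substituting both expressions,

$$\gamma\,u^{\gamma - 1}\,\lambda_2 \beta\, x_2^{\beta - 1}\,c(x_2)\,e^{-u^{\gamma}} = -c'(x_2)\,e^{-u^{\gamma}} + c(x_2)\,\gamma\,u^{\gamma - 1}\,\lambda_2 \beta\, x_2^{\beta - 1}\,e^{-u^{\gamma}},$$

and the terms containing $c(x_2)$ cancel, leaving $c'(x_2)\,e^{-u^{\gamma}} = 0$, that is, $c'(x_2) = 0$.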


6. Conclusions

This work has presented a simple and systematic approach, based on the Survival Function, to introduce some joint probability density functions by solving certain partial differential equations. Given some random observations, one obtains the instantaneous probability of death for the i-th individual, conditional on all individuals having survived up to the present, which is related to the Survival Function through the definition of conditional probability. Taking limits leads to a partial differential equation for the Survival Function. As we have shown, this function characterizes the joint probability density of the random phenomenon. The application examples in this work include the multivariate exponential, uniform, Gumbel and Weibull distributions, frequently used in Survival Analysis, although other probability families could be studied. The approach described in this article makes it possible to motivate, in connection with partial differential equations, the definition of the hazard function for multivariate distributions usually given in Survival Analysis texts.

References

[1] Cortés J.-C., Navarro A., Sánchez A., and Calbo G. (2016). Distribuciones de probabilidad continuas y ecuaciones diferenciales ordinarias. Bol. Soc. Puig Adam, 101, 42–49.

[2] Kotz S., Balakrishnan N., and Johnson N.L. (2000). Continuous Multivariate Distributions, Volume 1: Models and Applications, Second edition, Wiley, New York (USA).

[3] Lee C., Famoye F., and Alzaatreh A.Y. (2013). Methods for generating families of univariate continuous distributions in the recent decades. WIREs Comput. Stat., 5, 219–238, Doi: 10.1002/wics.1255.

[4] Pal S., and Murthy G.S.R. (2003). An application of Gumbel's bivariate exponential distribution in estimation of warranty cost of motor cycles. Int. J. Qual. Reliab. Manag., 20 (4), 488–502, Doi: 10.1108/02656710310468650.

[5] Pearson K. (1895). Contributions to the mathematical theory of evolution, II: Skew variation in homogeneous material. Phil. Trans. Royal Soc., 186, 343–414, Doi: 10.1098/rsta.1895.0010.

[6] Smith P.J. (2002). Analysis of Failure and Survival Data, Chapman and Hall, New York (USA).

[7] Villanueva D., Feijóo A., and Pazos J.L. (2013). Multivariate Weibull Distribution for Wind Speed and Wind Power Behavior Assessment. Resour., 2, 370–384, Doi: 10.3390/resources2030370.

About the authors

Julia Calatayud Gregori. Graduate in Mathematics from the University of Valencia (UV) in 2015. She obtained a Master's degree in Advanced Mathematics from the University of Barcelona (UB) in July 2016, and a Master's degree in Statistics and Operations Research from the Polytechnic University of Catalonia (UPC) and the University of Barcelona (UB) in January 2018. She is currently a PhD student at the Polytechnic University of Valencia (UPV), at the Instituto de Matemática Multidisciplinar. Her research focuses on the study of Random Differential Equations.

Juan Carlos Cortés López. Bachelor's degree and PhD in Mathematical Sciences. Full Professor of Applied Mathematics at the Polytechnic University of Valencia (UPV). He teaches, in the Bachelor's Degree in Business Administration and Management (ADE), the courses Mathematical Models for ADE and Financial Risk Analysis, and is the professor in charge of the courses Random Differential Equations and Applications (Interuniversity Master's in Mathematical Research, UPV and University of Valencia) and Modelling and Valuation of Financial Options (Master's in Financial Management, UPV). His research focuses on Random Differential Equations and Uncertainty Quantification in Mathematical Modelling, and he carries it out at the Instituto de Matemática Multidisciplinar of the UPV, of which he is Deputy Director for Research.

Marc Jornet Sanz. Graduate in Mathematics from the University of Valencia (UV) in 2015. He obtained a Master's degree in Advanced Mathematics from the University of Barcelona (UB) in July 2016. He is currently a PhD student at the Polytechnic University of Valencia (UPV), at the Instituto de Matemática Multidisciplinar, funded by the UPV as the holder of an FPI-UPV 2017 predoctoral contract. His research focuses on the study of Random Differential Equations.


Opinions on the Profession

Natural Language Parsing: Progress and Challenges

Carlos Gómez-Rodríguez

Universidade da Coruña

FASTPARSE Lab, LyS Research Group

Departamento de Computación

Facultade de Informática, Elviña

15071 A Coruña, Spain

[email protected]

Abstract

Natural language parsing is the task of automatically obtaining the

syntactic structure of sentences written in a human language. Parsing is a

crucial step for language processing systems that need to extract meaning

from text or speech, and thus a key technology of artificial intelligence.

This article presents an outline of the current state of the art in this field,

as well as reflections on the main challenges that, in the author’s opinion, it

is currently facing: limitations in accuracy on especially difficult languages

and domains, psycholinguistic adequacy, and speed.

Keywords: Natural language parsing, syntax, artificial intelligence.

AMS Subject classifications: 68T50, 91F20.

1. Natural language parsing

Natural language processing is the branch of knowledge that investigates how machines can communicate with people using human languages. As such, it is an interdisciplinary field that can be framed both within artificial intelligence and within computational linguistics. Within this field, natural language parsing is the task of obtaining, automatically by means of a computer program, the internal structure of a sentence.

Although research on natural language parsing has several decades of history, only recently has it gone from being a promising research field to seeing widespread use in various artificial intelligence applications, such as machine translation [47, 50], recognition of textual entailment [44], learning for intelligent agents in games [2] or sentiment analysis [25, 48], and to becoming a key component of the systems deployed by large computing service companies, such as IBM [38] or Google¹.

[S [NP [N Juan]] [VP [V comió] [NP [D la] [N manzana]]]]

Figure 1: Constituency tree for a Spanish sentence. The label of each internal node indicates the type of the constituent formed by the words that descend from that node. For example, "la manzana" is a noun phrase (NP), while "comió la manzana" is a verb phrase (VP).

Juan comió la manzana (dependency arcs labelled SUBJ, OBJ and DET)

Figure 2: Dependency tree for a Spanish sentence. Each dependency is represented as an arrow that goes from one word (the head) to another (the dependent). For example, the word "Juan" depends on "comió", and its dependency type is SUBJ (subject).

The structure of sentences can be described by means of different syntactic representations, depending on the linguistic theory one follows. In constituency parsing, or phrase structure parsing, the structure of a sentence is represented by a tree that divides it into smaller units called constituents, which in turn are divided into smaller constituents until the level of individual words is reached, as in Figure 1. The other predominant representation is that of dependency parsing, in which the structure of the sentence is expressed by a tree or forest made up of directed binary relations between words, called dependencies, each of which links a head to a dependent, as in Figure 2.

Although the linguistic adequacy of each of these representations is a controversial topic among syntacticians (see for example [7, 24]), those of us who lean more towards engineering tend to focus on "whatever works". By "working", in this case, we mean being able to provide a representation of sentences that is useful to the applications that use it to extract information about the meaning of the text, and doing so as accurately and efficiently as possible. From this point of view, both representations have their merits. Dependency syntax is currently the predominant approach in computational linguistics and natural language processing, since it is arguably simpler (the resulting representation has no nodes other than the input words themselves), which makes more efficient algorithms possible, and its output provides a simple and transparent representation of the meaning of the sentence (the tree in Figure 2 tells us what action takes place in the sentence, who performs it and what receives it: Juan, the subject, ate the apple, the object). However, constituency syntax should not be ignored either: besides providing information that is different from and complementary to that represented by dependencies [26], there is the paradox that some of the best dependency parsers are actually constituency parsers [34], where a constituency tree is obtained first and then transformed into dependencies by means of heuristic rules.

¹ Google Cloud Natural Language: https://cloud.google.com/natural-language/

Regardless of the representation used, the main difficulty in getting machines to parse human languages correctly lies in one of the fundamental characteristics of those languages: their ambiguity. A given sentence can have different meanings, all of them syntactically correct. For example, in the sentence "Juan vio un hombre con un telescopio" ("Juan saw a man with a telescope"), did Juan use a telescope to see the man (as reflected in the constituency tree of Figure 3, where the prepositional phrase that mentions the telescope is independent of the noun phrase that refers to the man)? Or was it the man who was carrying a telescope (as in the constituency tree of Figure 4, where the prepositional phrase is attached to the noun phrase)? A human could disambiguate the sentence from the context (or ask, if it were not clear). In the case of a machine, we resort to probabilistic or machine learning models to try to resolve ambiguities, with the help of labelled data. We will go into this in more depth in the next section, which briefly presents different popular approaches to parsing.

2. Parsing models and state of the art

The first approaches to natural language parsing that can be considered successful, in the sense of providing useful analyses for real sentences, were statistical constituency parsers based on probabilistic context-free grammars.

[S [NP [N Juan]] [VP [V vio] [NP [D un] [N hombre]] [PP [P con] [NP [D un] [N telescopio]]]]]

Figure 3: In this interpretation of the sentence, Juan used a telescope to see a man.

[S [NP [N Juan]] [VP [V vio] [NP [D un] [N hombre] [PP [P con] [NP [D un] [N telescopio]]]]]]

Figure 4: In this interpretation of the sentence, Juan saw a man who was carrying a telescope.

Informally speaking, a context-free grammar is a description of the syntax of a given language by means of a set of rules that describe the structures that can appear in a constituency tree. For example, to describe how a verb phrase (VP) can be divided into smaller constituents, we could have a rule VP → V NP PP (which says that a constituent of type VP can have as children, from left to right, the constituents V, NP and PP, as in Figure 3) and a rule VP → V NP (stating that a VP constituent can have V and NP as children, as in Figure 4). If we write a complete context-free grammar describing the syntax of a given language, such as English, we can use a dynamic programming algorithm, such as the CKY algorithm [28] or Earley's algorithm [9], to retrieve the constituency trees that are compatible with a given sentence (see [17] for a review and formalization of various dynamic programming algorithms of this kind, their characteristics and the relations between them).
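To make the dynamic programming idea concrete, the following minimal sketch implements a CKY-style chart that counts the constituency trees compatible with a sentence. It is an illustration under stated assumptions, not the implementation of any cited parser: the toy grammar is invented for the example, and since CKY requires binary rules, the ternary rules of Figures 3 and 4 are binarized with two hypothetical auxiliary symbols, `X` (for VP → V NP PP) and `Z` (for NP → D N PP). The unary step NP → N → Juan is collapsed into a lexical entry.

```python
from collections import defaultdict

# Binary rules, indexed by right-hand side: (B, C) -> parents A with A -> B C.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("V", "X"): ["VP"],    # VP -> V X, X -> NP PP binarizes VP -> V NP PP
    ("NP", "PP"): ["X"],
    ("D", "N"): ["NP"],
    ("D", "Z"): ["NP"],    # NP -> D Z, Z -> N PP binarizes NP -> D N PP
    ("N", "PP"): ["Z"],
    ("P", "NP"): ["PP"],
}
LEXICAL = {
    "Juan": ["NP"], "vio": ["V"], "un": ["D"],
    "hombre": ["N"], "con": ["P"], "telescopio": ["N"],
}


def count_parses(words, start="S"):
    """Count the trees rooted in `start` that span the whole sentence."""
    n = len(words)
    chart = defaultdict(int)  # (i, j, A) -> number of trees for A over words[i:j]
    for i, word in enumerate(words):
        for a in LEXICAL.get(word, []):
            chart[i, i + 1, a] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for (b, c), parents in BINARY.items():
                    left = chart.get((i, k, b), 0)
                    right = chart.get((k, j, c), 0)
                    if left and right:
                        for a in parents:
                            chart[i, j, a] += left * right
    return chart.get((0, n, start), 0)


print(count_parses("Juan vio un hombre con un telescopio".split()))
```

For the ambiguous example sentence the chart finds two trees, which correspond to the two attachments of the prepositional phrase. The three nested loops over `width`, `i` and `k` are where the cubic cost of this family of algorithms comes from.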

However, parsing with pure context-free grammars has two important limitations when we move from "toy" examples to real sentences. First, building a complete grammar of a language by hand is very difficult (or even impossible, given that language is constantly evolving). Second, we need to deal with ambiguity: in many cases there can be several trees compatible with the grammar (such as those in Figures 3 and 4), and for most applications it is necessary to choose the best of them.

Both problems can be mitigated by making context-free grammars probabilistic. To do so, each rule of the form $X \to \alpha$ (where $\alpha$ is a string of symbols) is associated with a probability $P(X \to \alpha)$, such that

$$\sum_{\alpha} P(X \to \alpha) = 1.$$

For example, the rule VP → V NP PP could have probability 0.75 in a given probabilistic context-free grammar, and the rule VP → V NP probability 0.25, meaning that a verb phrase (VP) is three times more likely to be divided into V NP PP than into V NP. Note that the probabilities of all the rules that share the same left-hand side must sum to 1. In this way, a probabilistic context-free grammar is a generative parsing model, that is, a model of the joint probability $P(x, y)$ of the input and the output of the parsing process.

In practice, we can extract a probabilistic context-free grammar from a treebank [3], that is, a collection of sentences that have been annotated with their corresponding (presumably correct) syntactic tree by humans, such as the English Penn Treebank [37]. This is less costly than writing a grammar by hand, and it also gives us a simple way of obtaining maximum likelihood estimates for the rule probabilities from their relative frequencies (for example, the probability of VP → V NP PP is estimated as the number of times that VP is divided into V NP PP in the treebank, divided by the total number of occurrences of constituents of type VP). Once we have a probabilistic context-free grammar, dynamic programming algorithms such as CKY or Earley can easily be adapted to provide the most probable parse tree for a given sentence. To do so, the probability of each tree is computed as the product of the probabilities of the rules used, and the most probable tree is chosen, thus resolving the ambiguity. With the help of the hardware improvements achieved in the 1990s, this approach produced what can be considered the first practical natural language parsers, which obtained good accuracy results on real sentences [4].
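The relative-frequency estimate and the product rule described above can be sketched in a few lines. The treebank below is a hypothetical four-tree toy collection, built so that VP is divided into V NP PP three times and into V NP once, which reproduces the 0.75 and 0.25 probabilities used as an example in the text. Trees are represented as nested tuples, and the helper names are assumptions made for this illustration.

```python
from collections import Counter
from math import prod


def rules(tree):
    """Yield (lhs, rhs) for every node of a tree given as nested tuples."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        yield label, tuple(children)          # lexical rule, e.g. N -> Juan
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)


def estimate(treebank):
    """Maximum likelihood estimate: count(X -> alpha) / count(X)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}


def tree_probability(tree, probs):
    """Probability of a tree: product of the probabilities of its rules."""
    return prod(probs[r] for r in rules(tree))


def noun_phrase(det, noun):
    return ("NP", ("D", det), ("N", noun))


juan = ("NP", ("N", "Juan"))
with_pp = ("S", juan, ("VP", ("V", "vio"), noun_phrase("un", "hombre"),
                       ("PP", ("P", "con"), noun_phrase("un", "telescopio"))))
without_pp = ("S", juan, ("VP", ("V", "comió"), noun_phrase("la", "manzana")))

treebank = [with_pp, with_pp, with_pp, without_pp]  # hypothetical toy treebank
probs = estimate(treebank)
print(probs[("VP", ("V", "NP", "PP"))], probs[("VP", ("V", "NP"))])
print(tree_probability(with_pp, probs))
```

A probabilistic parser would combine these rule probabilities with a chart algorithm to keep only the highest-probability tree for each span, rather than scoring complete trees one by one as done here.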

However, an important weakness of these models is the strong independence assumptions they make: the generative model assumes that the probability of each possible decomposition of a verb phrase (VP) is independent of the context in which it appears, which is evidently not true in real language. To mitigate the problem, various grammar refinement techniques have been proposed, including lexicalization [6], markovization [30] or category splitting and merging [45], which achieve improvements in accuracy. Another option is to use mildly context-sensitive grammars, which are more sophisticated grammatical formalisms that can take context into account (to some extent) [27], although this comes at a cost in terms of computational complexity.

Although these grammar-driven statistical parsers provided the first practical results in natural language parsing, the dynamic programming algorithms they use are quite slow (with cubic complexity at best, and achieving speeds of a few sentences per second in practice, given the large size of the automatically extracted grammars they use). In the early 2000s, work began on more lightweight approaches to parsing, with increasing interest in dependency representations (which are simpler, as they do not require the creation of intermediate nodes during parsing) and in purely data-driven models that could be trained directly on treebanks, without needing an explicit grammar.

In particular, most models of this kind can be divided into two large families of systems that have achieved good results and are widely used. The first of these families is that of transition-based dependency parsers [41]. In these algorithms, the parsing process is modelled as a non-deterministic state machine. In each state, the parser must decide between different transitions, which can create dependencies between words, so that the complete process will produce one parse tree or another depending on the sequence of transitions that has been followed. To train models that can choose the appropriate analysis for each sentence, we need a mechanism to score transition sequences and a search algorithm to try to find a high-scoring sequence.
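As an illustration of how a sequence of transitions determines a tree, the following sketch runs a small stack-and-buffer machine on the sentence of Figure 2. The text does not name a specific transition system; the arc-standard system (SHIFT, LEFT-ARC, RIGHT-ARC) is assumed here as one common choice. The transition sequence is supplied by hand, whereas a real parser would choose each transition with a trained scoring model, and the OBJ and DET attachments follow the conventional reading of the figure.

```python
def run_transitions(words, transitions):
    """Apply arc-standard transitions; return a set of (head, dependent, label)."""
    stack, buffer, arcs = [], list(range(len(words))), set()
    for action, label in transitions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The second-from-top word becomes a dependent of the top word.
            dependent = stack.pop(-2)
            arcs.add((words[stack[-1]], words[dependent], label))
        elif action == "RIGHT-ARC":
            # The top word becomes a dependent of the second-from-top word.
            dependent = stack.pop()
            arcs.add((words[stack[-1]], words[dependent], label))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


sentence = ["Juan", "comió", "la", "manzana"]
sequence = [
    ("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "SUBJ"),
    ("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "DET"),
    ("RIGHT-ARC", "OBJ"),
]
print(sorted(run_transitions(sentence, sequence)))
```

Each transition takes constant time and the sequence has a length linear in the number of words, which is why greedy transition-based parsers can be fast.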

In the first transition-based parsers, the most popular scoring models were classifiers such as support vector machines [49] or the structured perceptron [52], trained directly on the dependency analyses contained in a treebank. The model is trained to give a high score to those transitions that lead to a correct tree, yielding a final system that approximates an "oracle" that chooses the best transition on the trees of the training set. The features passed as input to the classifier can capture contextual information, relaxing the independence assumptions with respect to other kinds of models. As a search algorithm to find a suitable transition sequence, one can use greedy search [41] or beam search [52] if speed is a priority. Dynamic programming [32, 22] is also an option if exact inference is to be guaranteed (that is, obtaining a transition sequence with maximum score).

The other main approach to data-driven dependency parsing is called graph-based parsing. Graph-based parsers work by scoring possible fragments of the dependency tree and then putting them together. This can be done either by dynamic programming, in a way similar to parsers based on probabilistic context-free grammars but without using a grammar [10, 46], or by means of a maximum spanning tree algorithm [39].
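The idea of scoring fragments and assembling the best tree can be illustrated with the simplest kind of fragment, a single head-dependent arc. The sketch below is deliberately naive: instead of a maximum spanning tree algorithm, it enumerates every assignment of heads by brute force and keeps the highest-scoring one that forms a tree, which is only feasible for very short sentences. The arc scores are invented for the example; in a real parser they would come from a trained model.

```python
from itertools import product


def is_tree(heads):
    """heads maps each word (1..n) to its head; 0 is an artificial root."""
    for dependent in heads:
        seen, node = set(), dependent
        while node != 0:            # walk up towards the root
            if node in seen:
                return False        # cycle (this also rejects self-loops)
            seen.add(node)
            node = heads[node]
    return True


def best_tree(n, scores):
    """Exhaustively search for the highest-scoring tree over n words."""
    best_total, best_heads = float("-inf"), None
    for choice in product(range(n + 1), repeat=n):
        heads = {d: h for d, h in enumerate(choice, start=1)}
        if not is_tree(heads):
            continue
        total = sum(scores.get((h, d), 0.0) for d, h in heads.items())
        if total > best_total:
            best_total, best_heads = total, heads
    return best_heads, best_total


# Positions: 1 = Juan, 2 = comió, 3 = la, 4 = manzana; 0 = artificial root.
SCORES = {(0, 2): 10.0, (2, 1): 9.0, (2, 4): 8.0, (4, 3): 7.0}  # hypothetical
heads, total = best_tree(4, SCORES)
print(heads, total)
```

The exhaustive search examines (n + 1)^n head assignments, which is why practical graph-based parsers rely on dynamic programming or maximum spanning tree algorithms instead.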

The flexibility of data-driven dependency parsers, their good balance between speed and accuracy, and the recent appearance of dependency treebanks in a large number of languages [42] have made these models the predominant ones in natural language processing. In fact, many of the advances first developed in data-driven dependency parsers have later been adapted to constituency parsers: for example, transition-based models are now also used for constituency parsing, given that they are much faster than grammar-based models [35, 12].

Finally, a recent development that has had a great impact on both constituency and dependency parsing is the application of deep neural networks [36]. To apply these continuous models to the discrete units of meaning that appear in human languages, words are first transformed into vectors of real numbers, called word embeddings [40]. This makes it possible to use deep learning models for all kinds of natural language processing tasks, including parsing. Although using feed-forward networks as the classifier for a transition-based model provides very fast parsers [5], the best accuracy figures at present are achieved with recurrent neural networks, such as LSTMs [23], which provide vector representations of words enriched by their context in the sentence. These networks have been applied successfully to improve the accuracy of various parsing models, including transition-based dependency parsers [29], dependency parsers based on maximum spanning trees [8], dynamic programming dependency parsers [22] and transition-based constituency parsers [34].

He gave a talk yesterday about parsing

Figure 5: Dependency tree for an English sentence with crossing dependencies.

3. Challenges

Thanks to the advances summarized in the previous section, the accuracy of parsing models has been increasing to a point where it is comparable to the degree of agreement between expert annotators on English newswire text [1]. However, the good results in this relatively easy case do not generalize to more complicated languages and domains, so natural language parsing is far from being a solved problem. The computational resources needed by parsing algorithms, and the search for psycholinguistically plausible models, are other challenges faced by this research field, which are summarized below.

3.1. Parsing difficult languages and domains

English, the language to which the parsing literature has traditionally devoted the most effort, is an atypically easy language to parse. Its relatively inflexible word order and its simple morphology, together with the abundance of resources and training data, make it relatively easy to achieve high accuracy in parsing models for English, especially in domains where texts tend to be written in standard language that follows grammatical norms, such as the newswire domain. However, there are several factors that make some languages much harder to parse:

• Crossing dependencies: English has a relatively low proportion of crossing dependencies, those whose arrows cross when drawn above the words of the sentence, as happens with the dependency between "gave" and "yesterday" and the dependency between "talk" and "about" in Figure 5. For this reason, many parsers designed for English (or for other languages with this same characteristic, such as Chinese or Japanese) are projective, that is, they cannot build crossing dependencies at all. This makes these algorithms more efficient, since supporting crossing dependencies requires adding extra transitions or data structures to transition-based models [21, 13], increasing the computational complexity of dynamic programming algorithms [19, 22], using more complex grammars [27] or resorting to parsers based on maximum spanning trees [8]. Besides being slower, algorithms that support crossing dependencies tend to obtain lower accuracy due to the additional difficulty of learning this kind of relation.

• Need for segmentation: dividing sentences into words is a relatively easy task in English, as in other languages that use spaces to separate words, since these are of great help. However, languages such as Chinese are written without spaces between words, which requires a prior segmentation step (division into words) that can introduce additional noise into the parsing process.

• Rich morphology: certain languages, such as Arabic [11] or Turkish [43], are synthetic languages, that is, they have a high number of morphemes that modify the meaning of each word. Morphology interacts strongly with syntax, and this complex morphology makes parsing notably harder.

• Noisy text: the use of language typical of social networks, such as that seen in Twitter messages (tweets), contains numerous linguistic phenomena that diverge from the standard grammatical norm, such as emoticons, hashtags or frequent spelling errors, and requires specific techniques to handle them [31].


• Low-resource languages: an essential factor for obtaining high accuracy in parsing and other natural language processing tasks is the quality and quantity of the training data. Some languages, such as English, Spanish, German or Chinese, have relatively large amounts of annotated data that can be used to train parsing models, but this is not the case for many other languages, especially those with a small number of speakers.
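The notion of crossing dependencies from the first item of the list above can be made precise with a short check. In the sketch below, each arc is a pair (head position, dependent position), and two arcs cross when exactly one endpoint of one lies strictly between the endpoints of the other. The arc set for the sentence of Figure 5 is an assumption: the text only states the two crossing arcs ("gave" to "yesterday" and "talk" to "about"), and the remaining attachments are filled in as a plausible analysis.

```python
from itertools import combinations


def crossing_pairs(arcs):
    """Return the pairs of arcs whose spans cross (1-based word positions)."""
    spans = sorted({tuple(sorted(arc)) for arc in arcs})
    crossings = []
    # After sorting, the first span of each pair starts no later than the second.
    for (a, b), (c, d) in combinations(spans, 2):
        if a < c < b < d:
            crossings.append(((a, b), (c, d)))
    return crossings


# 1 He, 2 gave, 3 a, 4 talk, 5 yesterday, 6 about, 7 parsing
figure5_arcs = [(2, 1), (2, 4), (4, 3), (2, 5), (4, 6), (6, 7)]  # assumed analysis
print(crossing_pairs(figure5_arcs))

# 1 Juan, 2 comió, 3 la, 4 manzana
figure2_arcs = [(2, 1), (2, 4), (4, 3)]
print(crossing_pairs(figure2_arcs))
```

A tree for which this function returns an empty list has no crossing arcs, which is the kind of tree that projective parsers are restricted to building.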

The interested reader can consult [51] to see the accuracies obtained on different languages and datasets by current state-of-the-art parsers that took part in a recent shared task. Accuracy ranges from more than 90% of correct dependencies in several languages to less than 30% for Kazakh, a Turkic language with synthetic morphology for which the available training data are very scarce.

3.2. Parsing faster

Although current parsing algorithms are accurate enough to be useful in end applications, at least for some languages and domains, their requirements in terms of computation time are still an important obstacle that limits the widespread adoption of this technology and the extension of its applications. The most accurate parsers based on constituency grammars process fewer than 5 sentences per second on recent CPUs [33], while data-driven neural parsers achieve speeds of a few tens of sentences per second (see, for example, the figures in [12]). Dependency parsing can be somewhat faster, but in any case it is difficult to achieve more than 100 sentences per second with recent neural models without resorting to parallel hardware [29].

These speeds may be sufficient for applications that only need to parse one or a few sentences at a time, such as dialogue systems, but they are prohibitive for large-scale parsing (for example, if one wants to parse large collections of documents obtained from the Internet). The problem is even more serious in languages that pose some of the additional challenges explained above, such as a significant proportion of crossing dependencies or a rich morphology, which makes the computational requirements even higher [19].

The FASTPARSE project, funded by the ERC and currently in progress [20], aims to achieve faster parsers by combining techniques from computer science and mathematics (using case-based reasoning to reuse previous partial results instead of parsing already known substructures again), cognitive science (creating parsing models inspired by how humans solve the task) and linguistics (analysing patterns in the annotations and exploiting them to create faster algorithms). In this way, we aim to remove a fundamental bottleneck in order to open natural language processing up to large-scale application, even without the need for the enormous amounts of computational resources that only large technology companies can deploy.

3.3. Psycholinguistic plausibility

A third challenge related to natural language parsing is that of obtaining parsers that are psycholinguistically plausible, that is, that parse sentences in a way similar to how humans do.

On the one hand, this is important because it can help to advance our knowledge and understanding of language processing by humans, and of the evolution of language itself. Recently, the growing availability of syntactically annotated data (treebanks) has opened up a quantitative field of research on the universal properties of the syntax of human languages: the analysis of treebanks has provided evidence that languages tend to order words in a way that minimizes the length of dependencies [16], and has shown the relation between that length and the presence of crossing dependencies [14], whose scarcity, taken for granted by many linguists for decades, has only recently been confirmed with solid statistical evidence [15]. A more detailed knowledge of the quantitative and statistical aspects of human syntax, which for now is in its infancy, could in turn help to design parsing models that are better suited to the kinds of structures that appear in real sentences.

On the other hand, as I recently observed in [18], existing parsing strategies that (even unintentionally) resemble models of human language processing, in aspects such as parsing sentences from left to right or using a restricted amount of memory, tend to yield accurate and efficient models. It seems reasonable to hypothesize that, since human languages evolved together with the human mind, and in such a way that humans could process them efficiently, it should be possible to obtain excellent parsing models by imitating human processing more closely.
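
As an illustration of what left-to-right analysis with limited working memory can look like, the following is a minimal sketch of an arc-standard-style transition system. Words are read once, from left to right, and the only memory is a stack of partially processed words. The transition sequence is supplied by hand here, whereas a real parser would use a trained model to choose each transition. The function name and the example sentence are assumptions for illustration and do not come from [18].

```python
def parse_left_to_right(words, transitions):
    """Apply arc-standard-style transitions and return the head list.

    Returned encoding (an assumption of this sketch): position i holds
    the 1-based head of word i + 1, and 0 marks the root.
    """
    stack = []                                # words still awaiting attachment
    buffer = list(range(1, len(words) + 1))   # unread words, left to right
    heads = [0] * len(words)

    for transition in transitions:
        if transition == "SHIFT":
            # Read the next word and push it onto the stack.
            stack.append(buffer.pop(0))
        elif transition == "LEFT-ARC":
            # The second-topmost word becomes a dependent of the topmost.
            dependent = stack.pop(-2)
            heads[dependent - 1] = stack[-1]
        elif transition == "RIGHT-ARC":
            # The topmost word becomes a dependent of the one below it.
            dependent = stack.pop()
            heads[dependent - 1] = stack[-1]
        else:
            raise ValueError(f"unknown transition: {transition}")
    return heads


sentence = ["The", "dog", "barked", "loudly"]
sequence = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT",
            "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
print(parse_left_to_right(sentence, sequence))
```

For this sentence the stack never holds more than two words, which gives a rough sense of how such a strategy keeps memory use small.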

4. Conclusions

In this article, I have briefly summarized the research field of natural language parsing, presenting its practical relevance, the main statistical and machine learning approaches that have been applied to it, and my particular view of the main challenges it poses for present and future research, to which I believe special efforts should be devoted in the coming years. These challenges can be a fertile field of research for computer scientists as well as for linguists, cognitive scientists, mathematicians and statisticians.

References

[1] Berzak, Y., Huang, Y., Barbu, A., Korhonen, A. and Katz, B. (2016). Anchoring and agreement in syntactic annotations, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, pp. 2215–2224.

[2] Branavan, S. R. K., Silver, D. and Barzilay, R. (2012). Learning to win by reading manuals in a Monte-Carlo framework, J. Artif. Int. Res. 43(1): 661–704.

[3] Charniak, E. (1996). Tree-bank grammars, Proceedings of the National Conference on Artificial Intelligence, pp. 1031–1036.

[4] Charniak, E. (2000). A maximum-entropy-inspired parser, Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 132–139.

[5] Chen, D. and Manning, C. (2014). A fast and accurate dependency parser using neural networks, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 740–750.

[6] Collins, M. (2003). Head-driven statistical models for natural language parsing, Comput. Linguist. 29(4): 589–637.

[7] Dahl, Ö. (1980). Some arguments for higher nodes in syntax: a reply to Hudson’s ‘Constituency and dependency’, Linguistics 18: 485–488.

[8] Dozat, T., Qi, P. and Manning, C. D. (2017). Stanford’s graph-based neural dependency parser at the CoNLL 2017 shared task, Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp. 20–30.

[9] Earley, J. (1970). An efficient context-free parsing algorithm, Communications of the ACM 13(2): 94–102.

[10] Eisner, J. M. (1996). Three new probabilistic models for dependency parsing: An exploration, Proceedings of the 16th conference on Computational linguistics-Volume 1, Association for Computational Linguistics, pp. 340–345.

[11] Farghaly, A. and Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions, ACM Transactions on Asian Language Information Processing (TALIP) 8(4): 14:1–14:22.

[12] Fernández-González, D. and Gómez-Rodríguez, C. (2018a). Faster shift-reduce constituent parsing with a non-binary, bottom-up strategy, arXiv 1804.07961 [cs.CL].

[13] Fernández-González, D. and Gómez-Rodríguez, C. (2018b). Non-projective dependency parsing with non-local transitions, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, pp. 693–700.

[14] Ferrer-i-Cancho, R. and Gómez-Rodríguez, C. (2016). Crossings as a side effect of dependency lengths, Complexity 21(S2): 320–328.

[15] Ferrer-i-Cancho, R., Gómez-Rodríguez, C. and Esteban, J. L. (2018). Are crossing dependencies really scarce?, Physica A: Statistical Mechanics and its Applications 493: 311–329.

[16] Futrell, R., Mahowald, K. and Gibson, E. (2015). Large-scale evidence of dependency length minimization in 37 languages, Proceedings of the National Academy of Sciences 112(33): 10336–10341.

[17] Gómez-Rodríguez, C. (2010). Parsing Schemata for Practical Text Analysis, Vol. 1 of Mathematics, Computing, Language, and Life: Frontiers in Mathematical Linguistics and Language Theory, Imperial College Press.

[18] Gómez-Rodríguez, C. (2016a). Natural language processing and the Now-or-Never bottleneck, Behavioral and Brain Sciences 39: e74.

[19] Gómez-Rodríguez, C. (2016b). Restricted non-projectivity: Coverage vs. efficiency, Comput. Linguist. 42(4): 809–817.

[20] Gómez-Rodríguez, C. (2017). Towards fast natural language parsing: FASTPARSE ERC Starting Grant, Procesamiento del Lenguaje Natural 59: 121–124.

[21] Gómez-Rodríguez, C. and Nivre, J. (2013). Divisible transition systems and multiplanar dependency parsing, Comput. Linguist. 39(4): 799–845.


[22] Gómez-Rodríguez, C., Shi, T. and Lee, L. (2018). Global transition-based non-projective dependency parsing, Proceedings of ACL, Association for Computational Linguistics, Melbourne, Australia (to appear).

[23] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory, Neural Comput. 9(8): 1735–1780.

[24] Hudson, R. A. (2007). Language Networks: The New Word Grammar, Oxford University Press, Oxford, UK.

[25] Joshi, M. and Penstein-Rosé, C. (2009). Generalizing dependency features for opinion mining, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort ’09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 313–316.

[26] Kahane, S. and Mazziotta, N. (2015). Syntactic polygraphs. A formalism extending both constituency and dependency, Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015), Association for Computational Linguistics, Chicago, USA, pp. 152–164.

[27] Kallmeyer, L. (2010). Parsing Beyond Context-Free Grammars, 1st edn, Springer Publishing Company, Incorporated.

[28] Kasami, T. (1965). An efficient recognition and syntax algorithm for context-free languages, Scientific Report AFCRL-65-758, Air Force Cambridge Research Lab., Bedford, Massachusetts.

[29] Kiperwasser, E. and Goldberg, Y. (2016). Simple and accurate dependency parsing using bidirectional LSTM feature representations, Transactions of the Association for Computational Linguistics 4: 313–327.

[30] Klein, D. and Manning, C. D. (2002). Fast exact inference with a factored model for natural language parsing, Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02, MIT Press, Cambridge, MA, USA, pp. 3–10.

[31] Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C. and Smith, N. A. (2014). A dependency parser for tweets, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp. 1001–1012.

[32] Kuhlmann, M., Gómez-Rodríguez, C. and Satta, G. (2011). Dynamic programming algorithms for transition-based dependency parsers, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), Association for Computational Linguistics, Portland, Oregon, USA, pp. 673–682.


[33] Kummerfeld, J. K., Hall, D., Curran, J. R. and Klein, D. (2012). Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, Jeju Island, Korea, pp. 1048–1059.

[34] Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C., Neubig, G. and Smith, N. A. (2017). What do recurrent neural network grammars learn about syntax?, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, pp. 1249–1258.

[35] Liu, J. and Zhang, Y. (2017). In-order transition-based constituent parsing, arXiv preprint arXiv:1707.05000.

[36] Manning, C. D. (2015). Computational linguistics and deep learning, Computational Linguistics 41(4): 701–707.

[37] Marcus, M. P., Marcinkiewicz, M. A. and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist. 19(2): 313–330.

[38] McCord, M. C., Murdock, J. W. and Boguraev, B. (2012). Deep parsing in Watson, IBM Journal of Research and Development 56(3/4): 3:1–3:15.

[39] McDonald, R., Pereira, F., Ribarov, K. and Hajič, J. (2005). Non-projective dependency parsing using spanning tree algorithms, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (EMNLP 2005), Association for Computational Linguistics, pp. 523–530.

[40] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality, Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, Curran Associates Inc., USA, pp. 3111–3119.

[41] Nivre, J. (2008). Algorithms for deterministic incremental dependency parsing, Computational Linguistics 34(4): 513–553.

[42] Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R. T., Petrov, S., Pyysalo, S., Silveira, N. et al. (2016). Universal dependencies v1: A multilingual treebank collection, LREC.


[43] Oflazer, K. (2014). Turkish and its challenges for language processing, Lang. Resour. Eval. 48(4): 639–653.

[44] Padó, S., Noh, T.-G., Stern, A., Wang, R. and Zanoli, R. (2015). Design and realization of a modular architecture for textual entailment, Natural Language Engineering 21(2): 167–200.

[45] Petrov, S., Barrett, L., Thibaux, R. and Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 433–440.

[46] Pitler, E. (2014). A crossing-sensitive third-order factorization for dependency parsing, Transactions of the Association for Computational Linguistics 2: 41–54.

[47] Song, M., Kim, W. C., Lee, D., Heo, G. E. and Kang, K. Y. (2015). PKDE4J: entity and relation extraction for public knowledge discovery, Journal of Biomedical Informatics 57: 320–332.

[48] Vilares, D., Gómez-Rodríguez, C. and Alonso, M. A. (2017). Universal, unsupervised (rule-based), uncovered sentiment analysis, Knowledge-Based Systems 118: 45–55.

[49] Yamada, H. and Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines, Proceedings of IWPT, Vol. 3, Nancy, France, pp. 195–206.

[50] Yu, M., Gormley, M. R. and Dredze, M. (2015). Combining word embeddings and feature embeddings for fine-grained relation extraction, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Denver, Colorado, pp. 1374–1379.

[51] Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J. et al. (2017). CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies, Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp. 1–19.

[52] Zhang, Y. and Nivre, J. (2011). Transition-based dependency parsing with rich non-local features, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, Association for Computational Linguistics, pp. 188–193.


About the author

Carlos Gómez-Rodríguez holds the position of Profesor Contratado Doctor at the Universidade da Coruña. His research falls within the field of computational linguistics and natural language processing, focusing mainly on parsing and its applications, and also covering other topics such as opinion mining and the evolution of the syntax of human languages. He is the author of one book and 90 peer-reviewed publications, including numerous articles in the leading conferences and journals of computational linguistics. He is principal investigator of a national project and of the European project FASTPARSE (Fast Natural Language Parsing for Large-Scale NLP), funded by a Starting Grant from the European Research Council.
