Mining Urban Land Use Patterns from Volunteered Geographic Information Using Genetic Algorithms and Artificial Neural Networks

Julian Hagenauer and Marco Helbich

GIScience, Institute of Geography, University of Heidelberg, Germany

Keywords: Volunteered Geography, OpenStreetMap, Spatial Data Quality, Spatial Data Mining

This is an Author’s Original Manuscript of an article whose final and definitive form, the Version of Record, has been published in the International Journal of Geographical Information Science [14 Nov 2011] [copyright Taylor & Francis], available online at: http://www.tandfonline.com/doi/abs/10.1080/13658816.2011.619501.

Abstract

OpenStreetMap (OSM), as one of the most promising crowdsourced initiatives, provides volunteered

mapped spatial data. At the same time, this raises several spatial data quality problems, among them

completeness, which involves data omission errors on the one hand and commission errors on the

other. Using European-wide urban land use patterns, this study investigates the first issue and

aims at predicting currently not mapped or partially mapped urban areas based on OSM. For this

purpose, a machine learning approach consisting of genetic algorithms and artificial neural networks

is applied to estimate urban areas. Under the premise of existing OSM data the model estimates

missing urban areas with an overall squared correlation coefficient (R²) of 0.6. Nevertheless,

interregional comparisons of European regions confirm spatial heterogeneity in model quality (R²

ranges from 0.2 up to 0.8) and thus the inherent varying completeness of OSM. Hence, this verifies

the hypothesis that more active volunteers within a region enhance the content of OSM.

1 Introduction

The emergence of internet technologies facilitates the generation and distribution of manifold digital

content (e.g., Flickr, Wikipedia) and thus makes collaborative efforts common and present in

everyday life. The enablement of participatory collaboration caused a paradigm shift, blurring the

distinction between consumers and producers that has been existent in the Web since its early days.

O’Reilly (2005) terms these developments Web 2.0.

Because of the high costs related to the process of gathering and maintaining as well as efforts

involved in sharing and distributing spatial data, these have been almost solely the domain of either

official land surveying offices or commercial companies. Nowadays, the availability of mobile devices

endowed with satellite navigation has enabled people to collect geographic data on their own, at low

costs, and high precision levels, formerly not conceivable for non-experts. Citizens have become

human sensors (Goodchild 2007). Web 2.0 technologies permit volunteers to aggregate, share, and

edit their collected geographic data in a collaborative manner. This phenomenon is usually referred

to as Volunteered Geographic Information (VGI; Goodchild 2007; Elwood 2008), whereas Sui (2008)

metaphorically calls it the "wikification of GIS".

Among a broad list of initiatives dealing with VGI, OpenStreetMap (OSM) is one of the most

promising activities. Since its initiation in 2004, its primary goal is the generation of a free map of the

world through volunteered contributions. Although the generation of maps is still the primary

intention, the collected spatial data is also made publicly available and may thus be used for

individual purposes (e.g., OpenRouteService.org (Neis and Zipf 2008), 3D city models (Over et al.

2010)). User-generated GPS tracks, out-of-copyright maps, and more recently aerial images (e.g., Bing

Maps) serve as primary data sources. The data are distributed under a license that guarantees

freedom of use, but enforces that all derived data are distributed under the same license (Haklay and

Weber 2008, Ramm and Topf 2010).

In general, awareness of the limitations of spatial data is essential, in particular of data

quality issues, which constitute a comprehensive and active research field (e.g., Chrisman 1984;

Buttenfield 1993; Goodchild and Hunter 1996; Goodchild 1998; Shi et al. 2002; Devillers et al. 2010).

Evaluating spatial data quality is an important part in the process of assessing the "fitness for use"

(Chrisman 1993) of a data set for a particular application (van Oort 2003). Furthermore, spatial data

quality has also been addressed by a set of standard definitions and quality criteria proposals from

various organizations such as the National Committee on Digital Cartographic Data Standards

(NCDCDS, Moellering 1987), the International Cartographic Association (ICA, Guptill and Morrison

1995) or the International Organization for Standardization (ISO, Kresse and Fadaie 2004). Based on the

definitions of the Technical Committee 211 of the ISO, the following elements of spatial data quality

can be identified (Kresse and Fadaie 2004):

- Positional accuracy: concerns the accordance of positioning and geometry of an object to its

representation in the real world,

- Attribute accuracy: measures the correctness of attributes assigned to a geographic object,

- Completeness: measures the absence and excess of data,

- Logical consistency: describes the topological correctness and the relationships between

objects in respect to their internal consistency,

- Semantic accuracy: evaluates the correspondence of the interpretation of spatial objects to

their meanings in the real world,

- Temporal accuracy: describes the data actuality in relation to real world changes,

- Lineage: concerns the history of a data set, how it was collected and derived to its actual

state,

- Usage: assesses the extent of a data set to serve its intended purpose.

This contribution focuses on the completeness criterion of spatial quality. Completeness describes

the presence and absence of objects in a data set. Brassel et al. (1995) distinguish between data

completeness and model completeness. The former refers to the measurable errors between the data

set and its specification. Errors may be caused by data that is formally expected to be present

in the database but is missing (omission errors) or by data that is present in the database but not intended to be

included (commission errors). Data completeness is measurable and independent of the application. In

contrast to data completeness, model completeness considers the intended use, i.e. how well the

model of a dataset fulfills the requirements of an application (Brassel et al. 1995). To evaluate

completeness in terms of fitness for use, it is advisable to consider both data completeness as well as

model completeness. The appropriateness of spatial data quality usually depends on the availability

and quality of reference information (Servigne et al. 2010). Concerning VGI in general and OSM in

particular, spatial data quality, namely spatial accuracy, is an important issue (Haklay et al. 2010), and

following Goodchild (2008) the same is valid for completeness.
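
To make the distinction concrete, omission and commission errors can be read as set differences between the objects a specification expects and the objects actually present in the database. The following is a minimal sketch under that reading; the helper name `completeness_errors` and the object identifiers are hypothetical and do not come from the study.

```python
def completeness_errors(expected, present):
    """Return (omission, commission) as sets of object identifiers."""
    expected, present = set(expected), set(present)
    omission = expected - present      # expected by the specification but missing
    commission = present - expected    # present but not intended to be included
    return omission, commission


# Hypothetical identifiers, for illustration only.
expected = {"road_a", "road_b", "park_c"}
present = {"road_a", "park_c", "shed_d"}
omission, commission = completeness_errors(expected, present)
```

Such a comparison presupposes a precise specification of what is expected, which is exactly what is hard to obtain for VGI.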

OSM data quality has been examined in several recent studies (Girres and

Touya 2010; Haklay 2010; Haklay et al. 2010). Haklay (2010) evaluates the positional accuracy of OSM

in reference to Ordnance Survey data for Great Britain based on the methodology of Goodchild and

Hunter (1996), by analyzing the percentage of overlaps between both data sets within a buffer

distance. Girres and Touya (2010) similarly evaluated different aspects of OSM data quality in France.

Both studies showed that in terms of positional accuracy the quality of OSM data is comparable to

traditional geographic datasets from national mapping agencies or commercial providers.

Nevertheless, comprehensive studies concerning completeness of OSM are lacking so far and thus

represent a compelling area of research, which is the main objective of this paper.

The measurement of VGI completeness is a complex task and bears several difficulties. First, VGI

activities address a large user-base, whose motivations for participating, contributing, and using

spatial data differ substantially (Heipke 2010). Second, a strictly defined dataset specification may

contradict the plurality of a user community, but without a precise specification of the dataset it is

not possible to detect errors of completeness (van Oort 2006; Servigne et al. 2010). Third, in OSM

anybody is permitted to add, delete, or modify data. However, mapping guidelines exist that are

recommended to be followed by contributors. The guidelines are communicated, discussed and

modified through a wiki, and reflect the consensus of the community. Because this specification is

not authoritative, it is not possible to measure completeness of OSM in a strict sense. Nevertheless,

several studies comparing the completeness of OSM to other datasets exist (e.g., Girres and Touya

2010; Haklay 2010; Haklay et al. 2010; Zielstra and Zipf 2010). However, those studies only consider

objects of certain types (e.g., roads or rivers) for descriptive measurements.

For OSM it is not guaranteed that a certain object will ever be mapped. On a global scale the digital

divide is an important factor for incomplete mapping of less developed countries (Goodchild 2008;

Maue and Schade 2008). On a local scale the absence of voluntary contributors in disadvantaged

areas is a primary cause for omission errors (e.g., Haklay 2010). In particular, it appears that

completeness in the sense of present features not only depends on the density of information within

an area, but also on the number of contributors, which is generally high in densely populated areas

(Girres and Touya 2010). Third-party contributions to OSM, such as the import of the TIGER data

from the US Census Bureau and the availability of cadastral data from French authorities, improve

the situation of missing data. Beyond that, it is expected that intelligent tools and powerful

visualizations might further be helpful in detecting and fixing spatial data quality issues (e.g., see the

prototype of the web-based attribute visualization tool OSMatrix¹; Roick and Hagenauer 2011).

However, the absence of data may severely impair the fitness for use. For instance, incomplete or

wrongly mapped representations of road networks and the surrounding environment may induce

inaccurate route proposals of a navigation application (e.g., Neis and Zipf 2008). This in turn

negatively affects the users’ activity space and time schedule. In particular, this is true for larger

urban areas, affected by traffic jams, speed limits, and one-way streets, where it is often advisable to

take an alternative bypass route.

¹ http://osmatrix.uni-hd.de

OSM exhibits implicit information that can be used to fill the gap between the contents of the data

set and information needed for the intended application. Urban areas are generally not delineated in

OSM, but there are various possibilities to derive them from existing data. A straightforward solution

is to aggregate land use information to form urban areas. However, especially in sparsely mapped

areas and for small rural communities such land use information is mostly absent in OSM. Based on

the method of Rozenfeld et al. (2008), Jiang and Jia (2011) propose a clustering algorithm to derive

city boundaries from OSM street nodes. Their methodology aggregates street nodes within a certain

distance to clusters. The choice of the distance has a crucial effect on the result. Their approach

validates Zipf’s law, which relates the size of cities to their ranks. However, their approach

ignores other OSM data and inherent non-linear relationships. Furthermore, both approaches derive

only crisp urban areas, but in urban geography there is strong evidence that metropolitan areas are

nowadays shaped by sub- and postsuburbanization processes (e.g., Helbich and Leitner 2009, 2010),

causing a continuous transition between urban and rural areas, and cannot be delineated by a crisp

and dichotomous classification scheme (Leung 1987). A continuous, density-based

representation of urban areas is more suitable but is difficult to obtain due to complex relationships

of urbanization processes.

Developments in artificial intelligence offer a remedy and have the potential to solve geographical problems

that were previously difficult to solve (Smith 1984; Gahegan 2003). In particular, Artificial Neural

Networks (ANNs) are appealing for spatial analysis (Openshaw and Openshaw 1997), because of their

computational speed, representational flexibility, ability to model non-linear relationships, and

computational adaptivity (Fischer 1997). ANNs perform particularly well compared to conventional

statistical models if the data are incomplete or inconsistent (Fischer and Gopal 1994), which is often

the case with complex spatial data such as OSM. In GIScience, ANNs have already shown high

potential for modeling complex geographic processes (e.g., Fischer and Gopal 1994; Pijanowski et al.

2002; Mas et al. 2004).

This paper makes an initial empirical contribution and charts current urban patterns on the basis of

VGI. Therefore, the objective of this study is to develop a density-based methodological framework

to delimitate continuous urban areas using the whole information diversity of OSM. To capture

possible non-linear relationships, interactions, and spatial effects within the GIS-based data, ANN

techniques are applied. The framework mitigates data completeness issues of OSM and thus helps to

improve the fitness for use of OSM for individual applications. The usefulness is demonstrated on a

set of selected European urban regions.

The paper is structured as follows: Section 2 provides an overview concerning the study area and the

data sets. Section 3 introduces the methodology used to detect urban patterns. Results of the

empirical analysis are discussed in section 4 and the paper concludes (section 5) with a discussion of

the results and identifies future work.

2 Materials

2.1 Study Site and Data

Training of ANNs requires reference information for learning. For the European Union (EU), two

publicly available urban land use datasets are predominant: CORINE (Coordination of information on

the environment) Land-Cover (CLC) data and Global Monitoring for Environment and Security Urban

Atlas (GMESUA) data. The former has the advantage that it is fully available for the whole territory of

the EU but has a minimum mapping unit of 25 hectares and a minimum width of linear elements of

100 meters. Therefore, it is only suitable for small scale mapping applications. The second

alternative, the GMESUA data, are a product of a joint initiative of the European Commission and the

European Space Agency. As of the first quarter of 2011, the GMESUA data set covers 242 urban

regions within Europe, which differ in socio-economic and demographic factors.

The acquisition of GMESUA is based on SPOT-5 satellite images with a 10 m multispectral and 2.5 m

panchromatic pixel resolution. The multispectral data includes a near-infrared band. Compared to

CLC, the data has a considerably finer resolution: linear elements with a width of 10 m are mapped

and the minimum mapping unit is 0.25 ha for urban areas and 0.55 ha for non-urban areas. Thereby, 44

different land use categories are distinguished. The advantage of high spatial and thematic resolution

is reduced by the fact that the dataset is only available for selected urban areas with more than

100,000 inhabitants in 27 different countries at a scale of 1:10,000. Figure 1 illustrates the selected

urban regions. To keep computation feasible, a random sample of about 20% of the regions,

corresponding to 42 regions, was drawn to generate the training and validation data set.

Figure 1 Countries covered by GMESUA (dark gray) and the 42 randomly selected GMESUA urban

regions (black).

3 Methodology

The proposed research design for the estimation of urban land use patterns comprises

three major consecutive steps:

1. Data preparation: As a first step it is necessary to prepare the OSM and GMESUA data such

that both are usable for model building. In particular, a large set of potential attributes is

derived from OSM for inductive learning and the desired output is calculated from GMESUA

(Section 3.1).

2. Selection and model building: Second, a genetic algorithm (GA) is used to reduce the total set

of attributes to a reasonable subset and an artificial neural network (ANN) is trained with

this subset (Sections 3.2 and 3.3).

3. Sensitivity analysis and model performance: Finally, due to the unknown contribution of the

attributes to the model, their significance is analyzed (Section 3.4) and the model

performance for the different areas is investigated.

3.1 Data preparation

Training data for ANNs consists of a set of training samples. Each sample is a pair of an input vector

and a desired output. Therefore, it is necessary to derive input vectors from OSM data and the

desired output from the reference GMESUA data set to learn the relationship between them.

However, both datasets consist of manifold and diverse information that need to be aggregated to a

normalized representation, where the choice of the areal units for aggregation is crucial and possibly,

like regression analysis (Fotheringham and Wong 1991), affected by the modifiable areal unit

problem (Openshaw 1984). In this study, aggregation is carried out on a hexagonal raster

representation. This seems more reasonable than squarely shaped cells because hexagonal shapes

can better imitate European urban patterns at every scale. The side length of every hexagonal cell is

250 m for this European-wide analysis, which is a trade-off between computational burden and

spatial resolution but still allows the derivation of fine-scaled urban patterns.

3.1.1 GMESUA Urban Regions as Desired Output

GMESUA subsumes continuous and discontinuous urban fabric of built-up areas and its respective

associated land according to its primary use. The latter is further distinguished between degrees of

soil sealing (EEA 2010). To derive urban areas from land use classification, reclassification of the

original data is required, which is an ambiguous and subjective process. While for few classes of

GMESUA a clear distinction between urban and non-urban areas is obvious (e.g., category 1.1.1:

continuous urban fabric with sealing level above 80%), most classes require additional information to

achieve a clear class membership. Here, primarily aerial images and local knowledge were used to

reclassify the GMESUA datasets. For each cell the overlap between the cell and the resulting urban

areas is computed and assigned as an attribute, representing the desired output for the ANN.
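
The overlap computation described above can be sketched as follows. This is an illustration only: it assumes planar coordinates in metres, non-overlapping urban polygons, and uses Sutherland-Hodgman clipping against the convex hexagon as a stand-in for whatever GIS overlay the authors actually used. The function names `hexagon`, `area`, `clip`, and `urban_share` are hypothetical.

```python
import math


def hexagon(cx, cy, side):
    """Vertices of a regular hexagon in counter-clockwise order."""
    return [(cx + side * math.cos(math.radians(60 * k)),
             cy + side * math.sin(math.radians(60 * k))) for k in range(6)]


def area(poly):
    """Absolute polygon area by the shoelace formula."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def clip(subject, convex_clip):
    """Sutherland-Hodgman clipping; convex_clip must be convex and counter-clockwise."""
    out = list(subject)
    for (ax, ay), (bx, by) in zip(convex_clip, convex_clip[1:] + convex_clip[:1]):
        if not out:
            break

        def side(p):
            # Positive if p lies to the left of the directed clip edge a -> b.
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        inp, out = out, []
        for p, q in zip(inp, inp[1:] + inp[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                out.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # parameter of the crossing point on p -> q
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def urban_share(cell, urban_polygons):
    """Fraction of the cell area covered by the (non-overlapping) urban polygons."""
    covered = sum(area(clip(poly, cell)) for poly in urban_polygons)
    return covered / area(cell)


cell = hexagon(0.0, 0.0, 250.0)  # 250 m side length, as in the text
# Hypothetical urban polygon covering everything to the right of x = 0.
urban = [[(0.0, -1000.0), (1000.0, -1000.0), (1000.0, 1000.0), (0.0, 1000.0)]]
share = urban_share(cell, urban)
```

With a side length of 250 m each cell covers roughly 16.2 ha, and the resulting share lies between 0 and 1, which suits a continuous rather than crisp representation of urban areas.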

3.1.2 Derivation of OSM Attributes (Input data)

Basically, OSM presents three different types of information: geometric information, attributive

information, and meta-data (Ramm and Topf 2010). These types potentially contain implicit or

explicit information that can be used for urban pattern detection. A spatial object in OSM is

characterized by its geometric primitive and a set of assigned tags. A tag is a pair of a key and

a value that represents the attributive information of a specific object, e.g., a linear geometry

with the assigned tags highway="primary" and oneway="true" describes a major highway that is

only accessible in one direction. Although the data model of OSM is strictly specified, the tagging is not, in the sense

that every user is permitted to assign arbitrary tags to any object.

Including information about sparsely mapped objects cannot be expected to significantly improve the

total model performance and instead bears the risk of overfitting. Highways and places are

assumed to be generally well mapped for most regions. However, due to the freedom of users to

assign keys and tags at will, the potential number of different highway and place categories is

arbitrarily large and thus requires certain generalization. Therefore, only highways and places with a

reasonably high occurrence in OSM² are considered. Most objects in OSM have meta-information

assigned (e.g., the time of the last edit or the name of the user that has modified an

object last). Haklay (2010) indicated that mapping habits of people within urban areas differ from those of

people residing in rural areas. Thus, it can be expected that this difference is reflected in the meta-

information of spatial objects. The basic descriptive statistics (e.g., minimum, maximum, average) are

calculated from the meta-information of all considered objects within each raster cell. According to the concept of

spatial autocorrelation, geographically close observations depend on each other (Tobler 1970). Thus,

the distance to an object predominantly found in urban areas relates to the urbanization of an actual

raster cell and implicitly includes autocorrelated processes in the model. Hence, it is reasonable to

include the distance to the nearest object of different types, e.g., the nearest highway, as an attribute for each

raster cell. For derivation of geometric and topologic attributes, it is necessary to distinguish between

the geometric primitives of OSM. Of special interest are the properties of lines, mostly representing

roads. It is hypothesized that urban areas show a higher amount of total road length, junctions,

curves, and right angles because of the necessity of dense traffic infrastructures in densely populated

areas; these properties are consequently included as raster cell attributes. Further, graph centrality

(Nieminen 1974) measures the number of nodes that are linked to a given node. Previous studies by Yang and

Harry (2004) and Bak et al. (2010) have documented the capability of this index for street network

analysis. In conclusion, Table 1 gives an overview of the 102 derived statistics and attributes for each

cell.
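
As an illustration of the topological attributes, the sketch below derives node degrees from a list of ways, each given as a sequence of node identifiers, and summarizes them per cell. It rests on assumptions not stated in the text: a junction is taken to be a node linked to at least three other nodes, a road ending is a node linked to exactly one, and the helper names `node_degrees` and `road_statistics` are hypothetical.

```python
from collections import defaultdict


def node_degrees(ways):
    """Map each node id to the number of distinct nodes directly linked to it."""
    neighbours = defaultdict(set)
    for way in ways:
        for a, b in zip(way, way[1:]):
            if a != b:
                neighbours[a].add(b)
                neighbours[b].add(a)
    return {node: len(linked) for node, linked in neighbours.items()}


def road_statistics(ways):
    """Summary attributes for one cell; expects at least one way segment."""
    values = list(node_degrees(ways).values())
    return {
        "junctions": sum(1 for d in values if d >= 3),     # assumed definition
        "road_endings": sum(1 for d in values if d == 1),  # assumed definition
        "min_centrality": min(values),
        "max_centrality": max(values),
        "total_centrality": sum(values),
        "avg_centrality": sum(values) / len(values),
    }


# Hypothetical ways: node 2 is shared by both ways and forms a junction.
ways = [[1, 2, 3], [4, 2, 5]]
stats = road_statistics(ways)
```

In practice the ways would first have to be restricted to the segments that fall within a given hexagonal cell.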

² Frequently used tags: a) highway-tags: residential, unclassified, tertiary, secondary, primary, motorway, motorway_link, steps, trunk, path, track, footway, service, living_street, cycleway; b) place-tags: town, hamlet, village, suburb, locality.

Table 1 Derived OSM attributes for each cell³

Attributes for selected highway types:
- Length
- Distance
- Curviness
- Number of waypoints

Aggregated attributes for selected highways:
- Number of junctions
- Number of junctions with at least one right angle curve
- Number of roads with right angle curves
- Number of road endings
- Min./Max./Total/Avg. angle(s)
- Min./Max./Total/Avg. centrality

Attributes for selected place types:
- Number of points
- Distance

Aggregated attributes for selected highways and places:
- Number of objects
- Min./Max./Avg. version number(s)
- Earliest/Latest/Avg. time of modification(s)
- Total/Min./Max./Avg. number of user contributions
- Min./Max./Total/Avg. number of object tags

3.2 Genetic Algorithm for Attribute Selection

As outlined in Section 3.1, several cell-based OSM attributes are calculated. The performance of ANN

models when learning a regression function depends on the choice of attributes (Pyle 1999). The

attributes implicitly define a pattern language (Yang and Honavar 1997). If the language is not

expressive enough, a model will fail to capture the information necessary to approximate the target

function. Conversely, if the language is too expressive, the computational time to learn the model

increases and the model becomes vulnerable to overfitting. Due to the large number of attributes and the non-linear

relationships between the attributes, heuristics are promising to obtain near-optimal attribute sets

for ANN model building (Siedlecki and Sklansky 1989).

Attribute selection approaches can be categorized as follows (Liu and Yu 2004): The filter approach

uses statistics to measure the relevance of attributes. It is totally independent of the learning

algorithm, thus it is computationally more efficient than the wrapper approach, which involves

computational overhead by executing the training algorithm for every presented attribute set and

evaluating the results. Attribute selection may not be independent of the learning algorithm, which is

ignored by the filter approach. In contrast, the wrapper approach takes the properties and biases of

the inductive learning algorithm into account. Because the wrapper approach is generally

computationally demanding, genetic algorithms (GA) are especially promising for attribute selection

(Siedlecki and Sklansky 1989).

³ For selected highways and places see footnote 2.

A GA is a heuristic optimization method, simulating natural evolution processes in analogy to biology

(Holland 1975; Goldberg 1989). GAs represent a potential solution as an individual. Each individual is

encoded by a chromosome, comprised of a set of genes. The chromosomes are often coded as a

binary string. A set of individuals constitutes a population. This population iteratively evolves, until a

stop criterion is reached. At each iterative step of the GA the fitness of the individuals of the current

population is measured. Afterwards, the population for the next iterative step is built by selecting,

recombining, and mutating the most promising individuals of the current population (Mitchell 1998).

Because only promising individuals take part in the evolutionary process, it is likely that near-optimal

solutions emerge. The final solution is chosen from the individuals of the last population. In contrast

to gradient-descent optimization, multiple solutions are maintained in parallel within a population,

allowing interactions among them to explore regions in the search space between them (Qi et al.

1994). The goodness of a solution, i.e., the fitness of an individual, is numerically evaluated by a

fitness function, which depends on the optimization objective. To utilize GAs for attribute selection

following the wrapper approach, it is necessary to represent different attribute sets as individuals.

For each individual of a population an ANN is trained and its performance is measured, representing

the goodness of the individual.
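
The wrapper idea can be sketched with a deliberately small GA over binary chromosomes, where each gene switches one attribute on or off. This is not the algorithm used in the study: tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter defaults are assumptions chosen for brevity, and the fitness function is a stand-in for training an ANN and measuring its performance.

```python
import random


def select_attributes(n_attributes, fitness, pop_size=20, generations=30,
                      max_attributes=5, mutation_rate=0.05, seed=0):
    """Tiny GA over binary chromosomes; fitness(mask) returns a score to maximise."""
    rng = random.Random(seed)

    def repair(mask):
        # Enforce the size limit by switching off randomly chosen surplus genes.
        on = [i for i, g in enumerate(mask) if g]
        for i in rng.sample(on, max(0, len(on) - max_attributes)):
            mask[i] = False
        return mask

    def tournament(scored):
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    population = [repair([rng.random() < 0.5 for _ in range(n_attributes)])
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(m), m) for m in population]
        best = max(scored, key=lambda s: s[0])[1]
        children = [list(best)]                   # elitism: keep the best individual
        while len(children) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            cut = rng.randrange(1, n_attributes)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(repair(child))
        population = children
    return max(population, key=fitness)


# Stand-in fitness: in the wrapper approach this would train an ANN on the
# selected attributes and return its performance. Here attributes 0-2 are
# simply declared informative and every selected attribute costs a little.
def toy_fitness(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)


best_mask = select_attributes(10, toy_fitness)
```

Because every fitness evaluation in the real setting means training a network, the population size and number of generations dominate the run time.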

3.3 Artificial Neural Network

Artificial neural networks (ANNs) model an interconnected system of neurons, enabling computers to

imitate the brain’s ability to detect patterns and learn relationships within data (Fischer 1998). The

multi-layer perceptron (MLP), introduced by Rumelhart et al. (1986), is one of the most widely used

ANNs (e.g., Fischer and Gopal 1994; Pijanowski et al. 2002; Mas et al. 2004). The MLP usually consists

of three different layers of neurons: the input layer, the hidden layer, and the output layer. Every

connection between neurons of different layers has an assigned weight, scaling input signals passing

through. The input data is first presented to the input layer, and then subsequently passed to the

hidden layer and to the output layer in a feed-forward manner. A neuron receives the weighted

signals from connected neurons of the preceding layer, sums the signals, and calculates an output

signal according to its inner activation function (Bishop 1996).

The crucial part of ANNs is the adaptation of the weights, so that the model is capable of representing a

target function. The most popular way of training an ANN is by modifying its weights using the

backpropagation algorithm (Rumelhart et al. 1986). This algorithm randomly sets the weights and

calculates the resulting output. After all data samples are presented to the network, the sum of the

mean squared error is calculated and the weights are modified according to a generalized delta rule

(Rumelhart et al. 1986), so that the total error is distributed among the various nodes in the network.

This process of feeding forward input signals and back propagating the errors is repeated iteratively,

until a terminating condition is fulfilled (e.g. the error-rate falls below a certain threshold).
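
A minimal sketch of this training loop for a single-hidden-layer MLP is given below. It assumes sigmoid activations in both layers, one weight update per pass over all samples, and a fixed number of epochs as the terminating condition; the class name `MLP`, the toy data, and the default parameters are illustrative assumptions rather than the configuration of the study.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MLP:
    """One hidden layer, sigmoid activations, batch backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Weights start at small random values; the last entry of each row is a bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def mse(self, samples):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)

    def train(self, samples, epochs=1000, learning_rate=0.3):
        for _ in range(epochs):
            grad_out = [0.0] * len(self.w_out)
            grad_hidden = [[0.0] * len(row) for row in self.w_hidden]
            for x, t in samples:                    # feed forward every sample
                hidden, out = self.forward(x)
                delta_out = (out - t) * out * (1.0 - out)
                for j, h in enumerate(hidden + [1.0]):
                    grad_out[j] += delta_out * h
                for j, h in enumerate(hidden):      # propagate the error backwards
                    delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
                    for i, v in enumerate(x + [1.0]):
                        grad_hidden[j][i] += delta_h * v
            scale = learning_rate / len(samples)    # one update per pass (batch mode)
            self.w_out = [w - scale * g for w, g in zip(self.w_out, grad_out)]
            self.w_hidden = [[w - scale * g for w, g in zip(row, grow)]
                             for row, grow in zip(self.w_hidden, grad_hidden)]


# Toy data (hypothetical): the target simply equals the first input.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = MLP(n_in=2, n_hidden=2)
error_before = net.mse(samples)
net.train(samples)
error_after = net.mse(samples)
```

An error threshold could replace the fixed epoch count by checking `mse` after each pass and stopping once it falls below a chosen value.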

It has been shown that ANNs with one hidden layer can theoretically approximate any function

(Hornik et al. 1989). However, a certain degree of freedom must be supplied, i.e. the layer must

consist of a sufficient number of hidden neurons. Generally, the number of hidden neurons is chosen

to balance the trade-off between network bias and variance (Bishop 1995). A limitation in the use of

ANNs is that they provide a “black box” model. It is difficult to gain deep insight into the inner

workings of an ANN by interpreting the weights of the network. Nevertheless, numerical analysis of

different input settings may help to gain insights into the importance of attributes. One primary

advantage of ANNs, compared to the more easily interpretable decision trees, is the ability to model

unknown interactions between input variables, the relationship between such interactions, and any

output pattern (Pyle 1999).

3.4 Significance Analysis

Although the total set of input variables for ANN training is reduced by a GA, the relative

contribution of the remaining attributes to the total model performance is not known. However,

being aware of the importance of the attributes can advance the understanding of the model and its

explanatory capabilities. To evaluate the relative contribution of each attribute to the output, several

methods have been proposed (Gevrey et al. 2003; Olden et al. 2004).

Because of the convergence of the GA to an optimal solution, the genetic diversity of the individuals

within a generation is generally decreasing. Consequently, it is hypothesized that genes important to

the survival of individuals are present in most chromosomes of individuals within a generation, while

unimportant genes are sparse and diverse. Thus, by counting the frequency of the attributes

represented within a generation, the importance of the attributes to the output of the ANN can be

estimated. Another method to measure the importance of an attribute is to measure the change of

root mean square error (RMSE) when sequentially and stepwise setting input neurons to their mean

value (SSMV). The resulting change indicates the relative importance of each attribute (Gevrey et al.

2003). Because the two techniques can lead to different conclusions, both are applied and compared

in the next section.
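The two importance measures described above can be illustrated with a minimal sketch. It assumes that each chromosome is a 0/1 mask over the attributes and that the trained network is available as a plain prediction function; the names `attribute_frequency`, `ssmv_importance`, and `predict` are illustrative and not taken from the study. SSMV is read here as replacing one input at a time by its mean, which is only one possible interpretation of the procedure.

```python
import math
from typing import Callable, Sequence


def attribute_frequency(generation: Sequence[Sequence[int]]) -> list[float]:
    """Percentage of individuals in a generation whose chromosome contains each attribute."""
    n_individuals = len(generation)
    n_attributes = len(generation[0])
    return [
        100.0 * sum(chromosome[j] for chromosome in generation) / n_individuals
        for j in range(n_attributes)
    ]


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def ssmv_importance(
    predict: Callable[[Sequence[float]], float],
    inputs: Sequence[Sequence[float]],
    observed: Sequence[float],
) -> list[float]:
    """Change of RMSE when each input column is replaced by its mean value."""
    baseline = rmse([predict(row) for row in inputs], observed)
    changes = []
    for j in range(len(inputs[0])):
        mean_j = sum(row[j] for row in inputs) / len(inputs)
        # Copy the data with column j fixed to its mean; all other columns stay unchanged.
        fixed = [list(row[:j]) + [mean_j] + list(row[j + 1:]) for row in inputs]
        changes.append(rmse([predict(row) for row in fixed], observed) - baseline)
    return changes


# Hypothetical toy model: only the first input influences the output.
toy_inputs = [[0.0, 5.0], [1.0, 3.0], [2.0, 1.0], [3.0, 7.0]]
toy_model = lambda row: 2.0 * row[0]
toy_observed = [toy_model(row) for row in toy_inputs]
print(ssmv_importance(toy_model, toy_inputs, toy_observed))
print(attribute_frequency([[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 0]]))
```

A large positive change of RMSE marks an attribute the model relies on, whereas a change near zero (or a negative one) marks an attribute that contributes little.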

4 Results

4.1 Model Specification and Overall Quality

For variable selection purposes a non-dominated sorting genetic algorithm (NSGA-1; Srinivas and Deb

1994) was used. The algorithm was allowed to run at most 1000 iterations, but stops earlier if the

performance for 25 iterations does not significantly improve. Each generation consisted of 100

individuals. To reduce the computation time and to limit the resulting model to a reasonable size, the

maximum number of attributes was set to 20 for each individual. Each individual is represented by a trained ANN

model. The training data consist of a subset of 20,000 cells (4% of the whole dataset), randomly

selected from all regions of the dataset (see Sec. 2). The remaining cells are used for testing and

validation. After empirical tests, the final ANN consists of a single hidden layer with (n+1)/2 hidden

neurons, where n is the number of attributes, even though the optimal number of hidden neurons is

not known. The ANN is trained for 1,000 cycles by backpropagation with a learning rate of 0.3. The

final model is selected from the last GA generation, based on the RMSE, the squared correlation

coefficient (R2), Spearman’s rho (RS), and the number of attributes. The resulting model of the GA

optimization consists of 11 remaining attributes (distance to nearest residential road, length of

residential roads, number of waypoints of primary roads, length of motorways, cycleway curviness,

distance to nearest pedestrian road, distance to nearest track, length of tracks, number of junctions,

distance to nearest village, and number of attributes). Overall, the model showed moderate

performance with an RMSE of 0.12, an R2 of 0.6, and an RS of 0.59. Applying the model to the remaining

data, which are independent of model building, allows the investigation of its generalization

capabilities. The result yields a similar performance with an RMSE of 0.12, an R2 of 0.59, and an RS of

0.58. A further residual inspection shows a mean of -0.05 and a standard deviation of 1.03 and

confirms a nearly Gaussian distribution, which supports the adequacy of the model.
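The three quality measures reported above (RMSE, R2 as the squared Pearson correlation, and Spearman's rho) can be sketched as follows. This is a plain standard-library illustration with hypothetical function names, not the code used in the study. The helper `hidden_neurons` assumes integer division for (n+1)/2, which the text does not specify for even n.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def hidden_neurons(n_attributes: int) -> int:
    """Hidden layer size (n+1)/2; integer division is an assumption."""
    return (n_attributes + 1) // 2


# Made-up values for illustration only.
observed = [0.0, 0.1, 0.4, 0.9]
predicted = [0.1, 0.2, 0.3, 0.8]
print(rmse(predicted, observed), pearson(predicted, observed) ** 2, spearman(predicted, observed))
print(hidden_neurons(11))
```

Reporting all three measures together is useful because RMSE depends on the magnitude of the target values, while R2 and Spearman's rho describe how well the predicted pattern follows the observed one.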

4.2 Regional Model Performance

Due to the spatial heterogeneity of geographic processes, it is assumed that the performance of local

models changes if applied to distinct areas. Reasons are, on the one hand, differences in

urbanization, economic power, as well as cultural issues, and, on the other hand, the varying quality

of OSM data underlying the model. To assess the influence of locality on the model performance, the model is

applied to each GMESUA region and its generalization capabilities are examined separately. The

results are summarized in Table 2.

Table 2 Model performance for selected regions within the study area

Urban Region   GMESUA Region code (footnote 4)   % of cells intersecting urban areas   RMSE   R2   RS

Linz At003l 48.8 0.121 0.695 0.650

Varna Bg003l 28.7 0.194 0.353 0.554

Ruse Bg006l 18.4 0.125 0.440 0.483

Brno Cz002l 31.6 0.114 0.646 0.637

Ústí nad Labem Cz005l 35.9 0.127 0.583 0.649

Jihlava Cz014l 21.3 0.070 0.700 0.628

Leipzig De008l 38.6 0.135 0.627 0.700

Bremen De012l 36.0 0.103 0.709 0.626

Darmstadt De025l 35.8 0.126 0.757 0.710

Mönchengladbach De036l 80.4 0.186 0.701 0.837

Koblenz De042l 37.8 0.118 0.743 0.720

Odense Dk003l 45.3 0.118 0.580 0.476

Valencia Es003l 53.1 0.185 0.534 0.678

Toledo Es016l 12.1 0.075 0.337 0.275

Cordoba Es020l 16.5 0.108 0.538 0.492

Bordeaux Fr007l 45.0 0.167 0.528 0.628

Rennes Fr013l 55.0 0.119 0.590 0.449

Besançon Fr025l 31.4 0.101 0.632 0.666

Patrai Gr003l 26.1 0.110 0.667 0.620

Miskolc Hu002l 24.1 0.167 0.423 0.500

Debrecen Hu005l 25.1 0.185 0.313 0.423

Győr Hu007l 24.8 0.154 0.433 0.553

Firenze It007l 40.5 0.117 0.647 0.629

Bologna It009l 42.1 0.137 0.463 0.489

Trieste It015l 57.1 0.169 0.590 0.773

Panevėžys Lt003l 15.8 0.121 0.210 0.284

4 This code allows a clear assignment to the data sets. The first two characters correspond to the particular country according to the ISO 3166 standard.


Luxembourg Lu001l 32.1 0.095 0.683 0.678

Liepāja Lv002l 0.08 0.060 0.361 0.202

Rotterdam Nl003l 62.1 0.193 0.598 0.755

Tilburg Nl006l 60.1 0.206 0.509 0.678

Wrocław Pl004l 29.2 0.129 0.450 0.559

Poznań Pl005l 32.0 0.122 0.569 0.599

Setúbal Pt006l 57.3 0.236 0.155 0.390

Aveiro Pt008l 52.8 0.242 0.129 0.340

Brăila Ro008l 17.6 0.123 0.676 0.484

Călăraşi Ro012l 19.1 0.163 0.442 0.480

Umeå Se005l 0.07 0.054 0.428 0.225

Banská Bystrica Sk003l 17.4 0.072 0.710 0.570

Žilina Sk006l 26.8 0.109 0.589 0.597

Sheffield Uk010l 47.5 0.125 0.789 0.781

Leicester Uk014l 47.9 0.129 0.742 0.639

Wolverhampton Uk028l 53.6 0.158 0.728 0.715

For most regions the model performs comparably well, with a few remarkable exceptions. For

instance, the region of Liepāja, Latvia, has a very low RMSE (0.060), which means that the model

mostly predicts well. However, the low R2 (0.361) and RS (0.202) indicate that the low RMSE is not a

result of a good prediction of urban areas, but is rather caused by the sparse urbanization in that

region. The largest city in this region is Liepāja with a population of about 60,000, and only 3 further

towns with more than 1,000 residents are located in that region. In fact, only 8% of all cells are

covered at least partially by urban areas. Additionally, OSM data is mostly absent, except for the city of

Liepāja itself; thus, the model necessarily failed to predict urban areas.
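Why a low RMSE can coincide with a low R2 in a sparsely urbanized region may be illustrated with synthetic numbers. The sketch below uses made-up cells, not data from the study: predictions are small values unrelated to the true pattern, mimicking a model that lacks informative OSM input. The function `synthetic_region` and all numbers in it are hypothetical.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def squared_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient (R2 as used in the text)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


def synthetic_region(n_cells: int, n_urban: int, density: float = 0.2):
    """Return (predicted, observed) urban density for a made-up region.

    The first n_urban cells are urban with the given density; all others are empty.
    Predictions are small values that ignore the urban pattern entirely.
    """
    observed = [density if i < n_urban else 0.0 for i in range(n_cells)]
    predicted = [0.01 * (i % 3) for i in range(n_cells)]
    return predicted, observed


# 8% urban cells (sparse) versus 53% urban cells (dense), same uninformative predictions.
for label, n_urban in (("sparse", 8), ("dense", 53)):
    pred, obs = synthetic_region(100, n_urban)
    print(label, round(rmse(pred, obs), 3), round(squared_correlation(pred, obs), 3))
```

In the sparse case nearly all cells are empty, so even uninformative predictions close to zero produce a small error, while the correlation reveals that the urban pattern itself is not reproduced. With a larger share of urban cells the same predictions yield a clearly higher RMSE.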

As a second example, the region of Aveiro, Portugal, has an exceptionally high RMSE (0.242). The high

RMSE is caused by a discrepancy between a high density of urbanization and sparse OSM data in that

region. Actually, 53% of the cells for that region are at least partially covered by urban structures.

Figure 2 depicts a section from this region, including the city of Ílhavo. Ílhavo has approximately

16,800 residents. However, virtually no OSM data are mapped for this city (Fig. 2 lower panel). Only

the place and a single primary highway exist, although the comparison to GMESUA reveals that the

region is densely settled (Fig. 2 upper panel). Such a discrepancy is, of course, reflected in the

performance of the model.


Figure 2 Urban areas of Aveiro (Portugal) according to GMESUA (upper panel) and OSM data (lower

panel)

However, for several regions the model performs exceptionally well. In particular, for the region of

Jihlava, Czech Republic, the model results in a very low RMSE (0.07) and also a considerably high R2

(0.7) and RS (0.628; see Table 2), even though 21% of the cells are at least partially covered by urban structures.

The good performance for this region indicates that the quality of the OSM data suffices for the data

requirements of the ANN model.

The OSM data quality of the United Kingdom is well-understood and it is empirically verified that at

least urban areas are well-covered in the sense of completeness (Haklay 2010; Haklay et al. 2010).

These results are supported by this study as well. Generally, high R2 and RS for all urban regions

of the United Kingdom are revealed. One stated key motivation of this work is to improve

completeness of OSM by machine learning, and thus improve its fitness for use. The developed


model is able to predict urban patterns, based on a set of attributes derived from OSM data. Figure 3

(lower panel) shows a successful prediction of urban patterns for Great Glen (United Kingdom),

where the model distinguishes different degrees of urbanization based on the underlying OSM data.

Figure 3 Comparison between the real urban pattern based on GMESUA (upper panel) and the

predicted results (lower panel) for Great Glen, south-east of Leicester, United Kingdom.

4.3 Significance Analysis

To gain further insights into the model, the importance of the model’s attributes is evaluated. The respective ranks for the two proposed methods (Sec. 3.4) are presented in Table 3.

Table 3 Ranking of attribute importance by SSMV and GA

Attribute   SSMV: Change of RMSE   SSMV: Rank   GA: Frequency   GA: Rank

Distance to nearest residential road 0.0195 1 100 1

Length of residential roads 0.0196 2 100 1

Number of Attributes 0.0103 3 90 6

Number of junctions 0.0095 4 32 9

Distance to nearest village 0.0080 5 85 7

Distance to nearest pedestrian road 0.0054 6 92 5

Length of tracks 0.0001 7 100 1

Number of waypoints of primary roads -0.0001 8 7 11

Distance to nearest track -0.0003 9 99 4

Length of motorways -0.0006 10 23 10

Cycleway curviness -0.0014 11 58 8

The results of both methods are largely comparable. Both identify the length of residential roads

and the distance to the nearest residential road as the most significant input variables. Both methods are also comparable at

identifying less important attributes, except for the number of junctions and the distance to the nearest track,

whose ranks are reversed between the two methods. For SSMV it is notable that attributes ranked below 7 (ranks 8 to 11) are

rather unimportant, because setting them to their mean value even slightly improved the results.

However, there are also significant differences in the measurement of importance. SSMV ranked

the number of attributes as very important, while for GA this attribute is only of average importance.

Another major difference between SSMV and GA is the ranking of track-related attributes. For GA the

length of tracks and the distance to the nearest track are very important, while for SSMV these

attributes are rather irrelevant. Furthermore, the importance of junctions is not emphasized by GA.
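The agreement between the two rankings can also be inspected directly from the ranks in Table 3. The following sketch simply orders the attributes by the absolute difference between their SSMV and GA ranks; the ranks are copied from Table 3, while the function name `rank_disagreement` is illustrative.

```python
# (attribute, SSMV rank, GA rank) as listed in Table 3
TABLE_3_RANKS = [
    ("Distance to nearest residential road", 1, 1),
    ("Length of residential roads", 2, 1),
    ("Number of Attributes", 3, 6),
    ("Number of junctions", 4, 9),
    ("Distance to nearest village", 5, 7),
    ("Distance to nearest pedestrian road", 6, 5),
    ("Length of tracks", 7, 1),
    ("Number of waypoints of primary roads", 8, 11),
    ("Distance to nearest track", 9, 4),
    ("Length of motorways", 10, 10),
    ("Cycleway curviness", 11, 8),
]


def rank_disagreement(rows):
    """Sort attributes by the absolute difference between SSMV and GA rank (largest first)."""
    differences = [(name, abs(ssmv_rank - ga_rank)) for name, ssmv_rank, ga_rank in rows]
    return sorted(differences, key=lambda item: item[1], reverse=True)


for name, difference in rank_disagreement(TABLE_3_RANKS):
    print(f"{difference:2d}  {name}")
```

Attributes at the top of this list are those on which the two methods disagree most, which matches the track-related attributes and the number of junctions discussed above.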

5 Discussion

Data production in VGI is carried out by many independent contributors, increasing the chance of

repetitions. It can be expected that a large number of repetitions increases the chance of

errors being detected and fixed in the data, and thus improves the data quality in general (Heipke

2010). For OSM data, Haklay et al. (2010) showed that there is in fact a non-linear relationship

between the number of contributors and the positional accuracy. This assumption does not hold for

every aspect of spatial data quality. For completeness it is anticipated that the number of omission

errors negatively correlates with the number of commission errors because of the diversity of user

requirements. However, the proposed methodology is based on neural computing and aims to detect

urban areas using OSM data. It enables users to derive information implicitly stored in OSM,

superseding explicit storage, and thus making commission errors less likely.

The performance of the developed model varied considerably when applied to the 42 different European

regions, emphasizing the spatial heterogeneity of the data. Local models may provide better results

for individual geographic regions, but do not capture general properties of the data well. Although

the developed model depends on data that is mostly mapped in OSM, it necessarily fails if such data

is missing entirely. Therefore, to apply the presented methodology it is necessary to make assumptions

on the presence of valuable information. Delineating urban patterns is subject to individual

considerations. Thus, assessing the validity of the presented methodology for predicting density-based

patterns of urban areas is a difficult task, although the results are coherent and visually appealing. In

general, the validity of the results must be evaluated with respect to the intended application.

Machine learning can only respond to geographic characteristics if those are encoded in the data

(Gahegan 2003). It is a priori unclear how to encode information in the dataset to give reasonable

results. Furthermore, the settings of the ANN and GA in this study are primarily chosen a priori.

However, a detailed inspection and sensitivity analysis of the different settings bears potential for

making the model more robust and yielding better results (see Patuelli et al. 2011). A fundamental

problem of spatial data analysis on lattice data is the dependence of the results on the underlying

scale and zoning (Openshaw 1984). The influence of the cell size on the results of prediction was not

the subject of this study, but needs more research.

6 Conclusions and Future Work

In this study a methodological framework is proposed for the delineation of continuous urban areas

from OSM data. Each of the framework’s components provides distinct capabilities. The ANN

enables the non-linear estimation of urban patterns from a set of attributes. The GA reduces the total

set of attributes to a reasonable size for inductive learning following the wrapper approach, so that

no precise understanding of the processes is mandatory. The usefulness of the methodology was

demonstrated by applying it to the estimation of urban patterns of Europe.

It was shown that urban patterns can be predicted to a large extent. Explicit information about

urban patterns and urban density is useful for manifold applications, e.g. navigation and spatial

planning tasks, but is currently mostly absent in OSM. By estimating the urban density with the

proposed framework the fitness for use of OSM for applications is improved, leading to additional

benefit for users.

Future work will elaborate on the application of machine learning to OSM data. OSM offers a rich and

manifold source of spatial data, offering a profound pool of implicit information. Extracting this

information from OSM and making it explicit improves the fitness for use and overcomes

completeness issues in the data for manifold applications and requirements. Additionally, machine

learning bears potential for detection of other data quality issues, e.g. positional or semantic errors.

From a methodological point of view it seems fruitful to include further machine learning techniques

that in particular take into account the spatial and temporal properties of OSM data.

Acknowledgements

We acknowledge the valuable feedback received from our colleagues Georg Walenciak, Marcus Götz,

Bernhard Höfle, and Alexander Zipf on an early draft of this paper. In addition, we want to thank

Hannes Taubenböck (German Aerospace Center, DLR) for his useful hints for appropriate remote

sensing data.

References

Bak P, Omer I and Schreck T 2010 Geospatial thinking. Painho, M, Santos, M Y & Pundt, H, (eds) Geospatial Thinking. Springer, Berlin: 25-42

Bishop C M 1995 Neural Networks for Pattern Recognition. New York, Oxford University Press

Brassel K, Bucher F and Stephan E-M 1995 Elements of Spatial Data Quality. Pergamon, Oxford: 81-108

Buttenfield B 1993 Representing data quality. Cartographica: The International Journal for Geographic Information and Geovisualization 30: 1-7

Delavar M R and Devillers R 2010 Spatial data quality: From process to decisions. Transactions in GIS 14: 379-386

Devillers R, Stein A, Bédard Y, Chrisman N, Fisher P and Shi W 2010 Thirty years of research on spatial data quality: Achievements, failures, and opportunities. Transactions in GIS 14: 387-400

EEA 2010 Mapping guide for a European urban atlas. WWW document, http://www.eea.europa.eu/data-and-maps/data/urban-atlas/mapping-guide

Elwood S 2008 Volunteered geographic information: Key questions, concepts and methods to guide emerging research and practice. GeoJournal 72: 133-135


Fischer M M 1998 Computational neural networks: A new paradigm for spatial analysis. Environment and Planning A 30: 1873-1891

Fischer M M and Gopal S 1994 Artificial neural networks: A new approach to modelling interregional telecommunication flows. Journal of Regional Science 34: 503-527

Fotheringham A S and Wong D W S 1991 The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A 23: 1025-1044

Gahegan M 2003 Is inductive machine learning just another wild goose (or might it lay the golden egg)? International Journal of Geographical Information Science 17: 69-92

Gevrey M, Dimopoulos I and Lek S 2003 Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling 160: 249-264

Girres J-F and Touya G 2010 Quality assessment of the French OpenStreetMap dataset. Transactions in GIS 14: 435-459

Goldberg D E 1989 Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA, Addison-Wesley Longman Publishing Co., Inc.

Goodchild M F 2008 Spatial accuracy 2.0. In Proceedings of the Eighth International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Liverpool, World Academic Union: 1-7

Goodchild M F 2007 Citizens as sensors: The world of volunteered geography. GeoJournal 69: 211-221

Goodchild M F and Hunter G J 1997 A simple positional accuracy measure for linear features. International Journal of Geographical Information Science 11: 299-306

Haklay M 2010 How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design 37: 682-703

Haklay M, Basiouka S, Antoniou V and Ather A 2010 How many volunteers does it take to map an area well? The validity of Linus’ law to volunteered geographic information. The Cartographic Journal 47: 315-322

Haklay M and Weber P 2008 OpenStreetMap: User-generated street maps. IEEE Pervasive Computing 7: 12-18

Heipke C 2010 Crowdsourcing geospatial data. ISPRS Journal of Photogrammetry and Remote Sensing 65: 550-557

Helbich M and Leitner M 2010 Postsuburban spatial evolution of Vienna's urban fringe: Evidence from point process modeling. Urban Geography 31: 1100-1117

Helbich M and Leitner M 2009 Spatial analysis of the urban-to-rural migration determinants in the Viennese metropolitan area. A transition from suburbia to postsuburbia?. Applied Spatial Analysis and Policy 2: 237-260


Holland J 1992 Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Cambridge, MA, MIT Press

Hornik K, Stinchcombe M and White H 1989 Multilayer feedforward networks are universal approximators. Neural Networks 2: 359-366

Jiang B and Harrie L 2004 Selection of streets from a network using self-organizing maps. Transactions in GIS 8: 335-350

Jiang B and Jia T 2011 Zipf’s law for all the natural cities in the United States: A geospatial perspective. International Journal of Geographical Information Science (accepted)

Kresse W and Fadaie K 2004 ISO Standards for Geographic Information. Berlin, Springer

Leung Y 1987 On the imprecision of boundaries. Geographical Analysis 19: 125-151

Mas J F, Puig H, Palacio J L and Sosa-López A 2004 Modelling deforestation using GIS and artificial neural networks. Environmental Modelling & Software 19: 461-471

Maué P and Schade S 2008 Quality of geographic information patchworks. In 11th AGILE International Conference on Geographic Information Science.

Neis P and Zipf A 2008 OpenRouteService.org is three times "Open": Combining OpenSource, OpenLS and OpenStreetMaps. In GIS Research UK. Manchester

Nieminen J 1974 On centrality in a graph. Scandinavian Journal of Psychology 15: 322-336

Olden J D, Joy M K and Death R G 2004 An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling 178: 389-397

van Oort P 2006 Spatial Data Quality: From Description to Application. PhD thesis, Wageningen University

Openshaw S and Openshaw C 1997 Artificial Intelligence in Geography. New York, NY, Wiley

Over M, Schilling A, Neubauer S and Zipf A 2010 Generating web-based 3D city models from OpenStreetMap: The current situation in Germany. Computers, Environment and Urban Systems 34: 496-507

Patuelli R, Reggiani A, Nijkamp P and Schanne N 2011 Neural networks for regional employment forecasts: Are the parameters relevant? Journal of Geographical Systems 13: 67-85

Pijanowski B C, Brown D G, Shellito B A and Manik G A 2002 Using neural networks and GIS to forecast land use changes: A land transformation model. Computers, Environment and Urban Systems 26: 553-575

Pyle D 1999 Data preparation for data mining. San Francisco, CA, Morgan Kaufmann Publishers Inc.

Qi X and Palmieri F 1994 Theoretical analysis of evolutionary algorithms with an infinite population size in continuous space. Part I: Basic properties of selection and mutation. Neural Networks, IEEE Transactions on 5: 102-119


Ramm F and Topf J 2010 OpenStreetMap: Using and Enhancing the Free Map of the World. UIT Cambridge

Roick O and Hagenauer J 2011 OSMatrix - Grid-based analysis and visualization of OpenStreetMap. State of the Map 2011, The 5th Annual International OpenStreetMap Conference, Vienna, Austria (submitted)

Rozenfeld H D, Rybski D, Gabaix X and Makse H A 2009 The area and population of cities: New insights from a different perspective on cities. National Bureau of Economic Research, Inc., NBER Working Papers No 15409

Rumelhart D E, Hinton G E and Williams R J 1986 Learning representations by back-propagating errors. Nature 323: 533-536

Servigne S, Lesage N and Libourel T 2010 Quality Components, Standards, and Metadata Fundamentals. In Devillers R and Jeansoulin R (eds) Fundamentals of Spatial Data Quality. London, ISTE: 179-210

Shi W, Goodchild M and Fisher P 2002 Spatial data quality. New York, Taylor & Francis

Siedlecki W and Sklansky J 1989 A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters 10: 335-347

Smith T R 1984 Artificial intelligence and its applicability to geographical problem solving. The Professional Geographer 36: 147-158

Srinivas N and Deb K 1994 Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 2: 221-248

Tobler W R 1970 A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234-240

Yang J and Honavar V 1998 Feature subset selection using a genetic algorithm. Intelligent Systems and their Applications, IEEE 13: 44-49

Zielstra D and Zipf A 2010 A comparative study of proprietary geodata and volunteered geographic information for Germany. 13th AGILE International Conference on Geographic Information Science, Guimaraes, Portugal

Guptill S C and Morrison J L (eds) 1995 Elements of Spatial Data Quality. Oxford, Pergamon

Moellering H (ed) 1987 A Draft Proposed Standard for Digital Cartographic Data. Columbus, OH, National Committee for Digital Cartographic Standards
