
A big data urban growth simulation at a national scale: Configuring the GIS and neural network based Land Transformation Model to run in a High Performance Computing (HPC) environment

Bryan C. Pijanowski a,*, Amin Tayyebi a,b, Jarrod Doucette a, Burak K. Pekin a,c, David Braun d,e, James Plourde a,f

a Department of Forestry and Natural Resources, Purdue University, 195 Marsteller Street, West Lafayette, IN 47907, USA
b Department of Entomology, University of Wisconsin, Madison, WI 53706, USA
c Institute for Conservation Research, San Diego Zoo Global, 15600 San Pasqual Valley Road, Escondido, CA 92027, USA
d Rosen Center for Advanced Computing, Information Technology Division, Purdue University, West Lafayette, IN 47907, USA
e Thavron Solutions, Kokomo, IN 46906, USA
f Worldwide Construction and Forestry Division, John Deere, 1515 5th Avenue, Moline, IL 61265, USA

a r t i c l e   i n f o

Article history:
Received 5 April 2013
Received in revised form 18 September 2013
Accepted 23 September 2013
Available online 7 November 2013

Keywords:
Land use land cover change
Big data simulation
Land Transformation Model
High Performance Computing
Extensible Markup Language
Python environment
Visual Studio 10 (C#)
Continental scale

a b s t r a c t

The Land Transformation Model (LTM) is a Land Use Land Cover Change (LUCC) model which was originally developed to simulate local scale LUCC patterns. The model uses a commercial Windows-based GIS program to process and manage spatial data and an artificial neural network (ANN) program within a series of batch routines to learn about spatial patterns in data. In this paper, we provide an overview of a redesigned LTM capable of running at continental scales and at a fine (30 m) resolution using a new architecture that employs a Windows-based High Performance Computing (HPC) cluster. This paper provides an overview of the new architecture, which we discuss within the context of modeling LUCC that requires: (1) using an HPC to run a modified version of our LTM; (2) managing large datasets in terms of size and quantity of files; (3) integration of tools that are executed using different scripting languages; and (4) a large number of steps necessitating several aspects of job management.

(c) 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The Land Transformation Model was developed over fifteen years ago (Pijanowski et al., 1997, 2000, 2002a,b) to simulate spatial patterns of land use land cover change (LUCC) over time. The model uses geographic information systems (GIS) to process and manage spatial data layers and artificial neural network (ANN) tools to learn about patterns in input (i.e., drivers) and output (e.g., historical land use change) data. The model has been used to forecast LUCC patterns in a variety of places around the world, such as the Midwest USA (Pijanowski et al., 2005), central Europe (Pijanowski et al., 2006), East Africa (Olson et al., 2008; Washington-Ottombre et al., 2010; Pijanowski et al., 2011) and Asia (Pijanowski et al., 2009). Forecasts are often linked to climate (Moore et al., 2010, 2011), hydrologic (Tang et al., 2005a,b; Yang et al., 2010) or biological (Wiley et al., 2010) models to examine how what-if LUCC scenarios impact the environment (e.g., Ray et al., 2011) and/or economics (Skole et al., 2002). The LTM has even been engineered to run "backwards" (Ray and Pijanowski, 2010) in order to examine environmental impacts of historical LUCC or the effects of land use legacies on slow environmental processes, such as groundwater transport through watersheds (Wayland et al., 2002; Pijanowski et al., 2007; Ray et al., 2012). The LTM has recently been extended to simulate and predict urban boundary change (Pijanowski et al., 2009; Tayyebi and Perry, 2013), which can be used by urban planners and managers interested in the control of urban growth.

Modeling, especially in a spatially explicit way, allows for conducting experiments that quantify the importance of various LUCC drivers, contributing to a better understanding of key LUCC processes (Veldkamp and Lambin, 2001; Burton et al., 2008; Pontius and Petrova, 2010; Anselme et al., 2010; Pérez-Vega et al., 2012).

Large-scale LUCC models are needed to understand regional, continental to global scale problems like climate change (Chapman, 1998; Kilsby et al., 2007; Merritt et al., 2003), human impacts to ecosystem services (MEA, 2005), alterations to carbon sequestration (Post and Kwon, 2000) and dynamics of biogeochemical cycling (Boutt et al., 2001; Pijanowski et al., 2002b; Wayland et al., 2002; Turner et al., 2003; GLP, 2005; Fitzpatrick et al., 2007; Anselme et al., 2010; Carpani et al., 2012). One of the characteristics of all LTM applications to date is that the size of the simulation has been small enough to run on a single advanced workstation. However, as models originally designed for local to regional simulations are needed at continental scales or larger, a redesign of single workstation models such as the LTM becomes necessary. Indeed, recent calls by the scientific and policy community for the development and use of large-scale models (e.g., the Earth Systems Science community; e.g., Randerson et al., 2009; Xue et al., 2010; Lagabrielle et al., 2010) underscore the importance of focusing attention on large-scale environmental modeling.

Increasing the size of any computer simulation like the LTM has several challenges (Herold et al., 2003, 2005; Lei et al., 2005; Dietzel and Clarke, 2007; Clarke et al., 2007; Adeloye et al., 2012). First is the need to manage large datasets that are used as inputs and are output by the model (Yang, 2011; Loepfe et al., 2011). The national-scale application of the LTM that we present here simulates LUCC for the lower 48 states of the USA at 30 m resolution. This represents a simulation domain of 1.54 x 10^5 by 9.75 x 10^4 cells (i.e., over 15 billion cells). Additionally, as many as 10 spatial drivers are used per simulation, and up to 10 forecast maps (a 50-year projection with 5-year time steps) are created. Forecasts may also involve multiple forecast scenarios with multiple time steps; a recent LTM application (Ray et al., 2011) compared 36 different LTM forecast scenarios for a single regional watershed for 2000 through 2070 at five-year time steps. Thus, the number of cells within each simulation can be very large and can easily exceed one trillion. A second challenge presented by modeling large regions is the need to create and manage a large number of files in a variety of formats. For the LTM, this requires managing input programs and tools and output files for GIS and neural network software. For the national-scale LTM described below, we used the GIS to split the simulation into over 20,000 census-designated places (e.g., towns, villages and cities), which were then stored in folders in a hierarchical structure. Thus, standards for file naming and use in automated routines are necessary to properly manage numerous files. Third, since we are using a variety of tools, such as ESRI's ArcGIS Desktop and the Stuttgart Neural Network Simulator (SNNS), each with their own scripting language, a higher-level architecture that automates the control of multiple programs is needed with such models. Fourth, since many executions occur during the simulation, knowing when failure occurs is necessary, and thus the status and progress of the simulation needs to be tracked. Finally, given that the LTM contains numerous programs and scripts, a way to manage the processing of jobs becomes necessary. All of these challenges are inherent in what some scientific communities call the big data problem (Lynch, 2008; Hey, 2009; Jacobs, 2009; LaValle et al., 2011). These challenges require solutions that are different from simulations that are executed on single workstations.

In this paper, we describe how we have configured a single workstation version of the LTM to run in a Windows-based High Performance Computing (HPC) environment, producing a version of the Land Transformation Model we call the LTM-HPC. We summarize the important architectural features of this version of the model, providing flow diagrams of the processing steps and maps of data layers used in the simulation, as well as pseudo-code that illustrates how files and routines are handled. This paper will assist others who are interested in (1) using artificial neural networks to learn about patterns in spatial data where data are large and/or (2) using HPC tools to reconfigure an environmental model composed of a series of programs not linked to a graphical user interface.

We organize the remainder of this paper as follows. Section 2 provides an overview of the original LTM, introducing basic modeling terms and summarizing important features of the high performance computing (HPC) environment and the architecture of the current LTM as it is configured for an HPC. Section 3 describes a specific application of the LTM-HPC run at a national scale for urban change for the conterminous USA. The paper concludes by discussing the potential of the LTM-HPC for simulating fine resolution urban growth patterns at large regional scales, as well as the usefulness of such projections.

2. Brief background

2.1. Overview of the Land Transformation Model (LTM) and Artificial Neural Networks

The LTM (Pijanowski et al., 2000, 2002a, 2009) simulates land use/cover change based on socio-economic and bio-physical factors using an Artificial Neural Network (ANN) and a raster GIS modeling environment. Its previous as well as its current architecture, summarized here, is based on scripts and a sequential series of executable programs that provide considerable flexibility in running the model. There are no graphical user interfaces for the model. At the highest level of organization (Fig. 1), the LTM contains six major components: (1) a data preparation set of routines and procedures, many of which are conducted in a GIS; (2) a series of steps called pattern recognition that allow an artificial neural network to learn about patterns in input (drivers of land use change) and output (historical change or no change in land use) data, which are then applied to an independent set of data from which output values are estimated; (3) a sequence of C# and GIS based programs for model calibration; (4) an independent assessment of model performance, or model validation, also written in C# and GIS; (5) routines used for creating future scenarios of land use; and (6) model products and applications conducted within a GIS.

Fig. 1. Main components of the Land Transformation Model.

In the LTM we use a multi-layer perceptron (MLP) ANN within the Stuttgart Neural Network Simulator (SNNS) software to approximate an unknown relation between input (e.g., drivers of change) and output (e.g., locations of change and no change). Typical inputs include distance to roads, slope and distance to previous urban (Fig. 2). Outputs are binary values of change (1) and no change (0) in observed land use maps. Input values are fed through a hidden layer with the number of nodes equal to that of the inputs (see Pijanowski et al., 2002a; Mas et al., 2004). The ANN uses learning rules to determine the weights and the values for bias and activation function to fit the input and output values (Fig. 2) of a dataset. Delta rules are used to adjust all of these values across successive passes of the data; each pass is called a cycle. A mean square error (MSE) is calculated for each cycle from a back propagation algorithm (Bishop, 1995; Dlamini, 2008); values for weights, bias and activation function are then adjusted, and the training is stopped after a global minimum MSE is reached. The process of cycling through is called training. In the LTM we use a small, randomly selected subsample (between 5 and 10%) of the data to train. Applying the weights and the values for bias and the activation functions from a training run to another dataset that contains inputs only, in order to estimate output, is referred to as testing. We conduct a double phase testing with the LTM (Tayyebi et al., 2012) at large scales (e.g., the conterminous USA). The first phase of testing is to use the weights, bias and activation values saved from the training of the subsample and apply the values to the entire dataset. A set of goodness of fit statistics is generated between the predicted and observed maps; we also refer to this testing phase as model calibration. Model calibration also involves a "hold one out" procedure, where each input data layer is held back from the same testing dataset and the goodness of fit of the reduced input models is compared against the full complement model (see Pijanowski et al., 2002a). Thus, for the LTM simulation below, we used five input maps to predict one map of urban change. We held one out at a time to produce five reduced input map models with the same urban change map. These reduced input models are compared with the full complement model of five input maps.

We follow the recommendation of Pontius et al. (2004) and Pontius and Spencer (2005) of validating the model, our second phase of testing, which is done with a different dataset than what is used for the first phase of testing. This independent dataset can be another land use map that was derived from a different source (i.e., a test of generalization) or another year (i.e., a test of predictive ability). It is typical practice (Bishop, 1995) to use different data for training and testing, which is done here.

Forecasting is accomplished using a quantity model developed using per capita land use growth rates and a population growth estimate model (cf. Pijanowski et al., 2002a; Tayyebi et al., 2012). The quantity model can be applied across the entire simulation domain with one quantity estimate per time step, or the quantity model can be applied to smaller spatial units across the simulation domain, as done here.

2.2. High Performance Computing (HPC)

High Performance Computing (HPC) integrates computer architecture design principles, operating system software, heterogeneous hardware components, programs, algorithms and specialized computational approaches to address the handling of tasks not possible or practical with a single computer workstation (Foster and Kesselman, 1997; Foster et al., 2002). A self-contained HPC (i.e., a group of computers) is often referred to as a high performance compute cluster (HPCC) (cf. Cheung and Reeves, 1992; Buyya, 1999; Reinefeld and Lindenstruth, 2001). A main feature of HPCs is the integration of hardware and software systems that are configured to parse large processing jobs into smaller parallel tasks. Hardware resources can be managed at the level of cores (a single processing unit capable of performing work), sockets (a group of cores that have direct access to memory) and nodes (individual servers or computers that contain one or more sockets). The HPCC employed here is specifically configured to control the execution of several batch files, executable programs and scripts for thousands of input and output data files. An HPCC is managed by an administrator, with hardware and software services accessible to many users. HPCCs are systems smaller than supercomputers, although the terms HPC and supercomputer are often used interchangeably.

Fig. 2. Structure of an artificial neural network. (Input nodes: slope, distance to stream, distance to urban, distance to primary road, distance to secondary road; hidden nodes; output node: presence (1) or absence (0) of a land use transition. Weights, bias and an activation function are assigned to estimate output, error from observed data is back-propagated, and a pass forward and back is called a cycle or epoch.)

We controlled all our LTM-HPC programs on the HPCC using the Windows HPC 2008 Server R2 job manager, which has features common to all job managers. The server job manager contains: (1) a job scheduler service that is responsible for queuing jobs and tasks, allocating resources, dispatching the tasks to the compute nodes, and monitoring the status of the job, tasks and nodes; (2) a job description file, configured as an Extensible Markup Language (XML) file, listing both job and task specifications; (3) a job, which is a resource request that is submitted to the job scheduler service and that assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs are also submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.

Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; these are passed to the HPC Server and include project name, user name, job type, and the types and level of hardware resources for the job. Tasks are listed after, as dependencies of the higher-level job; tasks here contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).

Fig. 3. XML job file illustrating the syntax for job parameters, task parameters and task commands.
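As a concrete illustration of this structure, the following short Python sketch generates a job description file of this kind (Section 3.7 notes that a Python script was used to create the XML files). The element and attribute names loosely follow the Windows HPC Server 2008 job schema, and the script name, file names and resource settings are illustrative assumptions, not the authors' actual configuration:

  # Sketch: generate a minimal HPC job description XML (cf. Fig. 3).
  # Element/attribute names and all paths are illustrative assumptions.
  import xml.etree.ElementTree as ET

  def write_forecast_job(state_fips, place_codes, xml_path):
      # Top level: job parameters (name, hardware resource requests)
      job = ET.Element("Job", Name="forecast_%02d" % state_fips,
                       MinCores="6", MaxCores="12")
      tasks = ET.SubElement(job, "Tasks")
      # One task per census place: a command plus input/output path names
      for code in place_codes:
          ET.SubElement(tasks, "Task", Name="place_%s" % code,
                        CommandLine="python forecast.py in\\%s.asc out\\%s_2050.asc"
                                    % (code, code))
      ET.ElementTree(job).write(xml_path)

  # Example: one job for Indiana (FIPS 18) containing two place tasks
  write_forecast_job(18, ["1882862", "1836003"], "job_18.xml")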

3. Architecture of the LTM and LTM-HPC

3.1. Main components

Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., with larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single workstation LTM) and the features that were necessary to reconfigure it for an HPCC. There are several different kinds of programming environments that comprise the single workstation LTM. The first are command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the .BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment that comprises the LTM are compiled programs written to accept environment variables as inputs. These programs are written in the C# or C++ programming language as standalone .EXE files to be executed at the command line. The environment variables for these programs are often the location and name of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of program environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.

Fig. 4 shows the sequence of batch routines, programs and scripts that comprise the LTM, currently organized into the six main model components: data preparation, pattern recognition, calibration, validation, forecasting and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.

Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.2. Data preparation

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4, item 1); as this is done once for each simulation, this LTM component is not automated. A C# program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files, called pattern files (Fig. 4, item 3) and given a .PAT extension; data are transposed into the ANN structure. The data necessary to process files for a training run of the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shape files (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing, to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that is converted for use in pattern recognition, model calibration and model forecasting. The coding of locations with a '4' becomes more obvious below under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance each pixel is from the nearest road. All GIS data for use in the LTM are written out as ASCII flat files (Fig. 5A).
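As a minimal sketch of this step (ours, not the authors' script; the path names are hypothetical and the 250 km distance cap follows Section 4.1), the arcpy calls might look like:

  # Sketch: build a distance-to-roads input grid with Spatial Analyst.
  import arcpy
  from arcpy.sa import EucDistance

  arcpy.CheckOutExtension("Spatial")
  arcpy.env.snapRaster = "D:/ltm/base_landuse.img"  # base map sets alignment
  arcpy.env.extent = "D:/ltm/base_landuse.img"      # standard rows/columns
  arcpy.env.cellSize = 30                           # 30 m national grid

  # Distance from every pixel to the nearest road, capped at 250,000 m
  dist = EucDistance("D:/ltm/roads.shp", maximum_distance=250000)
  arcpy.RasterToASCII_conversion(dist, "D:/ltm/inputs/dist_roads.asc")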

Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are estimated by the neural net in this phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).

A key component of the LTM is converting data from a GIS compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in the input maps for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe also checks the headers of all maps to ensure that they are of the same dimensions.

Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.
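A compact Python rendering of what createpat.exe does conceptually (a sketch under our own naming, not the authors' compiled program) is:

  # Sketch: transpose co-registered grids into pattern rows, standardize
  # inputs to 0.0-1.0 by each layer's maximum, drop excluded cells
  # (code 4) and subsample for training (the text cites 5-10%).
  import numpy as np

  def make_patterns(input_grids, change_grid, exclusion, sample_frac=0.1,
                    rng=np.random.default_rng(0)):
      stacked = np.stack([g / float(g.max()) for g in input_grids])
      usable = exclusion != 4                    # drop excluded locations
      inputs = stacked[:, usable].T              # one row of drivers per cell
      outputs = change_grid[usable]              # 1 = change, 0 = no change
      keep = rng.random(outputs.size) < sample_frac  # random subsample
      return inputs[keep], outputs[keep]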

3.3. Pattern recognition

SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010); the values for initial weights (cf. Denoeux and Lengellé, 1993); the names of the pattern files for input and output; the filename containing the network values; and a set of start and stop conditions (e.g., a stop condition can be set if an MSE value or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called TRAIN.BAT (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias and activation function values are saved every N cycles. An MSE equal to 0.0 is a condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases.

Pseudo-code for TRAIN.BAT is:

  loadNet("ltm.net")
  loadPattern("train.pat")
  setInitFunc("Randomize_Weights", -1.0, 1.0)
  setShuffle(TRUE)
  initNet()
  while MSE > 0.0 and CYCLES < 500000 do
    trainNet()
    if CYCLES mod 100 == 0 then
      print(CYCLES, " ", MSE)
    endif
    if CYCLES == 100 then
      saveNet("100.net")
    endif
  endwhile

We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from the SNNS format to a GIS compatible format (Fig. 4, item 11). The test .PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a .RES (or result) file (Fig. 4, item 9). The .RES file contains the estimates of output created by the neural network. The .RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C# program. All values from the .RES files are between 0.0 and 1.0; convert_ascii.exe stores the values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000, by multiplying the .RES file values by 100,000, so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files through ArcGIS 10.0 are unstable).
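A sketch of this transpose-and-scale step (ours, not the authors' convert_ascii.exe; it assumes the .RES estimates arrive in row-major order for the non-excluded cells) is:

  # Sketch: write 0.0-1.0 estimates back onto the grid as 0-100,000
  # integers in standard ESRI ASCII (.asc) format.
  import numpy as np

  def res_to_ascii(res_values, usable_mask, out_path, cellsize=30,
                   xll=0.0, yll=0.0, nodata=-9999):
      nrows, ncols = usable_mask.shape
      grid = np.full((nrows, ncols), nodata, dtype=np.int32)
      grid[usable_mask] = np.round(np.asarray(res_values) * 100000).astype(np.int32)
      header = ("ncols %d\nnrows %d\nxllcorner %f\nyllcorner %f\n"
                "cellsize %d\nNODATA_value %d"
                % (ncols, nrows, xll, yll, cellsize, nodata))
      np.savetxt(out_path, grid, fmt="%d", header=header, comments="")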

We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and the result is compared to having all inputs in the training. Thus, if we start with a five input variable neural network model, we hold one input out at a time, create calibration time step maps for each, and save error files over the training cycles for each of the reduced input variable models.

3.4. Calibration

For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value. MSE is calculated from the difference between the estimate produced by the neural network (range 0.0-1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles.
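In standard notation (ours; the paper states the definition in prose), over the n training cases this per-cycle error is

  \[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i \in [0,1],\ y_i \in \{0,1\}, \]

where \(\hat{y}_i\) is the neural network estimate and \(y_i\) the observed change value for case i.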

The second set of goodness of fit metrics are those created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed, along with the exclusionary zone map, which is coded so that 0 = location that can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, where 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent the two kinds of error (omission, or false negatives, and commission, or false positives, respectively). The proportion of each type of error and of correctly predicted locations are used to calculate: (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric, or PCM (the proportion of correctly predicted land use changes to the number of observed land use changes; Pijanowski et al., 2002a); (2) sensitivity and specificity, computed from the false positive and false negative proportions; and (3) scaleable PCM values across different window sizes.
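The coding and the pixel-scale metric can be summarized in a few lines (our Python mirror of the scheme described above, not the authors' C# tools):

  # Sketch: 0/1/2/3/4-coded calibration map and pixel-scale PCM.
  import numpy as np

  def calibration_map(observed, predicted, exclusion):
      # observed: 1 = observed change, 0 = none; predicted: 1 = predicted
      # change, 0 = none; exclusion: 4 = excluded, 0 = can change.
      # Sum yields 0 = true negative, 1 = omission (FN), 2 = commission
      # (FP), 3 = true positive, >= 4 = excluded location.
      return observed + 2 * predicted + exclusion

  def pcm(calib):
      tp = np.sum(calib == 3)
      fn = np.sum(calib == 1)
      return tp / float(tp + fn)   # correct change / observed change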

Fig. 6 shows how scaleable PCM values are calculated using a 0/1/2/3/4-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale and window; the number of 3s is incremented by one for every such pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B), and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N x N window is passed across the entire simulation area and the final number of 3s recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 x 5 would be a 7 x 7) and, after all of the windows are considered in the map, the process is repeated (note that the count of 3s is reset to the number of 3s in the entire calibration map) and the number of 3s is saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 x 9, which for this simulation, conducted at 100 m x 100 m resolution, means that PCM reaches 50% at 900 m x 900 m. The scaleable window plots are also made for each reduced input model in order to compare the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.

Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
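A greedy reading of this moving-window procedure can be sketched as follows (ours; the exact pairing order within windows is not specified in the text):

  # Sketch: scaleable PCM at one window size n. Within each n x n window,
  # an unmatched omission cell (1) paired with an unmatched commission
  # cell (2) counts as one more correct prediction; no cell pairs twice.
  import numpy as np

  def scaleable_pcm(calib, n):
      rows, cols = calib.shape
      matched = np.zeros_like(calib, dtype=bool)
      hits = int(np.sum(calib == 3))          # pixel-scale true positives
      for r in range(rows - n + 1):
          for c in range(cols - n + 1):
              block = calib[r:r+n, c:c+n]
              free = ~matched[r:r+n, c:c+n]
              ones = np.argwhere((block == 1) & free)
              twos = np.argwhere((block == 2) & free)
              for (r1, c1), (r2, c2) in zip(ones, twos):
                  matched[r+r1, c+c1] = matched[r+r2, c+c2] = True
                  hits += 1
      observed = int(np.sum((calib == 3) | (calib == 1)))
      return hits / float(observed)           # PCM at window size n

Calling this for n = 3, 5, 7, ..., 101 yields a curve like the one in Fig. 6D.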

The final step of calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change, and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers, or dropping drivers that reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.

3.5. Validation

We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model to compare to a simulated map. Validation metrics (Fig. 4, item 21) include the same ones used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).

3.6. Forecasting

We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, like government units, watersheds or ecoregions, or any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Systems (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.

We next use a program written in C# to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with the highest rank are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while the other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created that would apply additional exclusionary rules to create an alternative scenario.
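A rank-based allocation of this kind can be sketched as follows (ours, in Python rather than the authors' compiled program; ties here are broken by array order):

  # Sketch: convert an integer suitability map to binary change (1 = new
  # urban) by rank until the quantity model's quota for a unit is met.
  import numpy as np

  def allocate_change(suitability, exclusion, n_new_cells):
      flat = suitability.astype(np.int64).ravel()
      flat[exclusion.ravel() == 4] = -1           # excluded cells never convert
      top = np.argsort(flat)[::-1][:n_new_cells]  # highest suitability first
      change = np.zeros(flat.shape, dtype=np.uint8)
      change[top] = 1
      return change.reshape(suitability.shape)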

Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; or Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).

3.7. HPC job configuration

We developed a coding schema for the purposes of running the simulation across multiple locations. We used a standard numbering system from the Federal Information Processing Systems (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns states a two-digit code and a county in those states a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code: two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
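A hypothetical helper (ours; the actual directory layout follows Fig. 9, and the path pieces here are assumptions) shows how these codes compose file names:

  # Sketch: FIPS-based naming for per-place files.
  import os

  def place_file(root, state_fips, place_fips, stage, ext="asc"):
      place_code = "%02d%05d" % (state_fips, place_fips)  # 7-digit place code
      return os.path.join(root, "output", stage,
                          "%02d" % state_fips, place_code + "." + ext)

  # place_file("ltm", 18, 82862, "probability")
  # -> ltm/output/probability/18/1882862.asc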

Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any one of the places would stop the one large job, and the execution would then have to be resumed at the point of failure. A second approach, used here, is to group tasks into numerous jobs, where the number of jobs and associated XML files is still manageable. A failure of one census place then requires less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator to assign names for both input and output files.

Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide each job into tasks; for example, our national simulation described below at county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 shows a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML for a region (state) with its status (item 2). Core resources are shown (Fig. 7, item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.

We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
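In the spirit of polygon_mosaic.py (which the paper does not reproduce), a minimal arcpy sketch with hypothetical paths is:

  # Sketch: convert per-place ASCII grids to rasters and mosaic them.
  import arcpy

  def mosaic_state(asc_files, out_dir, state_fips):
      rasters = []
      for asc in asc_files:
          result = arcpy.ASCIIToRaster_conversion(
              asc, asc.replace(".asc", ".img"), "INTEGER")
          rasters.append(str(result))
      arcpy.MosaicToNewRaster_management(
          rasters, out_dir, "state_%02d.img" % state_fips,
          pixel_type="32_BIT_SIGNED", number_of_bands=1)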

4. Execution of the LTM-HPC

4.1. Hardware and software description

We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, comprising about a dozen steps, occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.

Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.

The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, with one used for cluster communication and the other for external cluster communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit, which is composed of nodes, or single physical or logical computers with one or more cores, that include one or more processors. All modeling data was read and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 Gigabit bandwidth.

Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.

The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.

Based on the number of files per folder and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (.ASC or .PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.

Fig. 8. Computer systems involved in the LTM-HPC national simulations.

Fig. 9. Directory structure for the LTM-HPC simulation.

4.2. Preliminary tests

The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the .RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files that SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time intensive process. We used Python scripts to overcome this issue, and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.

4.3. Data preparation for the national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance that each pixel was away from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local government and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places is distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS designated census place value.

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as .ASC files. The createpat file was executed per census place to convert the files from .ASC to .PAT.

4.4. Pattern recognition simulations for the national model

We presented a training 1047297le with 284477 cases (ie records or

locations) to the neural network using a feedforward back propa-

gation algorithm We followed the MSE during training saving this

value every 100 cycles We found that the minimum MSE stabilized

globally at 49500 cycles The SNNS network 1047297le (NET 1047297le) was

produced every 100 cycles so that we could analyze the training

later but the network 1047297le for 49500 cycles was saved and used to

estimate output (iepotential fora land usechange to occur at each

location) for testing

Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500.net file was applied to each tile PAT file to create a RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 mean a higher chance of changing). The RES files are converted to ASC files using a C# program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as integers rather than floating point values.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
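The mosaicking step described above can be sketched in a few lines of Python. The sketch below assumes a hypothetical folder of per-tile ASC outputs; the production script likewise wrapped ArcGIS's mosaic tool.

import glob
import arcpy

tiles = glob.glob("D:/ltm/prob_tiles/*.asc")        # the 634 per-tile ASC maps

rasters = []
for i, asc in enumerate(tiles):
    r = "in_memory/tile_{0}".format(i)
    arcpy.ASCIIToRaster_conversion(asc, r, "INTEGER")   # values 0-100,000
    rasters.append(r)

# Mosaic every tile into a single national probability raster.
arcpy.MosaicToNewRaster_management(";".join(rasters), "D:/ltm/national",
                                   "probability.img", number_of_bands=1)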

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full input variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration (1) that all five variables contribute in a positive way toward the goodness of fit and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.
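The percent-difference calculation itself is simple; the following Python sketch shows the arithmetic, assuming each training run logged "cycle MSE" pairs every 100 cycles to a plain text file (the file names are hypothetical).

import numpy as np

full = np.loadtxt("mse_full.txt")               # columns: cycle, MSE
drops = {name: np.loadtxt("mse_drop_%s.txt" % name)
         for name in ["highway", "urban", "slope", "road", "stream"]}

for name, run in drops.items():
    # Positive values mean the reduced model fits worse, i.e. the dropped
    # variable was contributing to model goodness of fit.
    pct_diff = 100.0 * (run[:, 1] - full[:, 1]) / full[:, 1]
    print(name, pct_diff[-1])                   # contribution near the last cycle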

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.
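As a sketch of this step, the TabulateArea call below cross-tabulates the validation codes against the tile zones; the layer, field, and output names are hypothetical.

import arcpy
from arcpy.sa import TabulateArea

arcpy.CheckOutExtension("Spatial")
# Zones are the simulation tiles; classes are the 01234 validation codes.
TabulateArea("tiles.shp", "TILE_ID",
             "valid_01234.img", "VALUE",
             "tile_code_areas.dbf")
# The percent of urban correctly predicted per tile follows from the
# per-code areas in the output table (e.g., code-3 hits vs. codes 1 and 3).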

The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 101 × 101 window size (i.e., approximately 3 km by 3 km) and PCM values were merged with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed a quantity model (Tayyebi et al., 2012) that contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, and all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE jobs file. We ran the XML_Pred_BASE jobs file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level to create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level to create a national forecast map. The HPC also enabled us to export error messages to error files: if any task in a job fails, the standard out and standard error files provide a record of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
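Job creation of this kind can also be scripted. The following Python sketch is patterned on the Microsoft HPC Pack command-line client (job new, job add, job submit) and queues one forecast job per state with one prediction task per polygon; the head-node name, paths, CSV fields, the pred.exe argument list, and the parsing of the echoed job id are all assumptions.

import csv
import subprocess

SCHEDULER = "ltm-headnode"   # hypothetical head node name

def submit_state(state, quantity_csv):
    # Create an empty job on the scheduler; "job new" echoes the new job id.
    out = subprocess.check_output(
        ["job", "new", "/scheduler:" + SCHEDULER,
         "/jobname:pred_" + state]).decode()
    job_id = out.split()[-1]   # assumes the id is the last token echoed
    # One task per polygon: pred.exe converts the polygon's probability map
    # to a forecast map, given the number of cells to grow.
    with open(quantity_csv) as f:
        for row in csv.DictReader(f):
            subprocess.check_call(
                ["job", "add", job_id, "/scheduler:" + SCHEDULER,
                 "/stdout:logs\\{0}_{1}.out".format(state, row["fips"]),
                 "/stderr:logs\\{0}_{1}.err".format(state, row["fips"]),
                 "pred.exe", row["fips"], row["cells_2060"]])
    # Release the queued job to the scheduler.
    subprocess.check_call(["job", "submit", "/id:" + job_id,
                           "/scheduler:" + SCHEDULER])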

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to the topographic limitations within this large metropolitan area. Dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.

5. Discussion

We presented an overview of a single workstation land change model that has been converted to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger sized land change models, especially those that can be coupled to large-scale process models such as climate models (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that, to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas, we had to manage the state as separate counties. Programs written in C# had to read and write lines of data one at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.

The large number of files was managed using a standard file naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Jobs files were constructed for steps that interfaced each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed: this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is the inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time consuming and error prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; the remaining variance in processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output that has multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous enough that several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common, supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States; patterns of location error differed from the quantity errors, being lower in the east (Fig. 12). The location of errors could be important too if it affects the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C# environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files, by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modelling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modelling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann-Zielinska, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: conceptual elements, spatial object class hierarchies, GIS command syntax and an application for Michigan's Saginaw Bay Watershed. In: Land Use Modeling Workshop, USGS EROS Data Center.


and Petrova, 2010; Anselme et al., 2010; Pérez-Vega et al., 2012). Large-scale LUCC models are needed to understand regional, continental to global scale problems like climate change (Chapman, 1998; Kilsby et al., 2007; Merritt et al., 2003), human impacts to ecosystem services (MEA, 2005), alterations to carbon sequestration (Post and Kwon, 2000) and dynamics of biogeochemical cycling (Boutt et al., 2001; Pijanowski et al., 2002b; Wayland et al., 2002; Turner et al., 2003; GLP, 2005; Fitzpatrick et al., 2007; Anselme et al., 2010; Carpani et al., 2012). One of the characteristics of all LTM applications to date is that the size of the simulation has been small enough to run on a single advanced workstation. However, as models originally designed for local to regional simulations are needed at continental scales or larger, a redesign of single workstation models such as the LTM becomes necessary. Indeed, recent calls by the scientific and policy community for the development and use of large-scale models (e.g., the Earth Systems Science community; e.g., Randerson et al., 2009; Xue et al., 2010; Lagabrielle et al., 2010) underscore the importance of focusing attention on large-scale environmental modeling.

Increasing the size of any computer simulation like the LTM has several challenges (Herold et al., 2003, 2005; Lei et al., 2005; Dietzel and Clarke, 2007; Clarke et al., 2007; Adeloye et al., 2012). First is the need to manage large datasets that are used as inputs and are output by the model (Yang, 2011; Loepfe et al., 2011). The national-scale application of the LTM that we present here simulates LUCC for the lower 48 states of the USA at 30 m resolution. This represents a simulation domain of 1.54 × 10^5 by 9.75 × 10^4 cells (i.e., over 15 billion cells). Additionally, as many as 10 spatial drivers are used per simulation, and up to 10 forecast maps (a 50-year projection with 5-year time steps) are created. Forecasts may also involve multiple forecast scenarios with multiple time steps; a recent LTM application (Ray et al., 2011) compared 36 different LTM forecast scenarios for a single regional watershed for 2000 through 2070 at five-year time steps. Thus the number of cells within each simulation can be very large and can easily exceed one trillion. A second challenge presented by modeling large regions is the need to create and manage a large number of files in a variety of formats. For the LTM this requires managing input programs and tools and output files for GIS and neural network software. For the national-scale LTM described below, we used the GIS to split the simulation into over 20,000 census-designated places (e.g., towns, villages and cities), which were then stored in folders in a hierarchical structure. Thus standards for file naming and use in automated routines are necessary to properly manage numerous files. Third, since we are using a variety of tools, such as ESRI's ArcGIS Desktop and the Stuttgart Neural Network Simulator (SNNS), each with their own scripting language, a higher-level architecture that automates the control of multiple programs is needed with such models. Fourth, since many executions occur during the simulation, knowing when failure occurs is necessary, and thus the status and progress of the simulation need to be tracked. Finally, given that the LTM contains numerous programs and scripts, a way to manage the processing of jobs becomes necessary. All of these challenges are inherent in what some scientific communities call the big data problem (Lynch, 2008; Hey, 2009; Jacobs, 2009; LaValle et al., 2011). These challenges require solutions that are different from simulations that are executed on single workstations.

In this paper we describe how we have configured a single workstation version of the LTM to run in a Windows-based High Performance Computing (HPC) environment, producing a version of the Land Transformation Model we call the LTM-HPC. We summarize the important architectural features of this version of the model, providing flow diagrams of the processing steps and maps of data layers used in the simulation, as well as pseudo-code that illustrates how files and routines are handled. This paper will assist others who are interested in (1) using artificial neural networks to learn about patterns in spatial data where data are large and/or (2) using HPC tools to reconfigure an environmental model composed of a series of programs not linked to a graphical user interface.

We organize the remainder of this paper as follows. Section 2 provides an overview of the original LTM, introducing basic modeling terms and summarizing important features of the high performance computing (HPC) environment and the architecture of the current LTM as it is configured for an HPC. Section 3 describes a specific application of the LTM-HPC run at a national scale for urban change for the conterminous USA. The paper concludes by discussing the potential of the LTM-HPC for simulating fine resolution urban growth patterns at large regional scales, as well as the usefulness of such projections.

2. Brief background

2.1. Overview of the Land Transformation Model (LTM) and Artificial Neural Networks

The LTM (Pijanowski et al., 2000, 2002a, 2009) simulates land use/cover change based on socio-economic and bio-physical factors using an Artificial Neural Network (ANN) and a raster GIS modeling environment. Its previous as well as its current architecture, summarized here, is based on scripts and a sequential series of executable programs that provide considerable flexibility in running the model; there are no graphical user interfaces for the model. At the highest level of organization (Fig. 1), the LTM contains six major components: (1) a data preparation set of routines and procedures, many of which are conducted in a GIS; (2) a series of steps, called pattern recognition, that allow an artificial neural network to learn about patterns in input (drivers of land use change) and output (historical change or no change in land use) data, which are then applied to an independent set of data from which output values are estimated; (3) a sequence of C# and GIS based programs for model calibration; (4) an independent assessment of model performance, or model validation, also written in C# and GIS; (5) routines for creating future scenarios of land use; and (6) model products and applications conducted within a GIS.

Fig. 1. Main components of the Land Transformation Model.

In the LTM we use a multi-layer perceptron (MLP) ANN within the Stuttgart Neural Network Simulator (SNNS) software to approximate an unknown relation between input (e.g., drivers of change) and output (e.g., locations of change and no change). Typical inputs include distance to roads, slope, and distance to previous urban (Fig. 2). Outputs are binary values of change (1) and no change (0) in observed land use maps. Input values are fed through a hidden layer with the number of nodes equal to that of the inputs (see Pijanowski et al., 2002a; Mas et al., 2004). The ANN uses learning rules to determine the weights, the values for bias, and the activation function needed to fit the input and output values (Fig. 2) of a dataset. Delta rules are used to adjust all of these values across successive passes of the data; each pass is called a cycle. A mean square error (MSE) is calculated for each cycle from a back propagation algorithm (Bishop, 1995; Dlamini, 2008); values for weights, bias, and activation function are then adjusted, and the training is stopped after a global minimum MSE is reached. The process of cycling through the data is called training. In the LTM we use a small, randomly selected subsample (between 5% and 10%) of the data to train. Applying the weights, the values for bias, and the activation functions from a training run to another dataset that contains inputs only, in order to estimate output, is referred to as testing. We conduct a double phase testing with the LTM (Tayyebi et al., 2012) at large scales (e.g., the conterminous USA). The first phase of testing is to use the weights, bias, and activation values saved from the training of the subsample and apply these values to the entire dataset. A set of goodness of fit statistics is generated between the predicted and observed maps; we also refer to this testing phase as model calibration. Model calibration also involves a "hold one out" procedure, where each input data layer is held back from the same testing dataset and the goodness of fit of the reduced input models is compared against the full complement model (see Pijanowski et al., 2002a). Thus, for the LTM simulation below, we used five input maps to predict one map of urban change; we held one input out at a time to produce five reduced models, each sharing the same urban change map, and compared these reduced models with the full complement model of five input maps.

We follow the recommendation of Pontius et al. (2004) and Pontius and Spencer (2005) of validating the model (our second phase of testing), which is done with a different dataset than that used for the first phase of testing. This independent dataset can be another land use map that was derived from a different source (i.e., a test of generalization) or from another year (i.e., a test of predictive ability). It is typical practice (Bishop, 1995) to use different data for training and testing, which is done here.

Forecasting is accomplished using a quantity model developed using per capita land use growth rates and a population growth estimate model (cf. Pijanowski et al., 2002a; Tayyebi et al., 2012). The quantity model can be applied across the entire simulation domain with one quantity estimate per time step, or the quantity model can be applied to smaller spatial units across the simulation domain, as is done here.
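The idea behind such a quantity model can be illustrated with a stylized Python sketch: per capita urban land multiplied by projected population change gives the number of 30 m cells each spatial unit must grow per time step. The function, field values, and example numbers below are hypothetical.

CELL_AREA_HA = 0.09                      # one 30 m x 30 m cell in hectares

def cells_to_grow(pop_now, pop_future, urban_ha_now):
    # Demand for new urban cells over one time step for one polygon.
    per_capita_ha = urban_ha_now / float(pop_now)   # ha of urban per person
    new_urban_ha = per_capita_ha * (pop_future - pop_now)
    return max(0, int(round(new_urban_ha / CELL_AREA_HA)))

# Example: a place of 20,000 people with 900 ha of urban, growing to 23,000
print(cells_to_grow(20000, 23000, 900))  # -> 1500 cells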

2.2. High Performance Computing (HPC)

High Performance Computing (HPC) integrates computer architecture design principles, operating system software, heterogeneous hardware components, programs, algorithms, and specialized computational approaches to address the handling of tasks not possible or practical with a single computer workstation (Foster and Kesselman, 1997; Foster et al., 2002). A self-contained HPC (i.e., a group of computers) is often referred to as a high performance compute cluster (HPCC) (cf. Cheung and Reeves, 1992; Buyya, 1999; Reinefeld and Lindenstruth, 2001). A main feature of HPCs is the integration of hardware and software systems that are configured to parse large processing jobs into smaller parallel tasks. Hardware resources can be managed at the level of cores (a single processing unit capable of performing work), sockets (a group of cores that have direct access to memory), and nodes (individual servers or computers that contain one or more sockets). The HPCC employed here is specifically configured to control the execution of several batch files, executable programs, and scripts for thousands of input and output data files. An HPCC is managed by an administrator, with hardware and software services accessible to many users. HPCCs are systems smaller than supercomputers, although the terms HPC and supercomputer are often used interchangeably.

We controlled all our LTM-HPC programs on the HPCC using the Windows HPC 2008 Server R2 job manager, which has features common to all job managers. The server job manager contains: (1) a job scheduler service that is responsible for queuing jobs and tasks, allocating resources, dispatching the tasks to the compute nodes, and monitoring the status of the jobs, tasks, and nodes; (2) a job description file, configured as an Extensible Markup Language (XML) file, listing both job and task specifications; (3) a job, which is a resource request that is submitted to the job scheduler service and that assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs are also submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.

Fig. 2. Structure of an artificial neural network (input nodes such as slope and distances to streams, urban, and primary and secondary roads; hidden nodes; and an output node coding presence = 1 or absence = 0 of a land use transition; a forward and backward pass is called a cycle or epoch).

Fig. 3. XML job file illustrating the syntax for job parameters, task parameters, and task commands.

Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; these are passed to the HPC Server and include the project name, user name, job type, and the types and level of hardware resources for the job. Tasks are listed afterward as dependencies of the higher-level job; tasks here contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).
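Because the jobs files are plain XML, they can be generated programmatically. The Python sketch below writes a schematic per-state job description in the spirit of Fig. 3; the element and attribute names are simplified stand-ins, not the full HPC Pack 2008 job schema, and the place codes and script names are hypothetical.

import xml.etree.ElementTree as ET

job = ET.Element("Job", Name="LTM_pred_Indiana", Project="LTM-HPC",
                 MinCores="8", MaxCores="64")
tasks = ET.SubElement(job, "Tasks")
for fips in ["1803000", "1825000"]:            # hypothetical place codes
    # One task per spatial unit: run the prediction script on its inputs.
    ET.SubElement(
        tasks, "Task",
        CommandLine="python pred.py {0}.asc {0}_2050.asc".format(fips),
        StdOutFilePath="logs\\{0}.out".format(fips),
        StdErrFilePath="logs\\{0}.err".format(fips))

ET.ElementTree(job).write("job_indiana.xml")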

3. Architecture of the LTM and LTM-HPC

3.1. Main components

Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., with larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single workstation LTM) and the features that were necessary to reconfigure it for an HPCC. Several different kinds of programming environments comprise the single workstation LTM. The first are command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the .BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment that comprises the LTM is compiled programs written to accept environment variables as inputs. Programs are written in the C# or C++ programming language as standalone .EXE files to be executed at the command line. The environment variables for these programs are often the location and name of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of programming environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs, and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.

Fig. 4 shows the sequence of batch routines, programs, and scripts that comprise the LTM, currently organized into the six main model components: data preparation, pattern recognition, calibration, validation, forecasting, and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use/cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.

3.2. Data preparation

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4, item 1); as this is done once for each simulation, this LTM component is not automated. A C# program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files, called pattern files (Fig. 4, item 3) and given a .PAT extension; data are transposed into the ANN structure. Data necessary to process files for the training run for the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shapefiles (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water, or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that are converted for use in pattern recognition, model calibration, and model forecasting. The coding of locations with a '4' becomes more obvious below under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance each pixel is from the nearest road. All GIS data for use in the LTM are written out as an ASCII flat file (Fig. 5A).

Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
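As an illustration of one driver-layer recipe, the following Python sketch computes distance to roads with the geoprocessing environment pinned to the base land use grid, then writes the ASCII flat file; the paths are hypothetical.

import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")
base = "D:/ltm/landuse_1990.img"      # base land use map sets the dimensions
arcpy.env.snapRaster = base
arcpy.env.extent = base
arcpy.env.cellSize = 30

dist = EucDistance("D:/ltm/roads.shp")   # meters from each pixel to a road
dist.save("D:/ltm/inputs/dist_roads.img")
arcpy.RasterToASCII_conversion(dist, "D:/ltm/inputs/dist_roads.asc")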

Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are estimated by the neural net in this phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).

A key component of the LTM is converting data from a GIS compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in the input maps for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe also checks the headers of all maps to ensure that they are of the same dimensions.
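A simplified Python sketch of this conversion follows: stack the ASCII driver grids, drop excluded cells (code 4), rescale each driver by its maximum, and write one input/output record per remaining cell. The file names are hypothetical, and the real SNNS .PAT format also carries a header that this sketch omits.

import numpy as np

drivers = [np.loadtxt("dist_roads.asc", skiprows=6),   # 6-line ESRI header
           np.loadtxt("slope.asc", skiprows=6)]
exclude = np.loadtxt("exclusion.asc", skiprows=6)
change = np.loadtxt("change_90_00.asc", skiprows=6)    # 0 = none, 1 = change

keep = (exclude != 4)
X = np.column_stack([d[keep] / d[keep].max() for d in drivers])  # 0.0-1.0
y = change[keep]

with open("train.pat", "w") as f:
    for inputs, out in zip(X, y):
        # One line of standardized inputs, then the 0/1 output value
        f.write(" ".join("%.6f" % v for v in inputs) + "\n%d\n" % out)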

3.3. Pattern recognition

SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010); the values for initial weights (cf. Denoeux and Lengellé, 1993); the names of the pattern files for input and output; the filename containing the network values; and a set of start and stop conditions (e.g., a stop condition can be set if a given MSE or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with the MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with the weights, bias, and activation function values are saved every N cycles. An MSE equal to 0.0 is the condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases. Pseudo-code for train.bat is:

loadNet("ltm.net")
loadPattern("train.pat")
setInitFunc("Randomize_Weights", -1.0, 1.0)
setShuffle(TRUE)
initNet()
while MSE > 0.0 and CYCLES < 500000 do
  trainNet()
  if CYCLES mod 100 == 0 then
    print(CYCLES, " ", MSE)
  endif
  if CYCLES == 100 then
    saveNet("100.net")
  endif
endwhile

We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS-compatible format (Fig. 4, item 11). The test.PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a RES (or result) file (Fig. 4, item 9). The RES file contains the estimates of output created by the neural network. The RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values from the RES files are between 0.0 and 1.0; convert_ascii.exe stores values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the RES file values by 100,000, so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files through ArcGIS 10.0 are unstable).
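The scaling itself is simple; the following sketch of ours (the production tool is the C program convert_ascii.exe; names here are hypothetical) shows the conversion of 0.0-1.0 estimates to an integer ASCII grid:

    # RES-to-ASCII scaling: 0.0-1.0 estimates become integers 0-100000 so
    # ArcGIS stores the grid as an integer raster. header_lines is the
    # standard 6-line ASCII-grid header.
    def res_to_ascii(res_rows, header_lines, out_path):
        with open(out_path, "w") as out:
            out.writelines(header_lines)
            for row in res_rows:              # one sequence of floats per row
                out.write(" ".join(str(int(round(v * 100000))) for v in row))
                out.write("\n")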

We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and the result compared to having all inputs in the training. Thus, if we start with a five input variable neural network model, we hold one out at a time, create calibration time step maps for each, and save error files over training cycles for each of the reduced input variable models.
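As a sketch of ours (the input-file and batch-script names are hypothetical; the driver list follows the national application described below), the hold-one-out design is one reduced-input training run per excluded driver:

    # One createpat/batchman run per excluded driver; each input_*.txt is
    # assumed to list every driver except the held-out one.
    import subprocess

    drivers = ["dist_urban", "dist_highways", "dist_roads",
               "dist_streams", "slope"]
    for held_out in drivers:
        tag = "minus_" + held_out
        subprocess.run(["createpat.exe", "input_%s.txt" % tag], check=True)
        subprocess.run(["batchman.exe", "-f", "train_%s.bat" % tag,
                        "-l", "mse_%s.log" % tag], check=True)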

3.4. Calibration

Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.

For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value. MSE is calculated from the difference between the estimates produced by the neural network (range 0.0-1.0) and the observed values of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles.
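Written out (our notation; the paper itself does not give the formula), the MSE tracked at each cycle over the N training cases is

\[
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 ,
\]

where \(\hat{y}_i \in [0,1]\) is the network estimate for location i and \(y_i \in \{0,1\}\) is the observed change.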

The second set of goodness of fit metrics is those created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for the assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, with 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent the two different errors (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric (the proportion of correctly predicted land use changes to the number of observed land use changes), or PCM (Pijanowski et al., 2002a); (2) sensitivity and specificity (based on the proportions of false positives and false negatives); and (3) scaleable PCM values across different window sizes.
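A sketch of ours of the whole-map metrics, assuming the 0/1/2/3/4 coding above (0 = correct no-change, 1 = omission/false negative, 2 = commission/false positive, 3 = correct change, 4 = excluded):

    import numpy as np

    def calibration_metrics(calib):
        tn = np.sum(calib == 0)
        fn = np.sum(calib == 1)
        fp = np.sum(calib == 2)
        tp = np.sum(calib == 3)
        pcm = tp / float(tp + fn)           # correct change / observed change
        specificity = tn / float(tn + fp)   # correct no-change rate
        return pcm, specificity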

Fig. 6 shows how scaleable PCM values are calculated using a 0/1/2/3/4-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale and window; the number of 3s is incremented by one for every such pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B) and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N x N window is passed across the entire simulation area and the final number of 3s is recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 x 5 would be a 7 x 7) and, after all of the windows in the map have been considered, the process is repeated with the count of 3s reset to the number of 3s in the entire calibration map; the final count is saved for each window size.

Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.

Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes; note in this plot that the PCM begins to exceed 50% around a window size of 9 x 9, which for this simulation, conducted at 100 m x 100 m resolution, means that PCM reaches 50% at 900 m x 900 m. The scaleable window plots are also made for each reduced input model in order to relate the behavior of the training of the neural network to the goodness of fit of the calibration maps by input.
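The pairing logic is the subtle part, so we give a sketch of ours (a direct but unoptimized reading of the procedure; call it once per window size):

    # Within each N x N window, unmatched 1s (misses) and 2s (false alarms)
    # are paired, counted as correct at that scale, and marked so they are
    # not counted again as the window slides one column at a time.
    import numpy as np

    def scaleable_pcm(calib, n):
        c = calib.copy()                      # reset the map per window size
        threes = int(np.sum(c == 3))
        observed = threes + int(np.sum(c == 1))   # 3s plus missed changes
        rows, cols = c.shape
        for r in range(rows - n + 1):
            for q in range(cols - n + 1):
                win = c[r:r + n, q:q + n]
                ones = np.argwhere(win == 1)
                twos = np.argwhere(win == 2)
                for (i1, j1), (i2, j2) in zip(ones, twos):
                    win[i1, j1] = 0           # mark the pair as used
                    win[i2, j2] = 0
                    threes += 1
        return threes / float(observed)       # assumes observed change exists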

The final step for calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change, and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or dropping drivers which reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.

3.5. Validation

We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct a historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model for comparison with a simulated map. Validation metrics (Fig. 4, item 21) include the same ones used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).

3.6. Forecasting

We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds or ecoregions, or at any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.

We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with the highest ranks are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while all other cells (code 0) remain without change.
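The rank-based allocation can be sketched as follows (ours, in Python for clarity; the production step is the C program just described):

    # Convert the highest-suitability cells to urban until the unit's quota
    # from the quantity-model CSV is met.
    import numpy as np

    def allocate_change(prob, quota, excluded=None):
        flat = prob.ravel().astype(float)
        if excluded is not None:
            flat[excluded.ravel()] = -1.0     # excluded cells never convert
        # sort descending; a stable sort keeps the original order on ties,
        # mirroring the index kept by the C program
        order = np.argsort(-flat, kind="stable")
        change = np.zeros(flat.size, dtype=np.uint8)
        change[order[:quota]] = 1             # code 1 = change, 0 = no change
        return change.reshape(prob.shape)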

A separate GIS map (Fig. 4, item 25) may be created to apply additional exclusionary rules and create an alternative scenario. Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).

3.7. HPC job configuration

We developed a coding schema for the purposes of running the simulation across multiple locations. We used a standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns states a two-digit code and a county in those states a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana) and places are given a seven-digit code, two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
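The identifiers used throughout the file- and task-naming scheme follow directly from this hierarchy, as a small illustration of ours shows:

    def county_fips(state, county):
        return "%02d%03d" % (state, county)  # e.g. 18157: Tippecanoe County, IN

    def place_fips(state, place):
        return "%02d%05d" % (state, place)   # e.g. 1882862: West Lafayette, IN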

Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any of the places would result in the one large job stopping, and then in the need to resume the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs, where the number of jobs and associated XML files is still manageable; a failure of one census place would require less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator for both to assign names for input and output files.

Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide each job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML for a region (state) with its status (item 2). Core resources are shown (Fig. 7, item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.
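A sketch of ours of such a generator follows; the element and attribute names, the place IDs and the file names are illustrative placeholders, not the exact Windows HPC Server 2008 job schema:

    # Generate one job XML per state with one task per census place.
    from xml.sax.saxutils import escape

    def write_state_job(state_fips, place_ids, command, out_path):
        tasks = "\n".join(
            '  <Task Name="place_%s" CommandLine="%s %s" />'
            % (pid, escape(command), pid)
            for pid in place_ids
        )
        with open(out_path, "w") as f:
            f.write('<Job Name="mosaic_%s">\n%s\n</Job>\n'
                    % (state_fips, tasks))

    write_state_job("18", ["1882862", "1825000"],
                    "python polygon_mosaic.py", "job_IN.xml")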

We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both scripts read the digital spatial unit codes from a variable in the shape file attribute table and name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
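The core of such a mosaicking script might look like the following sketch of ours (paths, unit list and parameter choices are illustrative):

    # Mosaic per-unit ASCII grids into one raster with arcpy.
    import arcpy

    def mosaic_units(unit_ids, in_dir, out_dir, name="suitability_IN.img"):
        rasters = [r"%s\suit_%s.asc" % (in_dir, uid) for uid in unit_ids]
        # integer pixels, because suitability is stored as 0-100000
        arcpy.MosaicToNewRaster_management(
            ";".join(rasters), out_dir, name,
            pixel_type="32_BIT_UNSIGNED", number_of_bands=1)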

4. Execution of the LTM-HPC

4.1. Hardware and software description

We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing of about a dozen steps occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.

Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, with all output set to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
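A sketch of ours of this distance-driver step (the input shapefile name is illustrative; the Spatial Analyst extension is required):

    import arcpy
    from arcpy.sa import EucDistance

    arcpy.CheckOutExtension("Spatial")
    arcpy.env.cellSize = 30                   # 30 m national grid
    # cap distances at 250 km, as done for the national datasets
    dist = EucDistance("highways.shp", maximum_distance=250000)
    dist.save("dist_highways.img")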

The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores; Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, one used for intra-cluster communication and the other for external communication. Each node had 74 GB of hard drive space that was used for the operating system and software but not for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit; it is composed of nodes, i.e., single physical or logical computers containing one or more sockets, each with one or more cores. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum bandwidth of 1 Gigabit.

Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.

The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers. Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
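As an illustration of ours (folder names hypothetical, following the description of Fig. 9), a path-building helper for this scheme could read:

    import os

    def unit_dir(root, io, ftype, scale, unit_id, year=None, scenario=None):
        parts = [root, io, ftype, scale, str(unit_id)]
        if year is not None and scenario is not None:  # forecast outputs only
            parts += [scenario, str(year)]
        path = os.path.join(*parts)
        os.makedirs(path, exist_ok=True)
        return path

    unit_dir("ltm", "output", "asc", "state", 18, year=2050,
             scenario="baseline")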

Fig. 9. Directory structure for the LTM-HPC simulation.

Fig. 8. Computer systems involved in the LTM-HPC national simulations.


4.2. Preliminary tests

The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files that SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue, and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation per core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.

4.3. Data preparation for national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads) and slope.

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS designated census place value.

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
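The per-place loop is simple to script; the following is a sketch of ours (script and file names hypothetical):

    # Clip every driver to a census place, then convert the resulting ASC
    # files to a PAT file with createpat.
    import subprocess

    def prepare_place(place_fips):
        subprocess.run(["python", "polygon_clip.py", place_fips], check=True)
        subprocess.run(["createpat.exe", "input_%s.txt" % place_fips],
                       check=True)

    prepare_place("1882862")    # West Lafayette, Indiana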

4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.

Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500 NET file was applied to each tile PAT file to create a RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 indicate a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables, and five that each contained four input variables, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full input variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit the input and output data. The plot also illustrates how the neural network behaves: between cycle 0 and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop one out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highways contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop one out calibration (1) that all five variables contribute in a positive way toward the goodness of fit, and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0/1/2/3/4-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0/1/2/3/4-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
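The quantity plotted in Fig. 11 is straightforward; as a sketch of ours:

    # Percent difference in MSE between a drop-one-out model and the full
    # five-variable model, per saved cycle (both lists sampled every 100
    # cycles).
    def pct_diff_mse(mse_reduced, mse_full):
        return [100.0 * (r - f) / f for r, f in zip(mse_reduced, mse_full)]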

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0/1/2/3/4-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so this assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.

Fig. 11. Drop one out percent difference in MSE from the full driver model.

The second goodness of fit assessment highlights the use of the HPC for a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0/1/2/3/4-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric was saved at the 101 x 101 window size (i.e., approximately 3 km by 3 km) and PCM values were merged

with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contains the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files so that, if any task in a job failed, the standard output and standard error files provided a record of what each program did during execution; we embedded the paths of the standard output and standard error files in the tasks of the XML jobs file.

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to topographic limitations of the area in the large metropolitan area, whereas dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of the conversion of a single workstation land change model to operate using a high performance computer cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time, rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
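The streaming pattern the C programs use can be sketched in a few lines (ours, in Python for clarity; NODATA handling is omitted for brevity):

    # Process a grid one line at a time so memory use stays constant
    # regardless of raster size.
    def stream_transform(in_path, out_path, transform):
        with open(in_path) as src, open(out_path, "w") as dst:
            for _ in range(6):                # standard 6-line ASCII-grid header
                dst.write(src.readline())
            for line in src:                  # one raster row at a time
                out = (transform(float(v)) for v in line.split())
                dst.write(" ".join(str(v) for v in out) + "\n")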

The large number of files was managed using a standard file naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.

The high-level architecture was designed after the proper steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation and (5) application. Job files were constructed for steps that interfaced each of these modeling stages. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas, as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance around this scaling is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States; patterns of location error differed from those of quantity, being lower in the east (Fig. 12). The location of errors could be important, too, if they affect the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using an HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for each unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array in the C environment.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into an HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. In: HPDC 1992, pp. 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Land Use Modeling Workshop, USGS EROS Data Center.


previous urban (Fig. 2). Outputs are binary values of change (1) and no change (0) in observed land use maps. Input values are fed through a hidden layer with the number of nodes equal to that of the inputs (see Pijanowski et al., 2002a; Mas et al., 2004). The ANN uses learning rules to determine the weights and the values for bias and activation function that fit the input and output values (Fig. 2) of a dataset. Delta rules are used to adjust all of these values across successive passes of the data; each pass is called a cycle. A mean square error is calculated for each cycle from a back propagation algorithm (Bishop, 1995; Dlamini, 2008); values for weights, bias and activation function are then adjusted, and training is stopped after a global minimum MSE is reached. The process of cycling through the data is called training. In the LTM, we use a small, randomly selected subsample (between 5 and 10%) of the data to train. Applying the weights, values for bias and the activation functions from a training run to another dataset that contains inputs only, in order to estimate output, is referred to as testing. We conduct a double phase testing with the LTM (Tayyebi et al., 2012) at large scales (e.g., the conterminous USA). The first phase of testing is to use the weights, bias and activation values saved from the training of the subsample and apply the values to the entire dataset. A set of goodness of fit statistics is generated between the predicted and observed maps; we also refer to this testing phase as model calibration. Model calibration also involves a "hold one out" procedure where each input data layer is held back from the same testing dataset and the goodness of fit of the reduced input models is compared against the full complement model (see Pijanowski et al., 2002a). Thus, for the LTM simulation below, we used five input maps to predict one map of urban change. We held one out at a time to produce five input map models with the same urban change map. These reduced input models are compared with the full complement model of five input maps.

We follow the recommendation of Pontius et al (2004) and

Pontius and Spencer (2005) of validating the model e our second

phase of testing e which is done with a different dataset than what

is used for the 1047297rst phase of testing This independent dataset can

be another land use map that was derived from a different source(ie test of generalization) or another year (ie test of predictive

ability) It is typical practice (Bishop 1995) to use different data for

training and testing which is done here
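For illustration, the training computation can be sketched with a small feedforward network in Python; this is a minimal, self-contained stand-in for the SNNS routines described above (synthetic data and our own variable names), not the LTM code:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in = 5                                    # five input drivers, as in the LTM runs below
    X = rng.random((200, n_in))                 # inputs standardized to 0.0-1.0
    y = (X[:, 2] < 0.3).astype(float)[:, None]  # toy change (1) / no change (0) output

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    W1 = rng.uniform(-1, 1, (n_in, n_in))       # hidden layer with as many nodes as inputs
    b1 = np.zeros(n_in)
    W2 = rng.uniform(-1, 1, (n_in, 1))
    b2 = np.zeros(1)
    lr = 0.5

    for cycle in range(1, 5001):                # one full pass over the data is a cycle
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out - y
        mse = float(np.mean(err ** 2))
        d_out = err * out * (1 - out)           # delta rule: back-propagate the error
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ d_out) / len(X)
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * (X.T @ d_hid) / len(X)
        b1 -= lr * d_hid.mean(axis=0)
        if cycle % 100 == 0:                    # follow the MSE every 100 cycles
            print(cycle, mse)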

Forecasting is accomplished using a quantity model developed from per capita land use growth rates and a population growth estimate model (cf. Pijanowski et al., 2002a; Tayyebi et al., 2012). The quantity model can be applied across the entire simulation domain with one quantity estimate per time step, or it can be applied to smaller spatial units across the simulation domain, as done here.
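As a sketch of the quantity model's arithmetic (the per-capita form below is our reading of the description above; the function and the numbers are illustrative assumptions, not the published quantity model):

    def cells_to_grow(urban_t1, urban_t2, pop_t1, pop_t2, pop_future):
        # New urban cells per new person, estimated from the two historical maps,
        # applied to the projected population of one spatial unit and time step.
        per_capita = (urban_t2 - urban_t1) / (pop_t2 - pop_t1)
        return max(0, round(per_capita * (pop_future - pop_t2)))

    # 12,000 new cells for 40,000 new people is 0.3 cells/person, so 25,000 more
    # people by the forecast year yields 7500 cells to transition.
    print(cells_to_grow(50_000, 62_000, 180_000, 220_000, 245_000))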

2.2. High Performance Computing (HPC)

High Performance Computing (HPC) integrates computer architecture design principles, operating system software, heterogeneous hardware components, programs, algorithms and specialized computational approaches to address the handling of tasks not possible or practical with a single computer workstation (Foster and Kesselman, 1997; Foster et al., 2002). A self-contained HPC (i.e., a group of computers) is often referred to as a high performance compute cluster (HPCC) (cf. Cheung and Reeves, 1992; Buyya, 1999; Reinefeld and Lindenstruth, 2001). A main feature of HPCs is the integration of hardware and software systems that are configured to parse large processing jobs into smaller parallel tasks. Hardware resources can be managed at the level of cores (a single processing unit capable of performing work), sockets (a group of cores that have direct access to memory) and nodes (individual servers or computers that contain one or more sockets). The HPCC employed here is specifically configured to control the execution of several batch files, executable programs and scripts for thousands of input and output data files. An HPCC is managed by an administrator, with hardware and software services accessible to many users. HPCCs are systems smaller than supercomputers, although the terms HPC and supercomputer are often used interchangeably.

We controlled all our LTM-HPC programs on the HPCC using the Windows HPC 2008 Server R2 job manager, which has features common to all job managers. The server job manager contains: (1) a job scheduler service that is responsible for queuing jobs and tasks, allocating resources, dispatching tasks to the compute nodes, and monitoring the status of jobs, tasks and nodes; (2) a job description file, configured as an Extensible Markup Language (XML) file, listing job and task specifications; (3) a job, which is a resource request submitted to the job scheduler service that assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs can also be submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.

Fig. 2. Structure of an artificial neural network. Inputs (slope, distance to stream, distance to urban, distance to primary road, distance to secondary road) feed hidden nodes and an output node (presence = 1 or absence = 0 of a land use transition); weights, values for bias and an activation function estimate the output, error from observed data is back-propagated, and a forward and backward pass is called a cycle or epoch.

Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; parameters are passed to the HPC Server for project name, user name, job type, and the types and level of hardware resources for the job, etc. Tasks are listed after, as dependencies of the higher-level job; tasks here contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).
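Because we later generate one such file per state with a Python script (Section 3.7), a minimal sketch of that generation step follows; the element and attribute names are illustrative assumptions, not the exact Windows HPC Server 2008 R2 schema:

    import xml.etree.ElementTree as ET

    def make_job_xml(state_fips, places, out_path):
        # Job-level parameters (project, resource levels) sit at the highest
        # level; one task per census place is listed beneath, as described above.
        job = ET.Element("Job", Name=f"Forecast_{state_fips:02d}",
                         Project="LTM-HPC", MinCores="6", MaxCores="12")
        tasks = ET.SubElement(job, "Tasks")
        for place in places:
            ET.SubElement(tasks, "Task", Name=f"place_{place}",
                          CommandLine=f"python forecast.py --in input\\{place}.asc "
                                      f"--out output\\{place}.asc")
        ET.ElementTree(job).write(out_path, xml_declaration=True, encoding="utf-8")

    make_job_xml(18, ["1882862"], "job_18.xml")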

3. Architecture of the LTM and LTM-HPC

3.1. Main components

Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., with larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single-workstation LTM) and the features that were necessary to reconfigure it for an HPCC. Several different kinds of programming environments comprise the single-workstation LTM. The first is command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the .BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment that comprises the LTM is compiled programs written to accept environment variables as inputs. These programs are written in the C or C# programming language as standalone .EXE files to be executed at the command line; the environment variables for these programs are often the locations and names of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of programming environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these files are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.

Fig. 4 shows the sequence of batch routines, programs and scripts that comprise the LTM, currently organized into six main model components: data preparation, pattern recognition, calibration, validation, forecasting and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.

Fig. 3. XML job file illustrating the syntax for job parameters, task parameters and task commands.

Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning; for interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

3.2. Data preparation

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4, item 1); as this is done once for each simulation, this LTM component is not automated.

A C program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to a neural net file called a pattern file (Fig. 4, item 3), given a .PAT extension; the data are transposed into the ANN structure. The data necessary to process files for the training run of the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shapefiles (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing, to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that are converted for use in pattern recognition, model calibration and model forecasting; the coding of locations with a '4' becomes more obvious below, under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance of each pixel from the nearest road. All GIS data for use in the LTM are written out as ASCII flat files (Fig. 5A).
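As a sketch of this step (assuming arcpy with the Spatial Analyst extension; the layer and file names are illustrative):

    import arcpy
    from arcpy.sa import EucDistance

    arcpy.CheckOutExtension("Spatial")
    arcpy.env.snapRaster = "base_landuse"   # base raster fixes cell size, rows and columns
    arcpy.env.extent = "base_landuse"
    arcpy.env.cellSize = "base_landuse"

    dist = EucDistance("roads", maximum_distance=250000)  # distances capped (see Section 4.1)
    arcpy.RasterToASCII_conversion(dist, "dist_roads.asc")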

Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are estimated by the neural net in the testing phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).

A key component of the LTM is converting data from a GIS-compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in each input map for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe checks the headers of all maps to ensure that they are of the same dimensions.
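A minimal Python sketch of what createpat.exe does follows (the file names and row layout are assumed; real SNNS pattern files also carry a header, and NODATA handling is omitted):

    import random

    def read_asc(path):
        with open(path) as f:
            for _ in range(6):                  # skip ncols/nrows/xll/yll/cellsize/NODATA
                next(f)
            return [float(v) for line in f for v in line.split()]

    def write_pat(input_paths, change_path, exclusion_path, out_path, sample=1.0):
        grids = [read_asc(p) for p in input_paths]
        maxes = [max(g) or 1.0 for g in grids]  # per-map maxima standardize values to 0.0-1.0
        change = read_asc(change_path)
        excl = read_asc(exclusion_path)
        with open(out_path, "w") as out:
            for i in range(len(excl)):
                if excl[i] == 4:                # coded 4: location cannot transition
                    continue
                if random.random() > sample:    # subsample locations for training
                    continue
                row = [g[i] / m for g, m in zip(grids, maxes)]
                out.write(" ".join(f"{v:.6f}" for v in row) + f" {int(change[i])}\n")

    write_pat(["slope.asc", "dist_roads.asc"], "change.asc", "exclusion.asc",
              "train.pat", sample=0.05)         # e.g., a 5% training subsample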

3.3. Pattern recognition

SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options, which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010); the values for initial weights (cf. Denoeux and Lengellé, 1993); the names of the pattern files for input and output; the filename containing the network values; and a set of start and stop conditions (e.g., a stop condition can be set if an MSE value or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias and activation function values are saved every N cycles. An MSE equal to 0.0 is a condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases. Pseudo-code for train.bat is:

    loadNet("ltm.net")
    loadPattern("train.pat")
    setInitFunc("Randomize_Weights", 1.0, -1.0)
    setShuffle(TRUE)
    initNet()
    while MSE > 0.0 and CYCLES < 500000 do
        trainNet()
        # report the MSE every 100 cycles
        if CYCLES mod 100 == 0 then
            print(CYCLES, " ", MSE)
        endif
        # save a network file so training can be analyzed later
        if CYCLES == 100 then
            saveNet("100.net")
        endif
    endwhile

We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS-compatible format (Fig. 4, item 11). The test .PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a .RES (or result) file (Fig. 4, item 9); the .RES file contains the estimates of output created by the neural network. The .RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values from the .RES files are between 0.0 and 1.0; convert_ascii.exe stores values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the .RES file values by 100,000, so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files through ArcGIS 10.0 are unstable).
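The scaling amounts to the following (a minimal sketch; .RES parsing is omitted):

    def scale_res_values(values):
        # Map suitability values in [0.0, 1.0] to integers in [0, 100000] so the
        # ArcGIS raster can be stored as integer rather than floating point.
        return [int(round(v * 100000)) for v in values]

    assert scale_res_values([0.0, 0.37219, 1.0]) == [0, 37219, 100000]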

We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set distance to roads is held out, and the result is compared to training with all inputs. Thus, if we start with a five-input-variable neural network model, we hold one input out at a time, create calibration time step maps for each, and save error files over the training cycles for each of the reduced-input-variable models.

3.4. Calibration

For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value. MSE is calculated from the difference between the estimate produced by the neural network (range 0.0-1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics is created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, where 0 = correctly predicted no change and 3 = correctly predicted change; values of 1 and 2 represent the two kinds of error (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate: (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric (PCM; Pijanowski et al., 2002a); (2) sensitivity (the true positive rate) and specificity (the true negative rate); and (3) scaleable PCM values across different window sizes.

Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.

Fig. 6 shows how scaleable PCM values are calculated using a 0/1/2/3/4-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered a correct prediction at that scale and window; the number of 3s is incremented by one for every such pair. The window is then moved one position to the right (Fig. 6B), and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N × N window is passed across the entire simulation area and the final number of 3s recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 × 5 is a 7 × 7) and, after all of the windows are considered in the map, the process is repeated (note that the count of 3s is reset to the number of 3s in the entire calibration map) and the number of 3s saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m resolution, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced-input model in order to determine the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.

Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part A calculates the total number of true positives (coded as 3s); the window is then moved one position to the right (Part B) and pairs of 1s and 2s are added to the total number of 3s; this moving window is passed across the entire area and the final number of 3s recorded (Part C); the window size is then incremented by 2 and the process repeated; Part D gives PCM across scaleable window sizes.
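The moving-window pairing can be sketched in Python (a minimal sketch of the Fig. 6 procedure; the array handling and the random test grid are ours, not the LTM code):

    import numpy as np

    def scaleable_pcm(calib, n):
        # Pair a false negative (1) with a false positive (2) inside each n x n
        # window and count the pair as a correct prediction; zero paired cells
        # so they are never counted twice as the window moves.
        grid = calib.copy()
        threes = int(np.count_nonzero(grid == 3))       # true positives to start
        half = n // 2
        rows, cols = grid.shape
        for r in range(rows):
            for c in range(cols):
                win = grid[max(0, r - half):r + half + 1,
                           max(0, c - half):c + half + 1]   # a view into grid
                ones, twos = np.argwhere(win == 1), np.argwhere(win == 2)
                for (i1, j1), (i2, j2) in zip(ones, twos):
                    win[i1, j1] = win[i2, j2] = 0
                    threes += 1
        observed = int(np.count_nonzero((calib == 1) | (calib == 3)))
        return 100.0 * threes / observed if observed else float("nan")

    calib = np.random.default_rng(1).choice([0, 1, 2, 3, 4], size=(60, 60))
    for n in range(3, 12, 2):                           # window sizes 3, 5, ..., 11
        print(n, round(scaleable_pcm(calib, n), 1))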

The final step of calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change, and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or dropping drivers that reduce model goodness of fit. However, if the selected drivers provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.

3.5. Validation

We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent dataset across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model for comparison with a simulated map. Validation metrics (Fig. 4, item 21) include the same ones used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).

3.6. Forecasting

We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds or ecoregions, or any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel, employing algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) scaling the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.

We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file; rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank; the original order is maintained using an index for each record. The probability values with the highest ranks are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while the other cells (code 0) remain unchanged. A separate GIS map (Fig. 4, item 25) may be created that applies additional exclusionary rules to create an alternative scenario. Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
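The rank-and-threshold conversion can be sketched as follows (a minimal Python sketch of the step the C program performs; the array and cell count are illustrative):

    import numpy as np

    def grow_urban(prob, n_cells):
        # Convert the highest-ranked suitability values to urban (1) until the
        # quantity model's cell count for this unit and time step is met.
        out = np.zeros(prob.size, dtype=np.uint8)
        order = np.argsort(prob.ravel())[::-1]  # rank cells, highest suitability first
        out[order[:n_cells]] = 1
        return out.reshape(prob.shape)

    prob = np.array([[0.91, 0.10], [0.44, 0.78]])
    print(grow_urban(prob, 2))                  # marks the 0.91 and 0.78 cells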

3.7. HPC job configuration

We developed a coding schema for the purposes of running the simulation across multiple locations. We used a standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns states a two-digit code and a county in those states a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code: two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
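A small sketch of composing these identifiers (the helper names are ours):

    def county_id(state: int, county: int) -> str:
        return f"{state:02d}{county:03d}"       # 2-digit state + 3-digit county

    def place_id(state: int, place: int) -> str:
        return f"{state:02d}{place:05d}"        # 2-digit state + 5-digit place

    assert county_id(18, 157) == "18157"        # Tippecanoe County, Indiana
    assert place_id(18, 82862) == "1882862"     # West Lafayette, Indiana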

Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any one of the places would stop the one large job and require resuming the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs, such that the number of jobs and associated XML files is still manageable; a failure of one census place then requires less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator for both to assign names for input and output files.

Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide a job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 shows a sample Windows job manager interface for mosaicking over 20,000 places: each top line (item 1) represents an XML for a region (state) with its status (item 2); core resources are shown (item 3); and a tab (item 4) displays the status of each task (item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.

We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shapefile attribute table and name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to: (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map; (2) use different subsetting and mosaicking spatial units (as we did below); and (3) forecast at different spatial units (we also illustrate this below).
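A minimal sketch of the mosaicking step (assuming arcpy; the function, wildcard and path names are illustrative, not the scripts' actual contents):

    import arcpy

    def mosaic_state(state_fips, workspace, out_dir):
        # Gather one state's per-place suitability grids by their FIPS-prefixed
        # file names and mosaic them into a single state raster.
        arcpy.env.workspace = workspace
        rasters = arcpy.ListRasters(f"{state_fips:02d}*")
        arcpy.MosaicToNewRaster_management(
            rasters, out_dir, f"suitability_{state_fips:02d}.tif",
            pixel_type="32_BIT_UNSIGNED", number_of_bands=1)

    mosaic_state(18, r"D:\ltm\output\places", r"D:\ltm\output\states")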

4. Execution of the LTM-HPC

4.1. Hardware and software description

We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, about a dozen steps, occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.

Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double precision integer; given the large size of each dataset, we limited the distances to 250 km. Once the data were processed on the workstation, files were moved to the storage server.

The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual-core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, with one used for cluster communication and the other for external cluster communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit; it is composed of nodes, single physical or logical computers with one or more cores that include one or more processors. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 Gigabit bandwidth.

Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with its status (item 2); core resources are shown in item 3; item 4 displays the status of each task (item 5) within a job.

The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.

Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories; subfolders store data by type (.ASC or .PAT files), location, unit scale (national, state) and, for forecasts, by year and scenario.

Fig. 9. Directory structure for the LTM-HPC simulation.

Fig. 8. Computer systems involved in the LTM-HPC national simulations.
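A sketch of building such a hierarchy (the folder names are assumed, following Fig. 9):

    from pathlib import Path

    root = Path(r"D:\ltm")
    for io_dir in ("input", "output"):          # upper branches: input vs. output
        for ftype in ("asc", "pat"):            # subfolders by file type
            for scale in ("national", "state"):
                (root / io_dir / ftype / scale).mkdir(parents=True, exist_ok=True)
    for year in range(2010, 2061, 10):          # forecast outputs by year and scenario
        (root / "output" / "forecast" / str(year) / "baseline").mkdir(
            parents=True, exist_ok=True)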


4.2. Preliminary tests

The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the .RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process; we used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.

4.3. Data preparation for the national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shapefiles, highways and main arterials, and used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. The inputs were distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads, and slope.

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data; we used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with its FIPS-designated census place value.

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as .ASC files. The createpat program was executed per census place to convert the files from .ASC to .PAT.

4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (.NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate the output (i.e., the potential for a land use change to occur at each location) for testing.

Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500.NET file was applied to each tile's .PAT file to create a .RES file for each tile. .RES files contain estimates of the potential for each location to change land use (values 0.0-1.0, where a value closer to 1.0 means a higher chance of changing). The .RES files are converted to .ASC files using a C program called convert2ascii.exe. The .ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integer.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained on six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of the MSE from the full-input-variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that: (1) all five variables contribute in a positive way toward the goodness of fit; and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0/1/2/3/4-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0/1/2/3/4-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0/1/2/3/4-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.

The second goodness of fit assessment highlights the use of the HPC for a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0/1/2/3/4-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 101 × 101 window size (i.e., approximately 3 km by 3 km), and the PCM values were merged with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver); improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

Fig. 11. Drop-one-out percent difference in MSE from the full-driver model.

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as separate tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files: if any task in a job fails, the standard out and standard error files provide a record of what the program did during execution. We embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to topographic limitations in that large metropolitan area, whereas dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models, such as climate change models (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas, we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array; this was needed despite the large amount of memory contained in the HPC.
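A Python sketch of the same row-at-a-time pattern (the file name and tally are illustrative):

    def stream_rows(path):
        # Yield one grid row at a time instead of loading the whole array.
        with open(path) as f:
            header = dict(line.split() for _, line in zip(range(6), f))  # ncols, nrows, ...
            for line in f:
                yield [float(v) for v in line.split()]

    total = urban = 0
    for row in stream_rows("forecast_2050.asc"):
        total += len(row)
        urban += sum(1 for v in row if v == 1)  # e.g., tally forecast urban cells
    print(urban, total)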

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.

The high-level architecture was designed after the proper steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004, 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed for the steps that interfaced each of these modeling stages. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail for one of the following reasons. (1) One or more tasks in the job have failed: one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas, as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process; a short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores, with variance in processing time caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at large extents and coarse resolution, or alternatively at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of spatially explicit environmental impact models. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents in aggregate but are individually relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or of a threatened category (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support the national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem Conference (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and (5) measures of data input quality. We were able to show that, at the 3 km scale, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, whereas patterns of location error differed from those of quantity: location errors were lower in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change on stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness-of-fit metrics have not been designed for large big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for each unit in the Python environment.

(2) Better memory usage: the source code for the model in the C# environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for each region in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. In: HPDC 1992, pp. 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. Geoforum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Land Use Modeling Workshop, USGS EROS Data Center.


assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and software resources assigned for each task. Many jobs, and all tasks, are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs can also be submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.

Fig. 3 shows sample lines from an XML job description file, used below to create forecast maps by state. Note that the highest level contains job parameters; parameters are passed to the HPC Server for project name, user name, job type, types and level of hardware resources for the job, etc. Tasks are listed afterwards as dependencies of the higher-level job; tasks contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).
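Because we generated these XML job files with Python scripts (Section 3.7), a minimal illustrative generator is sketched below. The element and attribute names (Job, Tasks, Task, CommandLine, etc.) only approximate the Windows HPC Pack job schema, and the script name and file paths are hypothetical placeholders.

# Illustrative sketch: build a job XML with one forecasting task per state.
# Element/attribute names approximate the Windows HPC Pack schema;
# ltm_forecast.py and the paths are hypothetical placeholders.
import xml.etree.ElementTree as ET

def build_job_xml(state_fips, out_path):
    job = ET.Element("Job", Name="LTM_forecast_" + state_fips,
                     Project="LTM-HPC", MinCores="1", MaxCores="6")
    tasks = ET.SubElement(job, "Tasks")
    ET.SubElement(tasks, "Task",
                  Name="forecast_" + state_fips,
                  CommandLine="python ltm_forecast.py --in D:\\input\\"
                              + state_fips + ".asc --out D:\\output\\"
                              + state_fips + ".asc",
                  MinCores="1", MaxCores="1")
    ET.ElementTree(job).write(out_path, xml_declaration=True)

build_job_xml("18", "job_18.xml")  # one job per state, e.g., Indiana (FIPS 18)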

3. Architecture of the LTM and LTM-HPC

3.1. Main components

Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., with larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single workstation LTM) and the features that were necessary to reconfigure it for an HPCC. Several different kinds of programming environments comprise the single workstation LTM. The first are command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the .BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment comprising the LTM consists of compiled programs written to accept environment variables as inputs. These programs are written in the C# or C++ programming language as standalone .EXE files to be executed at the command line. The environment variables for these programs are often the locations and names of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of programming environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these files are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.

Fig. 4 shows the sequence of batch routines, programs and scripts that comprise the LTM, currently organized into six main model components: data preparation, pattern recognition, calibration, validation, forecasting and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use/cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.

Fig. 3. XML job file illustrating the syntax for job parameters, task parameters and task commands.

Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.2. Data preparation

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4, item 1); as this is done once for each simulation, this LTM component is not automated. A C# program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files, called pattern files (Fig. 4, item 3) and given a .PAT extension; data are transposed into the ANN structure. The model inputs necessary to prepare files for the neural net training run are two land use maps separated by approximately 10 years or more and a map of locations that need to be excluded from the neural net simulation. Vector shape files (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set the ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing, ensuring that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban change, for example, areas that are in protected areas (e.g., public parks), open water, or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that are converted for use in pattern recognition, model calibration and model forecasting. The reason for coding locations with a '4' becomes more obvious below in the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance of each pixel from the nearest road. All GIS data for use in the LTM are written out as ASCII flat files (Fig. 5A).

Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are estimated by the neural net in the testing phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).

A key component of the LTM is converting data from a GIS-compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in each input map for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe also checks the headers of all maps to ensure that they are of the same dimensions.
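To make the transposing and standardization concrete, the following is a minimal sketch of the createpat logic in Python (our hypothetical re-implementation, not the compiled program itself); it assumes ESRI ASCII grids with six header rows and an SNNS-style pattern header.

# Hypothetical Python sketch of createpat.exe: read driver grids and an
# exclusionary grid, min-max standardize inputs to 0.0-1.0, drop excluded
# (code 4) cells, and write an SNNS-style pattern file.
import numpy as np

def write_pattern_file(driver_files, exclusion_file, out_pat, change_file=None):
    drivers = [np.loadtxt(f, skiprows=6) for f in driver_files]  # skip ESRI header
    exclusion = np.loadtxt(exclusion_file, skiprows=6)
    output = np.loadtxt(change_file, skiprows=6) if change_file else None

    keep = exclusion.ravel() != 4                        # drop non-convertible cells
    cols = [d.ravel()[keep] / d.max() for d in drivers]  # standardize by map maximum

    with open(out_pat, "w") as pat:
        pat.write("SNNS pattern definition file V3.2\n")
        pat.write("No. of patterns : %d\n" % keep.sum())
        pat.write("No. of input units : %d\n" % len(cols))
        pat.write("No. of output units : 1\n\n")
        for i in range(int(keep.sum())):
            pat.write(" ".join("%.4f" % c[i] for c in cols))
            # training pattern: observed change (0/1); testing: placeholder 0
            pat.write("\n%d\n" % (output.ravel()[keep][i] if output is not None else 0))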

3.3. Pattern recognition

SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of the data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010); the values for initial weights (cf. Denoeux and Lengellé, 1993); the names of the pattern files for input and output; the filename containing the network values; and a set of start and stop conditions (e.g., a stop condition can be set if a given MSE or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias and activation function values are saved every N cycles. An MSE equal to 0.0 is the condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after fewer than 100,000 cycles in most cases.

Pseudo-code for train.bat is:

loadNet("ltm.net")
loadPattern("train.pat")
setInitFunc("Randomize_Weights", 1.0, -1.0)
setShuffle(TRUE)
initNet()
while MSE > 0.0 and CYCLES < 500000 do
    trainNet()
    if CYCLES mod 100 == 0 then
        print(CYCLES, " ", MSE)
    endif
    if CYCLES == 100 then
        saveNet("100.net")
    endif
endwhile

We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS-compatible format (Fig. 4, item 11). The test .PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe. Output from this execution is called a .RES (or result) file (Fig. 4, item 9). The .RES file contains the estimates of output created by the neural network. The .RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C# program. All values in the .RES files are between 0.0 and 1.0; convert_ascii.exe stores the values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000, multiplying the .RES file values by 100,000 so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files through ArcGIS 10.0 are unstable).
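A minimal sketch of this conversion step (a hypothetical Python equivalent of convert_ascii.exe, assuming the .RES activations have already been read into a flat array and the six-line ESRI ASCII header of the source grid is available):

import numpy as np

def res_to_ascii(values, header_lines, out_asc, ncols):
    # Scale activations (0.0-1.0) to integers 0-100,000 so ArcGIS can
    # store the suitability map as an integer raster.
    scaled = np.rint(np.asarray(values) * 100000).astype(int)
    with open(out_asc, "w") as asc:
        asc.writelines(header_lines)  # ncols, nrows, xllcorner, yllcorner, cellsize, NODATA
        for row in scaled.reshape(-1, ncols):
            asc.write(" ".join(str(v) for v in row) + "\n")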

We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and the result is compared to training with all inputs. Thus, if we start with a five-input-variable neural network model, we hold one input out at a time, create calibration time step maps for each reduced model, and save error files over the training cycles for each of the reduced input variable models.

Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.

3.4. Calibration

For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value. MSE is calculated from the differences between the estimates produced by the neural network (range 0.0-1.0) and the observed values of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles.
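In standard form (a textbook definition, consistent with the description above), for $N$ sampled locations with network estimate $\hat{y}_i \in [0,1]$ and observed change $y_i \in \{0,1\}$:

\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
\]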

The second set of goodness-of-fit metrics comprises those created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for the assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded so that 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, where 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent the two kinds of errors (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate: (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric (PCM; Pijanowski et al., 2002a); (2) sensitivity and specificity (which account for the false negatives and false positives, respectively); and (3) scaleable PCM values across different window sizes.
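The coding and the PCM can be sketched compactly (a sketch assuming the three maps are loaded as equally shaped integer arrays; the function names are ours):

import numpy as np

def calibration_map(observed01, predicted02, exclusion04):
    # Summing the coded layers yields: 0 = correct no-change,
    # 1 = omission (false negative), 2 = commission (false positive),
    # 3 = correct change, >= 4 = excluded from the analysis.
    return observed01 + predicted02 + exclusion04

def pcm(calib):
    # Percent correct metric: correctly predicted change cells as a
    # percentage of all observed change cells (codes 1 and 3).
    tp = np.count_nonzero(calib == 3)
    fn = np.count_nonzero(calib == 1)
    return 100.0 * tp / (tp + fn)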

Fig. 6 shows how scaleable PCM values are calculated using a 0/1/2/3/4-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale; within the window, the count of 3s is incremented by one for every such pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B), and pairs of 1s and 2s are again added to the total count of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N × N window is passed across the entire simulation area and the final count of 3s recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 × 5 is a 7 × 7), and the process is repeated after all of the windows in the map are considered (note that the running count of 3s is reset to the number of 3s in the entire calibration map), with the count of 3s saved for each window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced input model in order to compare the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.

Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
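The pairing logic can be sketched for a single window size as follows (one reasonable reading of the procedure above, not the authors' compiled implementation; each error cell participates in at most one omission/commission pair):

import numpy as np

def scaleable_pcm(calib, win):
    # Running count starts from the exact-location true positives.
    hits = np.count_nonzero(calib == 3)
    counted = np.zeros(calib.shape, dtype=bool)   # 1s/2s already paired
    rows, cols = calib.shape
    for r in range(rows - win + 1):
        for c in range(cols - win + 1):
            block = calib[r:r + win, c:c + win]
            used = counted[r:r + win, c:c + win]  # view; updates propagate
            ones = np.argwhere((block == 1) & ~used)
            twos = np.argwhere((block == 2) & ~used)
            for (r1, c1), (r2, c2) in zip(ones, twos):  # pair omission/commission
                used[r1, c1] = True
                used[r2, c2] = True
                hits += 1
    observed_change = np.count_nonzero((calib == 1) | (calib == 3))
    return 100.0 * hits / observed_change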

The final step of calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change, together with an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or drop drivers that reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.

3.5. Validation

We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model for comparison with a simulated map. Validation metrics (Fig. 4, item 21) include the same metrics used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that, because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).

3.6. Forecasting

We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds or ecoregions, or at any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.

We next use a program written in C# to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map), using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The cells with the highest-ranked probability values are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while all other cells (code 0) remain without change.
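A sketch of this rank-based allocation (assuming a probability array from which excluded and already-urban cells have been masked out, and a per-unit quota n_cells taken from the CSV):

import numpy as np

def allocate_urban(prob, n_cells):
    # Highest-suitability cells transition first; the original cell order
    # is preserved by sorting indices rather than the values themselves.
    change = np.zeros(prob.size, dtype=np.int8)
    top = np.argsort(prob.ravel())[::-1][:n_cells]  # indices of top-ranked cells
    change[top] = 1
    return change.reshape(prob.shape)

Ties at the cutoff probability are broken arbitrarily here; counting cells per probability value, as the LTM-HPC does, makes tie handling explicit.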

A separate GIS map (Fig. 4, item 25) may be created to apply additional exclusionary rules and create an alternative scenario. Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).

3.7. HPC job configuration

We developed a coding schema for the purpose of running the simulation across multiple locations. We used the standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns each state a two-digit code and each county within a state a three-digit code; a specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code, two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).

Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any one place would stop the single large job and require resuming the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs, where the number of jobs and associated XML files is still manageable. A failure of one census place then requires less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designators of both to assign names for input and output files.

Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets; another for mosaicking smaller files into one large file; one for controlling the calibration programs; another for creating forecast maps; and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide a job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 shows a sample Windows job manager interface for mosaicking over 20,000 places. Each top line (Fig. 7, item 1) represents an XML for a region (state) with its status (item 2). Core resources are shown in Fig. 7, item 3. A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language could be used.

We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to: (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map; (2) allow the researcher to use different subsetting and mosaicking spatial units (as we do below); and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
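The polygon_mosaic.py script itself is not reproduced in the paper; a minimal sketch of the same idea with the ArcGIS 10 Python (arcpy) API, assuming per-unit ASCII grids named by FIPS code, might look like this:

# Hedged sketch (not the authors' script): mosaic per-unit suitability
# grids, named by FIPS code, into one state-level raster with arcpy.
import arcpy

def mosaic_state(fips_codes, in_folder, out_folder, state_fips):
    rasters = ["%s\\suit_%s.asc" % (in_folder, fips) for fips in fips_codes]
    arcpy.MosaicToNewRaster_management(
        ";".join(rasters),           # input rasters
        out_folder,                  # output workspace
        "suit_%s.tif" % state_fips,  # output raster name
        pixel_type="32_BIT_UNSIGNED",
        number_of_bands=1)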

4. Execution of the LTM-HPC

4.1. Hardware and software description

We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, about a dozen steps, occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.

Data preparation was performed on a high-end Windows 7 Enterprise 64-bit workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. The specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were calculated using the Euclidean Distance tool in ArcGIS, setting all output to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.

The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, with one used for intra-cluster communication and the other for external communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit; it is composed of nodes, single physical or logical computers with one or more cores that include one or more processors. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 gigabit of bandwidth.

Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with the status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.

The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.

Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
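A sketch of such a naming scheme (illustrative; the actual folder names in Fig. 9 may differ):

import os

def asc_path(root, io, scale, unit_id, year=None, scenario=None):
    # e.g., <root>/output/asc/state/2050/baseline/18.asc (hypothetical layout)
    parts = [root, io, "asc", scale]
    if year is not None:
        parts += [str(year), scenario or "baseline"]
    parts.append(str(unit_id) + ".asc")
    return os.path.join(*parts)

print(asc_path("D:\\ltm", "output", "state", 18, 2050))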

Fig. 9. Directory structure for the LTM-HPC simulation.

Fig. 8. Computer systems involved in the LTM-HPC national simulations.


4.2. Preliminary tests

The primary limitation on file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the .RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed the program. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation per core, we found that running 24 operations on one machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.

4.3. Data preparation for the national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads, and slope.
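For example, a distance-to-roads driver of this kind can be generated with the Spatial Analyst Python API (a sketch assuming an ArcGIS 10 licence with Spatial Analyst; file names are illustrative):

# Hedged sketch: Euclidean distance to roads, capped at 250 km as in the
# data preparation described above. Paths and file names are illustrative.
import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")
arcpy.env.cellSize = 30                      # 30 m national grid
dist = EucDistance("highways.shp", maximum_distance=250000)
arcpy.sa.Int(dist).save("dist_highway.tif")  # store as integer, not floating point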

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from the states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as .ASC files. The createpat program was executed per census place to convert the files from .ASC to .PAT.

4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving its value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (.NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate the output (i.e., the potential for a land use change to occur at each location) for testing.

Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500.net file was applied to each tile's .PAT file to create a .RES file for each tile. The .RES files contain estimates of the potential for each location to change land use (values 0.0-1.0, where values closer to 1.0 mean a higher chance of changing). The .RES files were converted to .ASC files using a C# program called convert2ascii.exe. The .ASC probability maps for all tiles were then mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the choice of processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full input variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves between 0 cycles and approximately cycle 23,000: the neural network makes large adjustments in the weights and in the values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highways contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute in a positive way toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0/1/2/3/4-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0/1/2/3/4-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0/1/2/3/4-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness-of-fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles, along coastal areas, contained errors in the quantity of urban greater than 5%.

The second goodness-of-fit assessment highlights the use of the HPC for a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0/1/2/3/4-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 10 × 10 window size (i.e., 3 km by 3 km), and PCM values were merged with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.

4.7. Forecasting

Forecasting requires merging the suitability map and the

quantity model We used several XML jobs 1047297les to construct the

forecast maps at the national scale We developed our quantity

model (Tayyebi et al 2012) that contained the number of urban

cells to grow for each polygon for 10-year time steps from 2010 to

2060 We considered each state as a job and including all the

Fig 12 Validation metrics of (A) quantity errors and (B) model goodness of 1047297t (PCM) for scaleable window size of 3 3 km

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 263

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1519

polygons within the state as separate tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what each program did during execution. We embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
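For illustration, a Python generator of such a jobs file might look as follows. The element and attribute names (Job, Tasks, Task, CommandLine, StdOutFilePath, StdErrFilePath) follow our reading of the Windows HPC Server 2008 job description schema, and the executable name and UNC paths are hypothetical:

    import xml.etree.ElementTree as ET

    def write_state_job(state_fips, polygon_ids, out_xml):
        """One HPC job per state; one prediction task per polygon."""
        job = ET.Element("Job", {"Name": "Pred_" + state_fips})
        tasks = ET.SubElement(job, "Tasks")
        for pid in polygon_ids:
            base = r"\\storage\ltm\output\pred" + "\\%s\\%s" % (state_fips, pid)
            ET.SubElement(tasks, "Task", {
                "Name": "pred_" + pid,
                "CommandLine": "predict.exe %s.asc %s_forecast.asc" % (pid, pid),
                "WorkDirectory": base,
                "StdOutFilePath": base + r"\stdout.txt",
                "StdErrFilePath": base + r"\stderr.txt",
            })
        ET.ElementTree(job).write(out_xml)

    # e.g., write_state_job("18", ["1882862", "1836003"], "XML_Pred_18.xml")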

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic limitations of this large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.
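The per-polygon conversion inside the prediction step follows the rank-based rule used by the LTM: the highest-suitability convertible cells are turned to urban until the quantity model's count for that polygon and time step is met. A NumPy sketch of this rule (our paraphrase of the C program, not its source):

    import numpy as np

    def probability_to_forecast(prob, n_new_cells, excluded=None):
        """Binary change map: convert the n highest-suitability cells."""
        flat = prob.astype(float).ravel()
        if excluded is not None:
            flat[excluded.ravel()] = -np.inf   # excluded cells never convert
        order = np.argsort(-flat)              # descending suitability
        order = order[np.isfinite(flat[order])]
        change = np.zeros(flat.size, dtype=np.uint8)
        change[order[:n_new_cells]] = 1        # 1 = new urban, 0 = no change
        return change.reshape(prob.shape)

In the LTM-HPC, the cell count for each polygon and decade comes from the quantity model's CSV table, keyed by the polygon's unique unit ID.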

5. Discussion

We have presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas, we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time rather than read large files into a large array. This was necessary despite the large amount of memory contained in the HPC.

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.
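For illustration, the unit-coded naming scheme can be captured in a small helper; the folder names below are placeholders, but the two-, five- and seven-digit FIPS convention (state, county, census place) is the one used throughout the LTM-HPC:

    import os

    def unit_path(root, fips, ext, stage):
        """Build the hierarchical path for a spatial unit from its FIPS code:
        2 digits = state, 5 = county, 7 = census place."""
        level = {2: "state", 5: "county", 7: "place"}[len(fips)]
        folder = os.path.join(root, stage, ext, level, fips[:2])
        if not os.path.isdir(folder):
            os.makedirs(folder)
        return os.path.join(folder, fips + "." + ext)

    # e.g., unit_path(r"\\storage\ltm", "1882862", "pat", "output")
    # -> \\storage\ltm\output\pat\place\18\1882862.pat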

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is the inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
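This is also why the standard error files are useful operationally: a short script can sweep them after a job completes and report exactly which tasks failed. A minimal sketch, assuming each task writes a stderr.txt in its own folder (as in the naming scheme above):

    import os

    def failed_tasks(stderr_root):
        """Yield (task_folder, first_error_line) for non-empty stderr files."""
        for dirpath, _, files in os.walk(stderr_root):
            for name in files:
                if name == "stderr.txt":
                    path = os.path.join(dirpath, name)
                    if os.path.getsize(path) > 0:
                        with open(path) as fh:
                            yield dirpath, fh.readline().rstrip()

    # for task, err in failed_tasks(r"\\storage\ltm\output\pred"):
    #     print("FAILED: %s (%s)" % (task, err))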

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to mosaic a few areas manually, as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
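The core of that script can be approximated with the ArcGIS Mosaic To New Raster geoprocessing tool; the directory layout, output name, pixel type and the choice of MAXIMUM to resolve overlaps are placeholders rather than the exact settings we used:

    import glob
    import arcpy

    def mosaic_state(state_fips, year, in_dir, out_dir):
        """Mosaic per-polygon forecast grids into one state-level raster."""
        pieces = glob.glob("%s/%s/*_%s.asc" % (in_dir, state_fips, year))
        arcpy.MosaicToNewRaster_management(
            ";".join(pieces),                        # input rasters
            out_dir,                                 # output workspace
            "state_%s_%s.tif" % (state_fips, year),  # output name
            pixel_type="8_BIT_UNSIGNED",
            cellsize=30,
            number_of_bands=1,
            mosaic_method="MAXIMUM")                 # overlaps keep max value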

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; deviations from this ideal scaling are caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
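This near-ideal scaling can be summarized as T(n) ≈ T(1)/n for n cores; a trivial helper makes the planning arithmetic explicit (the efficiency term is our shorthand for the file-size variance noted above, and the example numbers are placeholders):

    def expected_runtime(serial_hours, cores, efficiency=1.0):
        """Data-parallel estimate: halving time per doubling of cores
        corresponds to T(n) = T(1) / (n * efficiency)."""
        return serial_hours / (cores * efficiency)

    # e.g., expected_runtime(200.0, 20, efficiency=0.9) -> about 11.1 hours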

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolutions

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at large extents and coarse resolutions, or alternatively at small extents and fine resolutions, may hinder the


ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include: how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997;

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous enough that several transition rules need to be applied in the same simulation, with rules assigned to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are individually relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, while patterns of location error differed from those of quantity, being lower in the east (Fig. 12). The location of errors could be important too if they affect the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
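Because every metric above derives from counts of the 01234 codes, a per-unit assessment reduces to a few ratios. A compact sketch (standard definitions of sensitivity and specificity are used here; the unit is assumed to contain some observed change):

    import numpy as np

    def fit_metrics(coded):
        """PCM, quantity error, sensitivity and specificity for one unit,
        from a 01234-coded map (code 4, excluded cells, is ignored)."""
        tn = np.count_nonzero(coded == 0)   # correct no-change
        fn = np.count_nonzero(coded == 1)   # omission error
        fp = np.count_nonzero(coded == 2)   # commission error
        tp = np.count_nonzero(coded == 3)   # correct change
        observed, predicted = tp + fn, tp + fp
        return {
            "pcm": 100.0 * tp / observed,
            "quantity_error": 100.0 * (predicted - observed) / observed,
            "sensitivity": float(tp) / (tp + fn),
            "specificity": float(tn) / (tn + fp),
        }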

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computation framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identity code for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array (see the sketch after this list).

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all of the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
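As a sketch of the line-at-a-time pattern behind advantage (2), in Python rather than the model's C: memory use stays bounded by one row of the grid, regardless of how large the ASCII file is (the standard six-line ESRI ASCII header is assumed):

    def iter_ascii_rows(path, header_lines=6):
        """Yield one row of an ESRI ASCII grid at a time, after the header,
        so memory stays bounded by a single row rather than the whole raster."""
        with open(path) as fh:
            for _ in range(header_lines):
                next(fh)                    # skip ncols, nrows, xllcorner, ...
            for line in fh:
                yield [float(v) for v in line.split()]

    # e.g., urban = sum(row.count(1.0) for row in iter_ascii_rows("18.asc"))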

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modelling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry? Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modelling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modelling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate–land interactions in East Africa. Geoforum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: conceptual elements, spatial object class hierarchies, GIS command syntax and an application for Michigan's Saginaw Bay watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.

Page 5: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 519

Fig 4 Tool and data view of the LTM-HPC (see Legend for a description of model components and their meaning) (For interpretation of the references to color in this 1047297gure lege

article)

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 619

item 1) as this is done once for each simulation this LTM

component is not automated A C program called createpatexe

(Fig 4 item 2) is used to convert spatial data to neural net 1047297les

called a pattern 1047297le (Fig 4 item 3) given an PAT extension data

are transposed into the ANN structure Data necessary to process

1047297les for the training run for neural net simulation are model inputs

two land use maps separated by approximately 10 years or more

and a map of locations that need tobe excluded from the neural net

simulation Vector shape 1047297les (eg roads) and raster 1047297les (eg

digital elevation models land usecover maps) are loaded into

ArcGIS and ESRIrsquos Spatial Analyst is used to calculate values per

pixel in the simulation domain that are used as inputs to the neural

net A raster 1047297le is selected (eg base land use map) to set ESRI

Spatial Analyst Environment properties such as cell size numberof

row and columns for all data processing to ensure that all inputs

have standard dimensions A separate 1047297le referred to as the

exclusionary zone map (Fig 5 item 1) is created using a GIS

Exclusionary maps contain locations where a land use transition

cannot occur in the future For a model con1047297gured to simulate ur-

ban for example areas that are in protected areas (eg public

parks) open water or are already urban are coded with a lsquo4rsquo in the

exclusionary zone map This exclusionary map is used in several

steps of the LTM for excluding data that is converted for use inpattern recognition model calibration and model forecasting The

coding of locations with a lsquo4rsquo becomes more obvious below under

the presentation of model calibration Inputs (Fig 5 item 2) are

created by applying spatial transition rules outlined in Pijanowski

et al (2000) A frequent input map is distance to roads for our

LTM-HPC application example below Spatial Analystrsquos Euclidean

Distance Tool is used to calculate the distance each pixel is from the

nearest road All GIS data for use in the LTM are written out as an

ASCII 1047298at 1047297le (Fig 5A)

Two land use maps are used to determine the locations of

observed change (Fig 5 3) and these are necessary for the

training runs The program createpatexe (Fig 5B) stores a value of

lsquo0rsquo if no change in a land use class occurred and a lsquo1rsquo if change was

observed (Fig 5 5) The testing run does not use land use mapsfor input and the output values are estimated by the neural net in

the phase of the model The program createpat uses the same

input and exclusionary maps to create a pattern 1047297le for testing

(Fig 5 item 3)

A key component of the LTM is converting data from a GIS

compatible format to a neural network format called a pattern 1047297le

(Fig 5C) Conversion of 1047297les from raster maps to data for use by the

neural network requires both transposing the database structure

and standardizing all values (Fig 5 6) The maximum value that

occurs in the input maps for training is also stored in the input 1047297le

and this is used to standardize all values from the input maps

because the neural network can only use values between 00 and

10 (Fig 5C) Createpatexe also uses the exclusionary map (Fig 4

1) in ASCII format to exclude all locations that are not convertibleto the land use class being simulated (eg wildlife parks should not

convert to urban) For training runs createpatexe also selects

subsamples of the databases (by location) the percentage of the

data to be selected is speci1047297ed in the input 1047297le Finally crea-

tepatexe also checks the headers of all maps to ensurethat they are

of the same dimensions

33 Pattern recognition

SNNS has several choices for training the program that per-

forms training and testing is called batchmanexe (Fig 4 item 4)

As this process uses a subset of data and cannot be parallelized

easily we conducted training on a single workstation Batchma-

nexe allows for several options which are employed in the LTM

These include a ldquoshuf 1047298erdquo option which randomly orders the data

presented to the neural network during each pass (ie cycle) (cf

Shellito and Pijanowski 2003 Peralta et al 2010) the values for

initial weights (cf Denoeux and Lengelleacute 1993) the name of the

pattern 1047297les for input and output the 1047297lename containing the

network values and a set of start and stop conditions (eg a stop

condition can be set if a MSE or a certain number of cycles is

reached) We control the speci1047297c batchmanexe execution pa-

rameters using a DOS batch 1047297 le called trainbat (Fig 4 item 5)

Training is followed over the training cycles with MSE ( Fig 4 item

6) and1047297les (called ltmnet Fig 4 item 7)with weights bias and

activation function values saved every N number of cyclesAn MSE

equal to 00 is a condition that output of ANN matches the data

perfectly (Bishop 2005) Pijanowski et al (2005 2011) has shown

that LTM stabilizes after less than 100000 cycles in most cases

Pseudo-code for the TRAINBAT is

loadNet(ldquoltmnetrdquo)

loadPattern(ldquotrainpatrdquo)

setInitFunc (ldquoRandomize_Weightsrdquo 10 10)

setShuf 1047298e (TRUE)

initNet()

trainNet()while MSE gt 00 and CYCLES lt500000 do

if CYCLES mod 100 frac14 frac14 0 then

print (CYCLESldquo rdquoMSE)

endif

if CYCLES frac14 frac14 100 then

saveNet (ldquo100netrdquo)

endif

We used the SNNS batchmanexe program to create a suitability

map (ie a map of ldquoprobabilityrdquo of each cell undergoing urban

change) used for forecasting and calibration to do this data have to

be converted from SNNS format to a GIS compatible format (Fig 4

item 11) The testPAT 1047297les (Fig 4 item 13) are converted to

probability maps by applying the saved ltmnet 1047297le (1047297le with theweights bias and activation values Fig 4 item 8) produced from

the training run using batchmanexe (Fig 4 item 8) Output from

this execution is calleda RES(or result)1047297le (Fig 4 item9)The RES

1047297le contains estimates of output created by the neuralnetworkThe

RES 1047297le is then transposed back to an ASCII map (Fig 4 item 10)

using a C program All values from the RES 1047297les are between 00

and 10 convert_asciiexe also stores values in the ASCII suitability

(Fig 4 item 11) maps as integer values between 0 and 100000 by

multiplying the RES 1047297le values by 100000 so that the raster1047297le in

ArcGIS is not 1047298oating point (1047298oating point 1047297les in ArcGIS require a

less ef 1047297cient storage format and thus large 1047298oating point 1047297les

through ArcGIS 100 are unstable)

We also train on data models where we ldquohold one input outrdquo at

a time (Fig 4 item 12 Pijanowski et al 2002a Tayyebi et al2010) for example in one set distance to roads is held out and

compared to having all inputs in the training Thus if we start with

a 1047297ve input variable neural network model we hold one out at a

time and create calibration time step maps for each and save error

1047297les over training cycles for each of the reduced input variable

models

34 Calibration

For model calibration (see Bennett et al 2013 for an extensive

review of the topic our approach follows their recommendations)

we consider three different sets of metrics to judge the goodness of

1047297t of the neural network model The 1047297rst is mean square error

(MSE) which is plotted over training cycles to ensure that the

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 255

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 719

Fig 5 Data processing steps for converting data from a GIS format to a pattern 1047297le format for use in SNNS

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 819

neural network settles at a global minimum value MSE is calcu-

lated as the difference between the estimate produced by the

neural network (range 00e10) and the observed value of land use

change (0 or 1) MSE values are saved very 100 cycles and training is

generally followed out to about 100000 cycles The second set of

goodness of 1047297t metrics is those created from a calibration map A

calibration map is also constructed within the GIS using three maps

coded specially for assessment of model goodness of 1047297t A map of

observed change between the two historical maps (Fig 4 item 16)

is created such that observed change frac14 0 and no change is frac14 1 A

mapthat predicts the same land use changes over the same amount

of time (Fig 4 15) is coded so that predicted change is frac14 2 and no

predicted change is frac14 0 These two maps are then summed along

with the exclusionary zone map that is coded 0 frac14 location can

change and 4 frac14 location that needs to be excluded The resultant

calibration map (Fig 4 17) generates values 0 through 4 with

correct predictions of 0 frac14 correctly predicted no change and

3 frac14 correctly predicted change Values of 2 and 3 represent

different errors (omission and commission or false positive and

false negative) The proportion of each type of error and correctly

predicted location are used to calculate (1) the proportion of

correctly predicted change locations to the number of observed

change cells also called the percent correct metric (proportion of

correctly predicted land use changes to the number of observed

land use changes) or PCM (Pijanowski et al 2002a) (2) sensitivity

(the proportion of false positives) and speci1047297city (the proportion of

false negatives) and (3) scaleable PCM values across different

window sizes

Fig 6 shows how scaleable PCM values are calculated using a

01234-coded calibration map across different window sizes The

1047297rst step is to calculate the total number of true positives (cells

coded as 3s)in the calibration map (Fig 6A) For a givenwindowof

say(eg5 cells by 5 cells) a pair of false positives (cells code as 2s)

and false negative (cells coded as 1s) are considered together as a

correct prediction at that scale and window the number of 3s is

incremented by one for every pair of false positive and false

negative cells The window is then movedone position to theright

(Fig 6B) and pairs of 1s and2s areagain added to the total number

of 3s for that calibration map such that any 1s or 2s already

counted are not considered This moving N N window is passed

across the entire simulation area and the 1047297nal number of 3s

recorded (Fig 6C) The window size is then incremented by 2 (ie

the next window size after a 5 5 would be a 7 7) and after all

of the windows are considered in the map the process is repeated

Fig 6 Steps in the calculation of PCM across a moving scaleable window Part 6A calculates the total number of true positives (coded as 3s) The window is then moved one position

to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s This moving window is passed across the entire area and the 1047297nal number of 3s recorded (Part

6C) The window size is then incremented by 2 and the process is repeated Part 6D gives PCM across scaleable window sizes

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 257

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 919

(note that the number of 3s is reset to the number of 3s is the

entire calibration map) and the number of 3s saved for that

window size Window sizes that we often plot are between 3 and

101 Fig 6D gives an example PCM across scaleable window sizes

Note in this plot that the PCM begins to exceed 50 around a

window size of 9 9 which for this simulation conducted at

100x 100m means that PCM reaches 50 at 900m 900m The

scaleable window plots are also made for each reduced input

model as well in order to determine the behavior of the training of

the neural network against the goodness of 1047297t of the calibration

maps by input

The 1047297nal step for calibration is the selection of the network 1047297le

(Fig 4 items 16e19) with inputs that best represent land use

change and an assessment of how well the model predicts across

different spatial scales The network 1047297le with the weights bias

activation values are saved for the model with the inputs consid-

ered the best for the model application If the model does not

perform adequately (Fig 4 item 19) the user may consider other

input drivers or dropping drivers which reduce model goodness of

1047297t However if the drivers selected provide a positive contribution

to the goodness of 1047297t and the overall model is deemed adequate

then this network 1047297le is saved and used in the next step model

validation

35 Validation

We follow the recommended procedures of Pontius et al

(2004) and Pontius and Spencer (2005) to validate our model

Brie1047298y we use an independent data set across time to conduct an

historical forecast to compare a simulated map (Fig 4 15) with an

observed historical land use map that was not used to build the

ANN model (Fig 4 20) For example below (Section 46) we

describe how we use a 2006 land use map that was not used to

build the model to compare to a simulated map Validation metrics

(Fig 4 21) include the same as that used for calibration namely

PCM of the entire map or spatial unit sensitivity speci1047297city PCM

across window sizes and error of quantity It should be noted thatbecause we 1047297x the quantity of the land use class that changes be-

tween time 1 and time 2 for calibration we do so for validation as

well (eg between time 2 and time 3 the number of cells that

changed in the observed maps are used to1047297x the quantity of cells to

change in the simulation that forecasts time 3)

36 Forecasting

We designed the LTM-HPC so that the quantity model (Fig 4

24) of the forecasting component can be executed for any spatial

unit category like government units watersheds or ecoregions or

any spatial unit scale such as states counties or places The

quantity model is developed of 1047298ine using Exceland algorithms that

relate a principle index driver (PID see Pijanowski et al 2002a)that scales the amount of land use change (eg urban or crops) per

person In theapplication described below we execute the model at

several spatial unit scales e cities states and the lower 48 states

Using a combination of unique unit IDs (eg federal information

processing systems (FIPS) codes are used for government unit IDs)

a 1047297le and directory-naming system XML 1047297les and python scripts

the HPC was used to manage jobs and tasks organized by the

unique unit IDs

We next use a program written in C to convert probability

values to binarychange values (0 are cells without change and 1 are

locations of change in prediction map) using input from the

quantity change model (Fig 4 24) The quantity change model

produces a table of the number of cells to grow for each time step

for each spatial unit froma CSV 1047297

le Rowsin the CSV 1047297

le contain the

unique unit IDS and the number of cells to transition for each time

step The program reads the probability map for the spatial unit

(ie a particular city) being simulated counts the number of cells

for each probability value and then sorts the values and counts by

rank The original order is maintained using an index for each re-

cord The probability values with high rank are then converted to

urban (code 1) until the numbers of new urban cells for each unit is

satis1047297ed while other cells (code 0) remain without change A

separate GIS map (Fig 4 25) may be created that would apply

additional exclusionary rules to create an alternative scenario

Output from the model (Fig 4 item 26) is used for planning or

natural resource management (Skole et al 2002 Olson et al 2008)

(Fig 4 item 27) as input to other environmental models (eg Ray

et al 2012 Wiley et al 2010 Mishra et al 2010 or Yang et al

2010) (Fig 4 item 28) or the production of multimedia products

that can be ported to the internet (Fig 4 item 29)

37 HPC job con 1047297 guration

We developed a coding schema for the purposes of running the

simulation across multiple locations We used a standard

numbering system from the Federal Information Processing Sys-

tems (FIPS) that is associated with states counties and places FIPSis a hierarchical numbering system that assigns states a two-digit

code and a county in those states a three-digit code A speci1047297c

county is thus given a 1047297ve-digit integer value (eg 18157 for Tip-

pecanoe County Indiana) and places are given a seven-digit code

two digits for the state and 1047297ve digitsfor the place (eg1882862 for

the city of West Lafayette Indiana)

Con1047297guring HPC jobs and constructing the associated XML 1047297les

can be approached in different ways The 1047297rst is to develop one job

and one XML 1047297le per model simulation component (eg mosaick-

ing individual census place spatial maps into a national map) For

our LTM-HPC application where we would need to mosaic over

20000 census places a job failure for any of the places would result

in the one large job stopping and then addressing the need to

resume the execution at the point of failure A second approachused here is to group tasks into numerous jobs where the number

of jobs and associated XML 1047297les is still manageable A failure of one

census place would require less re-execution and trouble shooting

of that job We often grouped the execution of census place tasks by

state using the FIPS designator for both to assign names for input

and output 1047297les

Five different jobs are part of the LTM-HPC (Fig 7) those for

clipping a large 1047297le into smaller subsets another for mosaicking

smaller 1047297les into one large 1047297le one for controlling the calibration

programs another job for creating forecast maps and a 1047297fth for

controlling data transposing between ASCII 1047298at 1047297les and SNNS

pattern 1047297les XML 1047297les are used by the HPC job manager to subdi-

vide the job into tasks for example our national simulation

described below at county and places levels is organized by stateand thus the job contains 48 tasks one for each state Fig 7 is a

sample Windows jobs manager interface for mosaicking over

20000 places Each topline Fig 7 (item 1) represents an XML for a

region (state) with the status (item 2) Core resources are shown

(Fig 7 item 3) A tab (Fig 7 item 4) displays the status of each

task (Fig 7 item 5) within a job We used a python script to create

each of the xml 1047297les although any programming or scripting lan-

guage can be used

We then used an ArcGIS python script to mosaic the ASCII

maps an XML 1047297le that lists 1047297le and path names was used as input

to the python script Mosaicking and clipping are conducted in

ArcGIS using python scripts polygon_clippy and poly-

gon_mosaicpy Both ArcGIS python scripts read the digital spatial

unit codes from a variable in the shape 1047297

le attribute table and

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268258

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1019

names 1047297les based on the unit code The resultant mosaicked

suitability map produced from training and data transposing

constitutes a map of the entire study domain Creating such a

suitability map of the entire simulation domain allows us to (1)

import the ASCII 1047297le into ArcGIS in order to inspect and visualize

the suitability map (2) allow the researcher to use different

subsetting and mosaicking spatial units (as we did below) and (3)

allow the researcher to forecast at different spatial units (we also

illustrate this below as well)

4 Execution of LTM-HPC

41 Hardware and software description

We executed the LTM-HPC on three computer systems (Fig 8)

One computer a high-end workstation was used to process inputs

for the modeling using GIS A windows cluster was used to

con1047297gure the LTM-HPC and all of the processing of about a dozen

steps occurred on this computer system A third computer system

stored all of the data for the simulations Speci1047297c con1047297guration of

each computer system follows

Data preparation was performed on a high-end Windows 7

Enterprise 64-bit computer workstation equipped with 24 GB of

RAM a 256 GB solid state hard drive a 2 TB local hard drive and

ArcGIS 100 with Spatial Analyst extension Speci1047297c procedures

used to create each of the data layers for input to the LTM can be

found elsewhere (Pijanowski et al 1997 Tayyebi et al 2012)

Brie1047298y data were processed for the entire contiguous United States

at 30m resolution and distance to key features like roads and

streams were processed using the Euclidean Distance tool in Arc-

GIS setting all output to double precision integer given the large

size of each dataset we limited the distance to 250 km Once thedata were processed on the workstation 1047297les were moved to the

storage server

The hardware platform on which the parallelization was carried

out was a cluster of HPC consisting of 1047297ve nodes containing a total

of 20 cores Windows Server HPC Edition 2008 was installed on the

HPCC Each node was powered by a pair of dual core AMD Opteron

285 processors and 8 GB of RAM Each machine had two 1 GBs

network adapters with one used for cluster communication and the

other for external cluster communication Each node had 74 GB of

hard drive space that was used for the operating system and soft-

ware but was not used for modeling The HPC cluster used for our

national LTM application consisted of one server (ie head node)

that controls other servers (ie compute nodes) which read and

write data from a data server A cluster is the top-level unit which

Fig 7 Data structure programs and 1047297les associated with training by the neural network Item 1 represents an XML fora region (state) with the status (item 2) Core resources are

shown in item 3 Item 4 displays the status of each task (item 5) within a job

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 259

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119

is composed of nodes or single physical or logical computers with

one or more cores that include one or more processors All

modeling data was read and written to a storage machine located in

another building and transferred across an intranet with a

maximum of 1 Gigabit bandwidth

The data storage server was composed of 24 two terabyte

7200 RPM drives in a RAID 6 con1047297guration This server also had

Windows 2008 Server R2 installed Spot checks of resource moni-

toring showed that the HPC was not limited by network or disk

access and typically ran in bursts of 100 CPU utilization ArcGIS

100 with the Spatial Analyst extension was installed on all servers

Based on the results of the 1047297le number per folder and the use of

unique unit IDs as part of the 1047297le and directory-naming scheme we

used a hierarchical directory structure as shown in Fig 9 The upper

branches of the directory separate 1047297les into input and output di-

rectories and subfolders store data by type (ASC or PAT 1047297les)

location unit scale (national state) and for forecasts years and

scenarios

Fig 9 Directory structure for the LTM-HPC simulation

Fig 8 Computer systems involved in the LTM-HPC national simulations

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268260

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219

42 Preliminary tests

The primary limitation in 1047297le size comes from SNNS This limit

was reached in the probability map creation phase in several

western US counties when the RES1047297lewhichcontainsthe values

for all of the drivers (eg distance to urban etc) crashed To

overcome this issue we divided the country into grids that pro-

duced 1047297les that SNNS was capable of handling for the steps up to

and including pattern 1047297le creation which is done on a pixel-by-

pixel basis and is not spatially dependent For organization and

performance reasons1047297leswere grouped into folders by stateAs the

SNNS only uses the probability values in the projection phase we

were able to project at the county level

Early tests with mosaicking the entire country at once were

unsuccessful and led to mosaicking by state The number of states

and years of projection for each state made populating the tool

1047297elds in ArcGIS 100 Desktop a time intensive process We used

python scripts to overcome this issue and the HPC to process

multiple years and multiple states at the same time Although it is

possible to run one mosaic operation for each core we found that

running 24 operations on a machine led to corrupted mosaics We

attribute this to the large 1047297le sizes and limited scratch space

(approximately 200 GB) and to overcome this problem we limitedthe number of operations per server by specifying each task to 6

cores for most states and 12 cores for very large states such as CA

and TX

43 Data preparation for national simulation

We used ArcGIS 100 and Spatial Analyst to prepare 1047297ve inputs

for use in training and testing of the neural network Details of the

data preparation can be found elsewhere (Tayyebi et al 2012)

although a brief description of processing and the 1047297les that were

created follow We used the US Census 2000 road network line

work to create two road shape 1047297les highways and main arterials

We used ArcGIS 100 Spatial Analyst to calculate the distance that

each pixel was away from the nearest road Other inputs includeddistance to previous urban (circa 1990) distance to rivers and

streams distance to primary roads (highways) distance to sec-

ondary roads (roads) and slope

Preparing data for neural net training required the following

steps Land use data from approximately 1990 and 2000 were

collected from 18 different municipalities and 3 states These data

were derived from aerial photography by local government and

were thus deemed to be of high quality Original data were vector

and they were converted to raster using the simulation dimensions

described above Data from states were used to select regions in

rural areas using a random site selection procedure (described in

Tayyebi et al 2012)

Maps of public lands were obtained from ESRI Data Pack 2011

(ESRI 2011) Public land shape 1047297les were merged with locations of urban and open water in 1990 (using data from the USGS national

land cover database) and used to create the exclusionary layer for

the simulation Areas that were not located within the training area

were set to ldquono datardquo in ArcGIS Data from the US census bureau for

places is distributed as point location data We used the point lo-

cations (the centroid of a town city or village) to construct Thiessen

polygons representing the area closest to a particular urban center

(Fig 10) Each place was labeled with the FIPS designated census

place value

We executed the national LTM-HPC at three different spatial

scales and using two different kinds of spatial units ( Tayyebi et al

2012) government and 1047297xed-size tiles The three scales for our

government unit simulations were national county and places

(cities villages and towns)

All input maps were created at a national scale at 30m cell

resolution For training data were subset using ArcGIS on the local

computer workstation and pattern 1047297les created for training and

1047297rst phase testing (ie calibration) We also used the LTM-clippy

Python script to create subsamples for second phase testing In-

puts and the exclusionary maps were clipped by census place and

then written out as ASC 1047297les The createpat 1047297le was executed per

census place to convert the 1047297les from ASC to PAT

44 Pattern recognition simulations for national model

We presented a training 1047297le with 284477 cases (ie records or

locations) to the neural network using a feedforward back propa-

gation algorithm We followed the MSE during training saving this

value every 100 cycles We found that the minimum MSE stabilized

globally at 49500 cycles The SNNS network 1047297le (NET 1047297le) was

produced every 100 cycles so that we could analyze the training

later but the network 1047297le for 49500 cycles was saved and used to

estimate output (iepotential fora land usechange to occur at each

location) for testing

Testing occurred at the scale of tiles The LTM-clippy script was

used to create testing pattern 1047297les for each of the 634 tiles The

ltm49500nNET 1047297le was applied to each tile PAT 1047297le to create an

RES 1047297le for each tile RES 1047297les contain estimates of the potential

for each location to change land use (values 00 to 10 where closer

Fig 10 Spatial units involved in the LTM-HPC national simulation

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 261

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1319

to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.
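A Python equivalent of this scaling step might look as follows (the RES layout is simplified here to one value per line; the real file also carries SNNS header records):

    # Scale neural-net outputs (0.0-1.0) to integers 0-100000 for a GIS-friendly grid.
    SCALE = 100000
    with open("tile_042.res") as res, open("tile_042.asc", "w") as asc:
        for line in res:
            asc.write("%d\n" % int(round(float(line) * SCALE)))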

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
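Because the work is per-pixel, the tiling itself is simple arithmetic; a sketch with illustrative (not actual) national grid dimensions:

    # Enumerate <=4000 x 4000 pixel tiles over the national grid and assign IDs.
    TILE = 4000                                        # maximum tile edge, in 30 m pixels
    NROWS, NCOLS = 96000, 152000                       # national grid size (illustrative)
    tiles = []
    for r in range(0, NROWS, TILE):
        for c in range(0, NCOLS, TILE):
            tiles.append((r, c, min(TILE, NROWS - r), min(TILE, NCOLS - c)))
    tile_ids = ["%03d" % i for i in range(len(tiles))] # three-digit codes, e.g. '000'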

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full input variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments to weights and to the values for activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute in a positive way toward the goodness of fit and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
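The curves in Fig. 11 are straightforward transformations of the saved error series; a sketch, assuming one "cycle MSE" log per model variant (file names hypothetical):

    # Percent difference in MSE of each drop-one-out model relative to the full model.
    def read_mse(path):                                # returns {cycle: mse} from a log
        return {int(c): float(m) for c, m in (line.split() for line in open(path))}

    full = read_mse("mse_full.log")
    for driver in ("highway", "road", "urban", "stream", "slope"):
        reduced = read_mse("mse_drop_%s.log" % driver)
        # positive values mean the dropped driver was contributing to goodness of fit
        pct_diff = {c: 100.0 * (reduced[c] - full[c]) / full[c] for c in full}
        print(driver, pct_diff[49500])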

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
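The coded map and its summary statistics can be reproduced with NumPy; a sketch assuming the coding convention used by the LTM (observed change = 1, predicted change = 2, excluded = 4, so a sum of 3 marks a correctly predicted change) and standard six-line ESRI ASCII headers:

    # Build a 01234-coded map and tabulate the percentage of each code.
    import numpy as np

    observed = np.loadtxt("observed_change.asc", skiprows=6)    # 0 or 1
    predicted = np.loadtxt("predicted_change.asc", skiprows=6)  # 0 or 2
    excluded = np.loadtxt("exclusion.asc", skiprows=6)          # 0 or 4
    coded = np.where(excluded == 4, 4, observed + predicted)    # values 0-4
    for value, count in zip(*np.unique(coded, return_counts=True)):
        print(int(value), 100.0 * count / coded.size)           # percent of each code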

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness-of-fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in the quantity of urban greater than 5%.

The second goodness-of-fit assessment highlights the use of the HPC for a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric (PCM) was saved at the 101 x 101 window size (i.e., 3 km by 3 km) and PCM values were merged

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.


with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.
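A brute-force sketch of the scaleable window idea follows; it pairs each miss (code 1) with one unused false alarm (code 2) inside the surrounding window and counts the pair as correct at that scale. This is a simplified reading of the production routine, whose window handling and bookkeeping differ:

    # Scaleable-window percent correct metric (PCM) on a 01234-coded map (sketch).
    import numpy as np

    def scaleable_pcm(coded, win):
        # coded: 0 = correct no-change, 1 = miss, 2 = false alarm, 3 = hit, 4 = excluded
        half = win // 2
        fp = coded == 2                                # false alarms still available
        hits = int((coded == 3).sum())
        for r, c in zip(*np.nonzero(coded == 1)):      # visit every miss
            window = fp[max(0, r - half):r + half + 1, max(0, c - half):c + half + 1]
            if window.any():                           # pair the miss with a false alarm
                i, j = np.argwhere(window)[0]
                fp[max(0, r - half) + i, max(0, c - half) + j] = False
                hits += 1
        observed = int(((coded == 1) | (coded == 3)).sum())
        return 100.0 * hits / observed                 # PCM, in percent, at this window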

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed a quantity model (Tayyebi et al., 2012) that contained the number of urban cells to grow for each polygon, for 10-year time steps from 2010 to 2060. We considered each state as a job, including all the

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.


polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what each program did during execution. We embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
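Such job files are generated by script rather than written by hand; a hypothetical generator for one state follows (element and attribute names are modeled loosely on the Windows HPC Server 2008 job XML schema, and all paths and executables are invented):

    # Write a per-state job XML whose tasks carry stdout/stderr paths (illustrative).
    from xml.sax.saxutils import quoteattr

    def write_job_xml(state_fips, place_fips_list):
        lines = ['<Job Name="LTM_pred_%s">' % state_fips, '  <Tasks>']
        for fips in place_fips_list:               # one task per spatial unit
            lines.append('    <Task Name="pred_%s" CommandLine=%s StdOutFilePath=%s StdErrFilePath=%s />'
                         % (fips,
                            quoteattr(r"D:\ltm\bin\predict.exe %s.csv" % fips),
                            quoteattr(r"D:\ltm\logs\%s.out" % fips),
                            quoteattr(r"D:\ltm\logs\%s.err" % fips)))
        lines += ['  </Tasks>', '</Job>']
        open("job_%s.xml" % state_fips, "w").write("\n".join(lines))

    write_job_xml("18", ["1882862", "1836003"])    # Indiana, with two place tasks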

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to the topographic limitations of this large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of a single-workstation land change model that has been converted to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger land change models, especially ones that can be coupled to large-scale process models such as climate models (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas, we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
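The same streaming discipline is easy to express in Python; a sketch that rescales one grid without ever holding it in memory (file names are hypothetical; the 0-100,000 value range follows the convert2ascii convention described above):

    # Process a large ESRI ASCII grid one line at a time instead of as a full array.
    with open("national_prob.asc") as src, open("national_pct.asc", "w") as dst:
        for i, line in enumerate(src):
            if i < 6:                              # pass the six header lines through
                dst.write(line)
            else:                                  # rescale 0-100000 to 0-100
                dst.write(" ".join(str(int(v) // 1000) for v in line.split()) + "\n")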

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
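A sketch of such a naming helper follows (the directory layout shown is hypothetical; only the FIPS-based coding is taken from the text):

    # Compose file paths from the FIPS-based coding system (layout hypothetical).
    import os

    def ltm_path(root, stage, scale, unit_fips, layer, ext):
        # e.g. D:\ltm\output\asc\county\18\18157_prob.asc
        return os.path.join(root, stage, ext.lstrip("."), scale, unit_fips[:2],
                            "%s_%s%s" % (unit_fips, layer, ext))

    print(ltm_path(r"D:\ltm", "output", "county", "18157", "prob", ".asc"))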

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004, 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail for one of the following reasons. (1) One or more tasks in the job have failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is the inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to mosaic a few areas manually because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
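The core of that script is a single geoprocessing call; a stripped-down sketch (paths, wildcard and pixel type are assumptions):

    # Mosaic per-unit rasters into one state raster with arcpy.
    import arcpy

    arcpy.env.workspace = r"D:\ltm\pred\state18"
    arcpy.MosaicToNewRaster_management(
        input_rasters=arcpy.ListRasters("18*"),    # every piece for FIPS state 18
        output_location=r"D:\ltm\pred\mosaics",
        raster_dataset_name_with_extension="state18_2050.img",
        pixel_type="32_BIT_UNSIGNED",
        number_of_bands=1)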

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the time required relative to the LTM is approximately halved for every doubling of cores, with deviations from this ideal scaling caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale) and to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimized the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolutions

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at coarse resolutions, or alternatively at small extents and fine resolutions, may hinder the


ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how the patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997;

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common, supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on that continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine-resolution accuracy; (2) a quantification of the variability of fine-resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and (5) measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States; patterns of location error differed from those of quantity, being lower in the east (Fig. 12). The location of errors could be important too if it affects the policies or outcomes of an environmental assessment: if policies are being explored to determine the impact of land use change on stream riparian zones, model location accuracy needs to be good along streams; if environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness-of-fit metrics have not been designed to consider large big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all of the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. In: HPDC 1992, pp. 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry? Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modelling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modelling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: conceptual elements, spatial object class hierarchies, GIS command syntax and an application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919

Page 6: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 619

item 1) as this is done once for each simulation this LTM

component is not automated A C program called createpatexe

(Fig 4 item 2) is used to convert spatial data to neural net 1047297les

called a pattern 1047297le (Fig 4 item 3) given an PAT extension data

are transposed into the ANN structure Data necessary to process

1047297les for the training run for neural net simulation are model inputs

two land use maps separated by approximately 10 years or more

and a map of locations that need tobe excluded from the neural net

simulation Vector shape 1047297les (eg roads) and raster 1047297les (eg

digital elevation models land usecover maps) are loaded into

ArcGIS and ESRIrsquos Spatial Analyst is used to calculate values per

pixel in the simulation domain that are used as inputs to the neural

net A raster 1047297le is selected (eg base land use map) to set ESRI

Spatial Analyst Environment properties such as cell size numberof

row and columns for all data processing to ensure that all inputs

have standard dimensions A separate 1047297le referred to as the

exclusionary zone map (Fig 5 item 1) is created using a GIS

Exclusionary maps contain locations where a land use transition

cannot occur in the future For a model con1047297gured to simulate ur-

ban for example areas that are in protected areas (eg public

parks) open water or are already urban are coded with a lsquo4rsquo in the

exclusionary zone map This exclusionary map is used in several

steps of the LTM for excluding data that is converted for use inpattern recognition model calibration and model forecasting The

coding of locations with a lsquo4rsquo becomes more obvious below under

the presentation of model calibration Inputs (Fig 5 item 2) are

created by applying spatial transition rules outlined in Pijanowski

et al (2000) A frequent input map is distance to roads for our

LTM-HPC application example below Spatial Analystrsquos Euclidean

Distance Tool is used to calculate the distance each pixel is from the

nearest road All GIS data for use in the LTM are written out as an

ASCII 1047298at 1047297le (Fig 5A)

Two land use maps are used to determine the locations of

observed change (Fig 5 3) and these are necessary for the

training runs The program createpatexe (Fig 5B) stores a value of

lsquo0rsquo if no change in a land use class occurred and a lsquo1rsquo if change was

observed (Fig 5 5) The testing run does not use land use mapsfor input and the output values are estimated by the neural net in

the phase of the model The program createpat uses the same

input and exclusionary maps to create a pattern 1047297le for testing

(Fig 5 item 3)

A key component of the LTM is converting data from a GIS

compatible format to a neural network format called a pattern 1047297le

(Fig 5C) Conversion of 1047297les from raster maps to data for use by the

neural network requires both transposing the database structure

and standardizing all values (Fig 5 6) The maximum value that

occurs in the input maps for training is also stored in the input 1047297le

and this is used to standardize all values from the input maps

because the neural network can only use values between 00 and

10 (Fig 5C) Createpatexe also uses the exclusionary map (Fig 4

1) in ASCII format to exclude all locations that are not convertibleto the land use class being simulated (eg wildlife parks should not

convert to urban) For training runs createpatexe also selects

subsamples of the databases (by location) the percentage of the

data to be selected is speci1047297ed in the input 1047297le Finally crea-

tepatexe also checks the headers of all maps to ensurethat they are

of the same dimensions

33 Pattern recognition

SNNS has several choices for training the program that per-

forms training and testing is called batchmanexe (Fig 4 item 4)

As this process uses a subset of data and cannot be parallelized

easily we conducted training on a single workstation Batchma-

nexe allows for several options which are employed in the LTM

These include a ldquoshuf 1047298erdquo option which randomly orders the data

presented to the neural network during each pass (ie cycle) (cf

Shellito and Pijanowski 2003 Peralta et al 2010) the values for

initial weights (cf Denoeux and Lengelleacute 1993) the name of the

pattern 1047297les for input and output the 1047297lename containing the

network values and a set of start and stop conditions (eg a stop

condition can be set if a MSE or a certain number of cycles is

reached) We control the speci1047297c batchmanexe execution pa-

rameters using a DOS batch 1047297 le called trainbat (Fig 4 item 5)

Training is followed over the training cycles with MSE ( Fig 4 item

6) and1047297les (called ltmnet Fig 4 item 7)with weights bias and

activation function values saved every N number of cyclesAn MSE

equal to 00 is a condition that output of ANN matches the data

perfectly (Bishop 2005) Pijanowski et al (2005 2011) has shown

that LTM stabilizes after less than 100000 cycles in most cases

Pseudo-code for the TRAINBAT is

loadNet(ldquoltmnetrdquo)

loadPattern(ldquotrainpatrdquo)

setInitFunc (ldquoRandomize_Weightsrdquo 10 10)

setShuf 1047298e (TRUE)

initNet()

trainNet()while MSE gt 00 and CYCLES lt500000 do

if CYCLES mod 100 frac14 frac14 0 then

print (CYCLESldquo rdquoMSE)

endif

if CYCLES frac14 frac14 100 then

saveNet (ldquo100netrdquo)

endif

We used the SNNS batchmanexe program to create a suitability

map (ie a map of ldquoprobabilityrdquo of each cell undergoing urban

change) used for forecasting and calibration to do this data have to

be converted from SNNS format to a GIS compatible format (Fig 4

item 11) The testPAT 1047297les (Fig 4 item 13) are converted to

probability maps by applying the saved ltmnet 1047297le (1047297le with theweights bias and activation values Fig 4 item 8) produced from

the training run using batchmanexe (Fig 4 item 8) Output from

this execution is calleda RES(or result)1047297le (Fig 4 item9)The RES

1047297le contains estimates of output created by the neuralnetworkThe

RES 1047297le is then transposed back to an ASCII map (Fig 4 item 10)

using a C program All values from the RES 1047297les are between 00

and 10 convert_asciiexe also stores values in the ASCII suitability

(Fig 4 item 11) maps as integer values between 0 and 100000 by

multiplying the RES 1047297le values by 100000 so that the raster1047297le in

ArcGIS is not 1047298oating point (1047298oating point 1047297les in ArcGIS require a

less ef 1047297cient storage format and thus large 1047298oating point 1047297les

through ArcGIS 100 are unstable)

We also train on data models where we ldquohold one input outrdquo at

a time (Fig 4 item 12 Pijanowski et al 2002a Tayyebi et al2010) for example in one set distance to roads is held out and

compared to having all inputs in the training Thus if we start with

a 1047297ve input variable neural network model we hold one out at a

time and create calibration time step maps for each and save error

1047297les over training cycles for each of the reduced input variable

models

34 Calibration

For model calibration (see Bennett et al 2013 for an extensive

review of the topic our approach follows their recommendations)

we consider three different sets of metrics to judge the goodness of

1047297t of the neural network model The 1047297rst is mean square error

(MSE) which is plotted over training cycles to ensure that the

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 255

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 719

Fig 5 Data processing steps for converting data from a GIS format to a pattern 1047297le format for use in SNNS

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 819

neural network settles at a global minimum value MSE is calcu-

lated as the difference between the estimate produced by the

neural network (range 00e10) and the observed value of land use

change (0 or 1) MSE values are saved very 100 cycles and training is

generally followed out to about 100000 cycles The second set of

goodness of 1047297t metrics is those created from a calibration map A

calibration map is also constructed within the GIS using three maps

coded specially for assessment of model goodness of 1047297t A map of

observed change between the two historical maps (Fig 4 item 16)

is created such that observed change frac14 0 and no change is frac14 1 A

mapthat predicts the same land use changes over the same amount

of time (Fig 4 15) is coded so that predicted change is frac14 2 and no

predicted change is frac14 0 These two maps are then summed along

with the exclusionary zone map that is coded 0 frac14 location can

change and 4 frac14 location that needs to be excluded The resultant

calibration map (Fig 4 17) generates values 0 through 4 with

correct predictions of 0 frac14 correctly predicted no change and

3 frac14 correctly predicted change Values of 2 and 3 represent

different errors (omission and commission or false positive and

false negative) The proportion of each type of error and correctly

predicted location are used to calculate (1) the proportion of

correctly predicted change locations to the number of observed

change cells also called the percent correct metric (proportion of

correctly predicted land use changes to the number of observed

land use changes) or PCM (Pijanowski et al 2002a) (2) sensitivity

(the proportion of false positives) and speci1047297city (the proportion of

false negatives) and (3) scaleable PCM values across different

window sizes

Fig 6 shows how scaleable PCM values are calculated using a

01234-coded calibration map across different window sizes The

1047297rst step is to calculate the total number of true positives (cells

coded as 3s)in the calibration map (Fig 6A) For a givenwindowof

say(eg5 cells by 5 cells) a pair of false positives (cells code as 2s)

and false negative (cells coded as 1s) are considered together as a

correct prediction at that scale and window the number of 3s is

incremented by one for every pair of false positive and false

negative cells The window is then movedone position to theright

(Fig 6B) and pairs of 1s and2s areagain added to the total number

of 3s for that calibration map such that any 1s or 2s already

counted are not considered This moving N N window is passed

across the entire simulation area and the 1047297nal number of 3s

recorded (Fig 6C) The window size is then incremented by 2 (ie

the next window size after a 5 5 would be a 7 7) and after all

of the windows are considered in the map the process is repeated

Fig 6 Steps in the calculation of PCM across a moving scaleable window Part 6A calculates the total number of true positives (coded as 3s) The window is then moved one position

to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s This moving window is passed across the entire area and the 1047297nal number of 3s recorded (Part

6C) The window size is then incremented by 2 and the process is repeated Part 6D gives PCM across scaleable window sizes

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 257

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 919

(note that the number of 3s is reset to the number of 3s is the

entire calibration map) and the number of 3s saved for that

window size Window sizes that we often plot are between 3 and

101 Fig 6D gives an example PCM across scaleable window sizes

Note in this plot that the PCM begins to exceed 50 around a

window size of 9 9 which for this simulation conducted at

100x 100m means that PCM reaches 50 at 900m 900m The

scaleable window plots are also made for each reduced input

model as well in order to determine the behavior of the training of

the neural network against the goodness of 1047297t of the calibration

maps by input

The 1047297nal step for calibration is the selection of the network 1047297le

(Fig 4 items 16e19) with inputs that best represent land use

change and an assessment of how well the model predicts across

different spatial scales The network 1047297le with the weights bias

activation values are saved for the model with the inputs consid-

ered the best for the model application If the model does not

perform adequately (Fig 4 item 19) the user may consider other

input drivers or dropping drivers which reduce model goodness of

1047297t However if the drivers selected provide a positive contribution

to the goodness of 1047297t and the overall model is deemed adequate

then this network 1047297le is saved and used in the next step model

validation

35 Validation

We follow the recommended procedures of Pontius et al

(2004) and Pontius and Spencer (2005) to validate our model

Brie1047298y we use an independent data set across time to conduct an

historical forecast to compare a simulated map (Fig 4 15) with an

observed historical land use map that was not used to build the

ANN model (Fig 4 20) For example below (Section 46) we

describe how we use a 2006 land use map that was not used to

build the model to compare to a simulated map Validation metrics

(Fig 4 21) include the same as that used for calibration namely

PCM of the entire map or spatial unit sensitivity speci1047297city PCM

across window sizes and error of quantity It should be noted thatbecause we 1047297x the quantity of the land use class that changes be-

tween time 1 and time 2 for calibration we do so for validation as

well (eg between time 2 and time 3 the number of cells that

changed in the observed maps are used to1047297x the quantity of cells to

change in the simulation that forecasts time 3)

36 Forecasting

We designed the LTM-HPC so that the quantity model (Fig 4

24) of the forecasting component can be executed for any spatial

unit category like government units watersheds or ecoregions or

any spatial unit scale such as states counties or places The

quantity model is developed of 1047298ine using Exceland algorithms that

relate a principle index driver (PID see Pijanowski et al 2002a)that scales the amount of land use change (eg urban or crops) per

person In theapplication described below we execute the model at

several spatial unit scales e cities states and the lower 48 states

Using a combination of unique unit IDs (eg federal information

processing systems (FIPS) codes are used for government unit IDs)

a 1047297le and directory-naming system XML 1047297les and python scripts

the HPC was used to manage jobs and tasks organized by the

unique unit IDs

We next use a program written in C to convert probability

values to binarychange values (0 are cells without change and 1 are

locations of change in prediction map) using input from the

quantity change model (Fig 4 24) The quantity change model

produces a table of the number of cells to grow for each time step

for each spatial unit froma CSV 1047297

le Rowsin the CSV 1047297

le contain the

unique unit IDS and the number of cells to transition for each time

step The program reads the probability map for the spatial unit

(ie a particular city) being simulated counts the number of cells

for each probability value and then sorts the values and counts by

rank The original order is maintained using an index for each re-

cord The probability values with high rank are then converted to

urban (code 1) until the numbers of new urban cells for each unit is

satis1047297ed while other cells (code 0) remain without change A

separate GIS map (Fig 4 25) may be created that would apply

additional exclusionary rules to create an alternative scenario

Output from the model (Fig 4 item 26) is used for planning or

natural resource management (Skole et al 2002 Olson et al 2008)

(Fig 4 item 27) as input to other environmental models (eg Ray

et al 2012 Wiley et al 2010 Mishra et al 2010 or Yang et al

2010) (Fig 4 item 28) or the production of multimedia products

that can be ported to the internet (Fig 4 item 29)

37 HPC job con 1047297 guration

We developed a coding schema for the purposes of running the

simulation across multiple locations We used a standard

numbering system from the Federal Information Processing Sys-

tems (FIPS) that is associated with states counties and places FIPSis a hierarchical numbering system that assigns states a two-digit

code and a county in those states a three-digit code A speci1047297c

county is thus given a 1047297ve-digit integer value (eg 18157 for Tip-

pecanoe County Indiana) and places are given a seven-digit code

two digits for the state and 1047297ve digitsfor the place (eg1882862 for

the city of West Lafayette Indiana)

Con1047297guring HPC jobs and constructing the associated XML 1047297les

can be approached in different ways The 1047297rst is to develop one job

and one XML 1047297le per model simulation component (eg mosaick-

ing individual census place spatial maps into a national map) For

our LTM-HPC application where we would need to mosaic over

20000 census places a job failure for any of the places would result

in the one large job stopping and then addressing the need to

resume the execution at the point of failure A second approachused here is to group tasks into numerous jobs where the number

of jobs and associated XML 1047297les is still manageable A failure of one

census place would require less re-execution and trouble shooting

of that job We often grouped the execution of census place tasks by

state using the FIPS designator for both to assign names for input

and output 1047297les

Five different jobs are part of the LTM-HPC (Fig 7) those for

clipping a large 1047297le into smaller subsets another for mosaicking

smaller 1047297les into one large 1047297le one for controlling the calibration

programs another job for creating forecast maps and a 1047297fth for

controlling data transposing between ASCII 1047298at 1047297les and SNNS

pattern 1047297les XML 1047297les are used by the HPC job manager to subdi-

vide the job into tasks for example our national simulation

described below at county and places levels is organized by stateand thus the job contains 48 tasks one for each state Fig 7 is a

sample Windows jobs manager interface for mosaicking over

20000 places Each topline Fig 7 (item 1) represents an XML for a

region (state) with the status (item 2) Core resources are shown

(Fig 7 item 3) A tab (Fig 7 item 4) displays the status of each

task (Fig 7 item 5) within a job We used a python script to create

each of the xml 1047297les although any programming or scripting lan-

guage can be used

We then used an ArcGIS python script to mosaic the ASCII

maps an XML 1047297le that lists 1047297le and path names was used as input

to the python script Mosaicking and clipping are conducted in

ArcGIS using python scripts polygon_clippy and poly-

gon_mosaicpy Both ArcGIS python scripts read the digital spatial

unit codes from a variable in the shape 1047297

le attribute table and

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268258

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1019

names 1047297les based on the unit code The resultant mosaicked

suitability map produced from training and data transposing

constitutes a map of the entire study domain Creating such a

suitability map of the entire simulation domain allows us to (1)

import the ASCII 1047297le into ArcGIS in order to inspect and visualize

the suitability map (2) allow the researcher to use different

subsetting and mosaicking spatial units (as we did below) and (3)

allow the researcher to forecast at different spatial units (we also

illustrate this below as well)

4 Execution of LTM-HPC

41 Hardware and software description

We executed the LTM-HPC on three computer systems (Fig 8)

One computer a high-end workstation was used to process inputs

for the modeling using GIS A windows cluster was used to

con1047297gure the LTM-HPC and all of the processing of about a dozen

steps occurred on this computer system A third computer system

stored all of the data for the simulations Speci1047297c con1047297guration of

each computer system follows

Data preparation was performed on a high-end Windows 7

Enterprise 64-bit computer workstation equipped with 24 GB of

RAM a 256 GB solid state hard drive a 2 TB local hard drive and

ArcGIS 100 with Spatial Analyst extension Speci1047297c procedures

used to create each of the data layers for input to the LTM can be

found elsewhere (Pijanowski et al 1997 Tayyebi et al 2012)

Brie1047298y data were processed for the entire contiguous United States

at 30m resolution and distance to key features like roads and

streams were processed using the Euclidean Distance tool in Arc-

GIS setting all output to double precision integer given the large

size of each dataset we limited the distance to 250 km Once thedata were processed on the workstation 1047297les were moved to the

storage server

The hardware platform on which the parallelization was carried

out was a cluster of HPC consisting of 1047297ve nodes containing a total

of 20 cores Windows Server HPC Edition 2008 was installed on the

HPCC Each node was powered by a pair of dual core AMD Opteron

285 processors and 8 GB of RAM Each machine had two 1 GBs

network adapters with one used for cluster communication and the

other for external cluster communication Each node had 74 GB of

hard drive space that was used for the operating system and soft-

ware but was not used for modeling The HPC cluster used for our

national LTM application consisted of one server (ie head node)

that controls other servers (ie compute nodes) which read and

write data from a data server A cluster is the top-level unit which

Fig 7 Data structure programs and 1047297les associated with training by the neural network Item 1 represents an XML fora region (state) with the status (item 2) Core resources are

shown in item 3 Item 4 displays the status of each task (item 5) within a job

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 259

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119

is composed of nodes or single physical or logical computers with

one or more cores that include one or more processors All

modeling data was read and written to a storage machine located in

another building and transferred across an intranet with a

maximum of 1 Gigabit bandwidth

The data storage server was composed of 24 two terabyte

7200 RPM drives in a RAID 6 con1047297guration This server also had

Windows 2008 Server R2 installed Spot checks of resource moni-

toring showed that the HPC was not limited by network or disk

access and typically ran in bursts of 100 CPU utilization ArcGIS

100 with the Spatial Analyst extension was installed on all servers

Based on the results of the 1047297le number per folder and the use of

unique unit IDs as part of the 1047297le and directory-naming scheme we

used a hierarchical directory structure as shown in Fig 9 The upper

branches of the directory separate 1047297les into input and output di-

rectories and subfolders store data by type (ASC or PAT 1047297les)

location unit scale (national state) and for forecasts years and

scenarios

Fig 9 Directory structure for the LTM-HPC simulation

Fig 8 Computer systems involved in the LTM-HPC national simulations


4.2. Preliminary tests

The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files that SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As the SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
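
Because a Python script generated each job's XML, the per-task core cap was easy to vary by state. A hedged sketch of such a generator follows; the element and attribute names are illustrative stand-ins, not the exact Windows HPC Server 2008 job schema, and Mosaic_Python.py stands in for the mosaicking script:

    import xml.etree.ElementTree as ET

    LARGE_STATES = {"CA", "TX"}          # very large states get more cores

    def write_mosaic_job(state, years, out_path):
        # One job per state; one task per projection year.
        job = ET.Element("Job", Name="mosaic_%s" % state)
        tasks = ET.SubElement(job, "Tasks")
        cores = "12" if state in LARGE_STATES else "6"
        for year in years:
            ET.SubElement(tasks, "Task",
                          Name="mosaic_%s_%d" % (state, year),
                          MinCores=cores, MaxCores=cores,
                          CommandLine="python Mosaic_Python.py %s %d" % (state, year))
        ET.ElementTree(job).write(out_path)

    write_mosaic_job("IN", range(2010, 2060, 10), "mosaic_IN.xml")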

4.3. Data preparation for national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads) and slope.

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places is distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS designated census place value.
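
A sketch of the Thiessen step with arcpy; the file names are placeholders, while Create Thiessen Polygons is a standard ArcGIS Analysis tool (requiring, we note, an Advanced license):

    import arcpy

    # "ALL" copies the point attributes, including the FIPS place code,
    # onto the resulting polygons.
    arcpy.CreateThiessenPolygons_analysis(
        "census_place_points.shp",   # place centroids (illustrative name)
        "place_thiessen.shp",        # area closest to each urban center
        "ALL")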

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
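
Per-place clipping and conversion can be sketched as a loop over the place units; the layer names and the createpat invocation are assumptions of this sketch:

    import subprocess
    import arcpy
    from arcpy.sa import ExtractByMask

    arcpy.CheckOutExtension("Spatial")

    def clip_place(place_fips, drivers):
        # Clip each driver and the exclusionary map to one census place,
        # write ASC files, then hand them to createpat (ASC -> PAT).
        mask = "place_%s.shp" % place_fips
        for drv in drivers:
            clipped = ExtractByMask("%s.tif" % drv, mask)
            arcpy.RasterToASCII_conversion(clipped, "%s_%s.asc" % (drv, place_fips))
        subprocess.call(["createpat.exe", str(place_fips)])   # arguments assumed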

4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
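
Selecting the network file to carry forward can be scripted over the saved MSE log; a minimal sketch, assuming a dictionary mapping cycle number to MSE has already been parsed from the training output (the values shown are invented for illustration):

    def best_network_cycle(mse_by_cycle):
        # Return the training cycle (a multiple of 100) with the lowest
        # saved MSE; the matching SNNS NET file is then used for testing.
        return min(mse_by_cycle, key=mse_by_cycle.get)

    # e.g. best_network_cycle({49400: 0.0315, 49500: 0.0312, 49600: 0.0313})
    # -> 49500  (illustrative values only)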

Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile's PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use

Fig. 10. Spatial units involved in the LTM-HPC national simulation.


(values 0.0 to 1.0, where closer to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.
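
The scaling applied by convert2ascii amounts to one line; a Python equivalent of the C utility's core step (the function name is ours):

    def scale_probability(p):
        # Map a neural network output in [0.0, 1.0] to an integer raster
        # value, as convert2ascii does, so it survives integer storage.
        return int(round(p * 100000))

    assert scale_probability(0.73642) == 73642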

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full-input-variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments to weights, activation function values and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration (1) that all five variables contribute in a positive way toward the goodness of fit and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.
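
The percent difference plotted in Fig. 11 is a simple transform of the saved MSE series; a sketch, assuming one list of MSE values per model, aligned every 100 cycles:

    def pct_diff_from_full(mse_full, mse_drop):
        # Percent difference in MSE of a drop-one-out model relative to
        # the full five-variable model; positive values mean the dropped
        # variable was contributing to goodness of fit.
        return [100.0 * (d - f) / f for f, d in zip(mse_full, mse_drop)]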

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
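
The general statistics on a 01234-coded map reduce to per-code percentages; a numpy sketch (the array name and no-data handling are assumptions):

    import numpy as np

    def code_percentages(calib):
        # Percentage of each code 0-4 in a calibration or validation map,
        # ignoring no-data cells (assumed negative in this sketch).
        valid = calib[calib >= 0]
        counts = np.bincount(valid.astype(np.int64), minlength=5)
        return 100.0 * counts / counts.sum()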

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so this assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.

The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 101 × 101 window size (i.e., 3 km by 3 km) and PCM values were merged

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.


with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
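
For illustration, the scaleable-window routine can be written directly from its description: a false negative (code 1) and a false positive (code 2) falling inside the same window are counted as one additional correct prediction at that scale, and each cell is paired at most once. A deliberately unoptimized numpy sketch (the production routine ran per tile as HPC tasks):

    import numpy as np

    def pcm_at_window(calib, w):
        # Percent correct metric for a 01234-coded map at odd window size w.
        # Codes: 1 = missed change, 2 = falsely predicted change,
        # 3 = correctly predicted change.
        counted = np.zeros(calib.shape, dtype=bool)   # 1s/2s already paired
        threes = int((calib == 3).sum())
        half = w // 2
        rows, cols = calib.shape
        for i in range(rows):
            for j in range(cols):
                win = (slice(max(0, i - half), i + half + 1),
                       slice(max(0, j - half), j + half + 1))
                free = ~counted[win]
                ones = np.argwhere((calib[win] == 1) & free)
                twos = np.argwhere((calib[win] == 2) & free)
                pairs = min(len(ones), len(twos))
                if pairs:
                    view = counted[win]               # a view, so writes stick
                    view[tuple(ones[:pairs].T)] = True
                    view[tuple(twos[:pairs].T)] = True
                    threes += pairs
        observed = int(((calib == 1) | (calib == 3)).sum())
        return 100.0 * threes / observed if observed else 0.0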

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed a quantity model (Tayyebi et al., 2012) that contained the number of urban cells to grow for each polygon in 10-year time steps from 2010 to 2060. We considered each state as a job, including all of the

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.


polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level to create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic prediction pieces at the state level to create a national forecast map. The HPC also enabled us to export error messages to error files so that, if any task failed in a job, the standard out and standard error files provided a record of what each program did during execution. We embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
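
The per-polygon conversion from probability map plus cell quota to a binary forecast is, at heart, a rank-and-threshold; a numpy sketch of that step (the production implementation is the paper's C program, and tie-breaking by array order is an assumption here):

    import numpy as np

    def probability_to_forecast(prob, n_cells, exclude):
        # Transition the n_cells highest-probability eligible cells to
        # urban (1); all other cells stay 0.
        flat = np.where(exclude.ravel(), -1.0, prob.ravel().astype(float))
        order = np.argsort(flat)[::-1]        # highest suitability first
        out = np.zeros(flat.size, dtype=np.uint8)
        out[order[:n_cells]] = 1
        return out.reshape(prob.shape)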

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic limitations of this large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas, we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.

The high-level architecture was designed after the proper steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas when the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance in this processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the


ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997;

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are individually small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common, supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON) supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, and patterns of location error were different from those of quantity error (Fig. 12). The location of errors could also be important if they affect the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation. Data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage. The source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array (see the sketch after this list).

(3) Ability to conduct simultaneous analyses. The LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed. The previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.
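
As noted in item (2), memory independence comes from streaming the grid one line at a time; the model's converter does this in C, but the same pattern can be sketched in Python (assuming the common six-line ESRI ASCII grid header):

    def stream_asc_rows(path):
        # Yield one row of an ESRI ASCII grid at a time so memory use is
        # independent of raster size, mirroring the line-at-a-time C reader.
        with open(path) as f:
            header = {}
            for _ in range(6):   # ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value
                key, value = f.readline().split()
                header[key.lower()] = float(value)
            for line in f:
                yield [float(v) for v in line.split()]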

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry? Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.


Page 7: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 719

Fig 5 Data processing steps for converting data from a GIS format to a pattern 1047297le format for use in SNNS

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 819

neural network settles at a global minimum value MSE is calcu-

lated as the difference between the estimate produced by the

neural network (range 00e10) and the observed value of land use

change (0 or 1) MSE values are saved very 100 cycles and training is

generally followed out to about 100000 cycles The second set of

goodness of 1047297t metrics is those created from a calibration map A

calibration map is also constructed within the GIS using three maps

coded specially for assessment of model goodness of 1047297t A map of

observed change between the two historical maps (Fig 4 item 16)

is created such that observed change frac14 0 and no change is frac14 1 A

mapthat predicts the same land use changes over the same amount

of time (Fig 4 15) is coded so that predicted change is frac14 2 and no

predicted change is frac14 0 These two maps are then summed along

with the exclusionary zone map that is coded 0 frac14 location can

change and 4 frac14 location that needs to be excluded The resultant

calibration map (Fig 4 17) generates values 0 through 4 with

correct predictions of 0 frac14 correctly predicted no change and

3 frac14 correctly predicted change Values of 2 and 3 represent

different errors (omission and commission or false positive and

false negative) The proportion of each type of error and correctly

predicted location are used to calculate (1) the proportion of

correctly predicted change locations to the number of observed

change cells also called the percent correct metric (proportion of

correctly predicted land use changes to the number of observed

land use changes) or PCM (Pijanowski et al 2002a) (2) sensitivity

(the proportion of false positives) and speci1047297city (the proportion of

false negatives) and (3) scaleable PCM values across different

window sizes

Fig 6 shows how scaleable PCM values are calculated using a

01234-coded calibration map across different window sizes The

1047297rst step is to calculate the total number of true positives (cells

coded as 3s)in the calibration map (Fig 6A) For a givenwindowof

say(eg5 cells by 5 cells) a pair of false positives (cells code as 2s)

and false negative (cells coded as 1s) are considered together as a

correct prediction at that scale and window the number of 3s is

incremented by one for every pair of false positive and false

negative cells The window is then movedone position to theright

(Fig 6B) and pairs of 1s and2s areagain added to the total number

of 3s for that calibration map such that any 1s or 2s already

counted are not considered This moving N N window is passed

across the entire simulation area and the 1047297nal number of 3s

recorded (Fig 6C) The window size is then incremented by 2 (ie

the next window size after a 5 5 would be a 7 7) and after all

of the windows are considered in the map the process is repeated

Fig 6 Steps in the calculation of PCM across a moving scaleable window Part 6A calculates the total number of true positives (coded as 3s) The window is then moved one position

to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s This moving window is passed across the entire area and the 1047297nal number of 3s recorded (Part

6C) The window size is then incremented by 2 and the process is repeated Part 6D gives PCM across scaleable window sizes

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 257

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 919

(note that the number of 3s is reset to the number of 3s is the

entire calibration map) and the number of 3s saved for that

window size Window sizes that we often plot are between 3 and

101 Fig 6D gives an example PCM across scaleable window sizes

Note in this plot that the PCM begins to exceed 50 around a

window size of 9 9 which for this simulation conducted at

100x 100m means that PCM reaches 50 at 900m 900m The

scaleable window plots are also made for each reduced input

model as well in order to determine the behavior of the training of

the neural network against the goodness of 1047297t of the calibration

maps by input

The 1047297nal step for calibration is the selection of the network 1047297le

(Fig 4 items 16e19) with inputs that best represent land use

change and an assessment of how well the model predicts across

different spatial scales The network 1047297le with the weights bias

activation values are saved for the model with the inputs consid-

ered the best for the model application If the model does not

perform adequately (Fig 4 item 19) the user may consider other

input drivers or dropping drivers which reduce model goodness of

1047297t However if the drivers selected provide a positive contribution

to the goodness of 1047297t and the overall model is deemed adequate

then this network 1047297le is saved and used in the next step model

validation

35 Validation

We follow the recommended procedures of Pontius et al

(2004) and Pontius and Spencer (2005) to validate our model

Brie1047298y we use an independent data set across time to conduct an

historical forecast to compare a simulated map (Fig 4 15) with an

observed historical land use map that was not used to build the

ANN model (Fig 4 20) For example below (Section 46) we

describe how we use a 2006 land use map that was not used to

build the model to compare to a simulated map Validation metrics

(Fig 4 21) include the same as that used for calibration namely

PCM of the entire map or spatial unit sensitivity speci1047297city PCM

across window sizes and error of quantity It should be noted thatbecause we 1047297x the quantity of the land use class that changes be-

tween time 1 and time 2 for calibration we do so for validation as

well (eg between time 2 and time 3 the number of cells that

changed in the observed maps are used to1047297x the quantity of cells to

change in the simulation that forecasts time 3)

36 Forecasting

We designed the LTM-HPC so that the quantity model (Fig 4

24) of the forecasting component can be executed for any spatial

unit category like government units watersheds or ecoregions or

any spatial unit scale such as states counties or places The

quantity model is developed of 1047298ine using Exceland algorithms that

relate a principle index driver (PID see Pijanowski et al 2002a)that scales the amount of land use change (eg urban or crops) per

person In theapplication described below we execute the model at

several spatial unit scales e cities states and the lower 48 states

Using a combination of unique unit IDs (eg federal information

processing systems (FIPS) codes are used for government unit IDs)

a 1047297le and directory-naming system XML 1047297les and python scripts

the HPC was used to manage jobs and tasks organized by the

unique unit IDs

We next use a program written in C to convert probability

values to binarychange values (0 are cells without change and 1 are

locations of change in prediction map) using input from the

quantity change model (Fig 4 24) The quantity change model

produces a table of the number of cells to grow for each time step

for each spatial unit froma CSV 1047297

le Rowsin the CSV 1047297

le contain the

unique unit IDS and the number of cells to transition for each time

step The program reads the probability map for the spatial unit

(ie a particular city) being simulated counts the number of cells

for each probability value and then sorts the values and counts by

rank The original order is maintained using an index for each re-

cord The probability values with high rank are then converted to

urban (code 1) until the numbers of new urban cells for each unit is

satis1047297ed while other cells (code 0) remain without change A

separate GIS map (Fig 4 25) may be created that would apply

additional exclusionary rules to create an alternative scenario

Output from the model (Fig 4 item 26) is used for planning or

natural resource management (Skole et al 2002 Olson et al 2008)

(Fig 4 item 27) as input to other environmental models (eg Ray

et al 2012 Wiley et al 2010 Mishra et al 2010 or Yang et al

2010) (Fig 4 item 28) or the production of multimedia products

that can be ported to the internet (Fig 4 item 29)

37 HPC job con 1047297 guration

We developed a coding schema for the purposes of running the

simulation across multiple locations We used a standard

numbering system from the Federal Information Processing Sys-

tems (FIPS) that is associated with states counties and places FIPSis a hierarchical numbering system that assigns states a two-digit

code and a county in those states a three-digit code A speci1047297c

county is thus given a 1047297ve-digit integer value (eg 18157 for Tip-

pecanoe County Indiana) and places are given a seven-digit code

two digits for the state and 1047297ve digitsfor the place (eg1882862 for

the city of West Lafayette Indiana)

Con1047297guring HPC jobs and constructing the associated XML 1047297les

can be approached in different ways The 1047297rst is to develop one job

and one XML 1047297le per model simulation component (eg mosaick-

ing individual census place spatial maps into a national map) For

our LTM-HPC application where we would need to mosaic over

20000 census places a job failure for any of the places would result

in the one large job stopping and then addressing the need to

resume the execution at the point of failure A second approachused here is to group tasks into numerous jobs where the number

of jobs and associated XML 1047297les is still manageable A failure of one

census place would require less re-execution and trouble shooting

of that job We often grouped the execution of census place tasks by

state using the FIPS designator for both to assign names for input

and output 1047297les

Five different jobs are part of the LTM-HPC (Fig 7) those for

clipping a large 1047297le into smaller subsets another for mosaicking

smaller 1047297les into one large 1047297le one for controlling the calibration

programs another job for creating forecast maps and a 1047297fth for

controlling data transposing between ASCII 1047298at 1047297les and SNNS

pattern 1047297les XML 1047297les are used by the HPC job manager to subdi-

vide the job into tasks for example our national simulation

described below at county and places levels is organized by stateand thus the job contains 48 tasks one for each state Fig 7 is a

sample Windows jobs manager interface for mosaicking over

20000 places Each topline Fig 7 (item 1) represents an XML for a

region (state) with the status (item 2) Core resources are shown

(Fig 7 item 3) A tab (Fig 7 item 4) displays the status of each

task (Fig 7 item 5) within a job We used a python script to create

each of the xml 1047297les although any programming or scripting lan-

guage can be used

We then used an ArcGIS python script to mosaic the ASCII

maps an XML 1047297le that lists 1047297le and path names was used as input

to the python script Mosaicking and clipping are conducted in

ArcGIS using python scripts polygon_clippy and poly-

gon_mosaicpy Both ArcGIS python scripts read the digital spatial

unit codes from a variable in the shape 1047297

le attribute table and

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268258

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1019

names 1047297les based on the unit code The resultant mosaicked

suitability map produced from training and data transposing

constitutes a map of the entire study domain Creating such a

suitability map of the entire simulation domain allows us to (1)

import the ASCII 1047297le into ArcGIS in order to inspect and visualize

the suitability map (2) allow the researcher to use different

subsetting and mosaicking spatial units (as we did below) and (3)

allow the researcher to forecast at different spatial units (we also

illustrate this below as well)

4 Execution of LTM-HPC

41 Hardware and software description

We executed the LTM-HPC on three computer systems (Fig 8)

One computer a high-end workstation was used to process inputs

for the modeling using GIS A windows cluster was used to

con1047297gure the LTM-HPC and all of the processing of about a dozen

steps occurred on this computer system A third computer system

stored all of the data for the simulations Speci1047297c con1047297guration of

each computer system follows

Data preparation was performed on a high-end Windows 7

Enterprise 64-bit computer workstation equipped with 24 GB of

RAM a 256 GB solid state hard drive a 2 TB local hard drive and

ArcGIS 100 with Spatial Analyst extension Speci1047297c procedures

used to create each of the data layers for input to the LTM can be

found elsewhere (Pijanowski et al 1997 Tayyebi et al 2012)

Brie1047298y data were processed for the entire contiguous United States

at 30m resolution and distance to key features like roads and

streams were processed using the Euclidean Distance tool in Arc-

GIS setting all output to double precision integer given the large

size of each dataset we limited the distance to 250 km Once thedata were processed on the workstation 1047297les were moved to the

storage server

The hardware platform on which the parallelization was carried

out was a cluster of HPC consisting of 1047297ve nodes containing a total

of 20 cores Windows Server HPC Edition 2008 was installed on the

HPCC Each node was powered by a pair of dual core AMD Opteron

285 processors and 8 GB of RAM Each machine had two 1 GBs

network adapters with one used for cluster communication and the

other for external cluster communication Each node had 74 GB of

hard drive space that was used for the operating system and soft-

ware but was not used for modeling The HPC cluster used for our

national LTM application consisted of one server (ie head node)

that controls other servers (ie compute nodes) which read and

write data from a data server A cluster is the top-level unit which

Fig 7 Data structure programs and 1047297les associated with training by the neural network Item 1 represents an XML fora region (state) with the status (item 2) Core resources are

shown in item 3 Item 4 displays the status of each task (item 5) within a job

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 259

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119

is composed of nodes or single physical or logical computers with

one or more cores that include one or more processors All

modeling data was read and written to a storage machine located in

another building and transferred across an intranet with a

maximum of 1 Gigabit bandwidth

The data storage server was composed of 24 two terabyte

7200 RPM drives in a RAID 6 con1047297guration This server also had

Windows 2008 Server R2 installed Spot checks of resource moni-

toring showed that the HPC was not limited by network or disk

access and typically ran in bursts of 100 CPU utilization ArcGIS

100 with the Spatial Analyst extension was installed on all servers

Based on the results of the 1047297le number per folder and the use of

unique unit IDs as part of the 1047297le and directory-naming scheme we

used a hierarchical directory structure as shown in Fig 9 The upper

branches of the directory separate 1047297les into input and output di-

rectories and subfolders store data by type (ASC or PAT 1047297les)

location unit scale (national state) and for forecasts years and

scenarios

Fig 9 Directory structure for the LTM-HPC simulation

Fig 8 Computer systems involved in the LTM-HPC national simulations

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268260

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219

42 Preliminary tests

The primary limitation in 1047297le size comes from SNNS This limit

was reached in the probability map creation phase in several

western US counties when the RES1047297lewhichcontainsthe values

for all of the drivers (eg distance to urban etc) crashed To

overcome this issue we divided the country into grids that pro-

duced 1047297les that SNNS was capable of handling for the steps up to

and including pattern 1047297le creation which is done on a pixel-by-

pixel basis and is not spatially dependent For organization and

performance reasons1047297leswere grouped into folders by stateAs the

SNNS only uses the probability values in the projection phase we

were able to project at the county level

Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue, and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
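As an illustration of how such per-task core caps might be generated, the sketch below writes one mosaic task per state into a jobs file; the element and attribute names are schematic stand-ins, not the literal Windows HPC Pack schema.

    # Minimal sketch: one mosaic task per state, with a per-task core cap.
    LARGE_STATES = {"CA", "TX"}          # very large states get 12 cores

    def task_xml(state):
        cores = 12 if state in LARGE_STATES else 6
        return ('  <Task name="mosaic_%s" cores="%d" '
                'command="python mosaic.py %s" />' % (state, cores, state))

    states = ["IN", "IL", "CA", "TX"]    # abbreviated list for the example
    with open("mosaic_job.xml", "w") as f:
        f.write("<Job>\n")
        for s in states:
            f.write(task_xml(s) + "\n")
        f.write("</Job>\n")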

4.3. Data preparation for national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shapefiles: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance from each pixel to the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
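A minimal arcpy sketch of this distance processing is shown below; the layer names are hypothetical, and the 250 km cap follows the limit described for data preparation in Section 4.1.

    import arcpy
    from arcpy.sa import EucDistance

    arcpy.CheckOutExtension("Spatial")   # Spatial Analyst license
    arcpy.env.cellSize = 30              # 30 m national grid

    # Distance from every cell to the nearest highway, capped at 250 km
    dist_highways = EucDistance("highways.shp", 250000)
    dist_highways.save("dist_highways.tif")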

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local government and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places is distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.
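A sketch of the Thiessen construction with arcpy might look like the following; the file names are illustrative, and CreateThiessenPolygons is the standard Analysis toolbox tool.

    import arcpy

    # Build Thiessen (Voronoi) polygons around census-place centroids,
    # carrying the FIPS place code through to the output polygons.
    arcpy.CreateThiessenPolygons_analysis(
        in_features="place_centroids.shp",    # illustrative input name
        out_feature_class="place_thiessen.shp",
        fields_to_copy="ALL")                 # keeps the FIPS place field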

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.

4.4. Pattern recognition simulations for national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., potential for a land use change to occur at each location) for testing.

Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where closer

Fig. 10. Spatial units involved in the LTM-HPC national simulation.


to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double-precision integers.
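The scaling step can be sketched in a few lines of Python (an illustration of what convert2ascii.exe does, not the C source itself), reading line by line in the streaming style described in Section 5.1; the file names are hypothetical and SNNS header handling is omitted for brevity.

    # Sketch of the RES -> ASC scaling: probabilities in [0.0, 1.0] become
    # integers in [0, 100000]. Reads and writes one line at a time so memory
    # use stays bounded. (SNNS header lines are ignored here for brevity.)
    with open("tile_042.res") as res, open("tile_042.asc", "w") as asc:
        for line in res:
            values = [int(round(float(v) * 100000)) for v in line.split()]
            asc.write(" ".join(str(v) for v in values) + "\n")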

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables, and five that contained four input variables each, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full input variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values for activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration (1) that all five variables contribute in a positive way toward the goodness of fit, and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.
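The percent-difference curves of Fig. 11 can be computed with a short function such as the sketch below, where mse_full and mse_reduced stand for the MSE series logged every 100 cycles (the names are illustrative).

    # Percent difference in MSE of each drop-one-out model from the full model,
    # evaluated at the same saved cycles (every 100 cycles to 100,000).
    def pct_diff(mse_reduced, mse_full):
        return [100.0 * (r - f) / f for r, f in zip(mse_reduced, mse_full)]

    # Positive values mean the reduced model fits worse, i.e., the dropped
    # variable contributes positively to goodness of fit.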

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
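Following the 01234 coding described earlier for calibration maps (observed change = 1, predicted change = 2, exclusion = 4, summed), a numpy sketch of the map construction and of the percent correct metric is given below; the array names are illustrative.

    import numpy as np

    # 01234-coded calibration map (a sketch): summing the coded layers yields
    # 0 = correct no-change, 1/2 = the two error types, 3 = correct change,
    # 4 = excluded.
    def coded_map(observed, predicted, excluded):
        coded = observed.astype(np.int32) + 2 * predicted.astype(np.int32)
        coded[excluded] = 4
        return coded

    # PCM: correctly predicted change / observed change (codes 1 and 3).
    def pcm(coded):
        observed_change = np.sum((coded == 1) | (coded == 3))
        return 100.0 * np.sum(coded == 3) / max(1, observed_change)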

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so this assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles. Only a few along coastal areas contained errors in quantity of urban greater than 5%.

The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric was saved at the 101 x 101 window size (i.e., approximately 3 km by 3 km), and PCM values were merged

Fig. 11. Drop-one-out percent difference in MSE from the full-driver model.


with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
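A simplified sketch of the scaleable window rule is given below: within each window, pairs of false negatives (code 1) and false positives (code 2) are counted as correct at that scale. Non-overlapping blocks are used here for brevity; the full routine slides the window one cell at a time and tracks already-paired cells.

    import numpy as np

    def scaleable_pcm(coded, n):
        """PCM at window size n x n for a 01234-coded map (simplified sketch)."""
        hits = np.sum(coded == 3)                       # exact-location hits
        observed = np.sum((coded == 1) | (coded == 3))  # all observed change
        rows, cols = coded.shape
        for i in range(0, rows - n + 1, n):
            for j in range(0, cols - n + 1, n):
                win = coded[i:i + n, j:j + n]
                # each paired miss (1) and false alarm (2) counts as a hit
                hits += min(np.sum(win == 1), np.sum(win == 2))
        return 100.0 * hits / max(1, observed)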

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We considered each state as a job, including all the

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.


polygons within the state as different tasks, to create forecast maps of each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided records of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
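Schematically, the state-level prediction jobs were organized as in the sketch below (one job per state, one task per polygon); the helper and the XML element names are illustrative, not the exact XML_Pred_BASE contents.

    # One job per state; one task per polygon. Each task records stdout and
    # stderr so failed tasks can be traced after the run (names illustrative).
    def write_pred_job(state_fips, polygon_ids, cells_to_grow):
        lines = ["<Job name='pred_%s'>" % state_fips]
        for pid in polygon_ids:
            lines.append(
                "  <Task command='predict.exe %s %d' stdout='logs/%s.out' "
                "stderr='logs/%s.err' />" % (pid, cells_to_grow[pid], pid, pid))
        lines.append("</Job>")
        return "\n".join(lines)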

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic limitations of that large metropolitan area. Dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of a single-workstation land change model that has been converted to operate using a high-performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger-sized land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time, rather than reading large files into a large array. This was needed despite the large amount of memory contained in the HPC.
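The same streaming pattern can be sketched in Python: process each row as it is read instead of loading the whole grid into memory (the file layout and function name are illustrative).

    # Stream a large ASC grid row by row; memory use is bounded by one row,
    # not the full raster (a sketch of the line-at-a-time approach).
    def row_sums(path):
        totals = []
        with open(path) as grid:
            for line in grid:
                row = [float(v) for v in line.split()]
                totals.append(sum(row))
        return totals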

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed for steps that interfaced with each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed. This indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted. Jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired. The job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed. A frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas, as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; deviations from this scaling are caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output that has multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse-resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the


ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997;

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that, because such a large area may be heterogeneous, several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine-resolution accuracy; (2) a quantification of the variability of fine-resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Quantity errors were greater in the eastern portion of the United States. Patterns of location error differed from those of quantity: location errors were lower in the east (Fig. 12). The location of errors could be important, too, if they affect the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of


the ASCII files, by reading each line separately into an array in the C environment.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into an HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.

Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.

Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.

Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.

Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.

Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.

Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.

Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.

Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.

Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.

Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.

Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.

Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.

Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.

Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.

Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.

Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.

Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.

Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.

Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.

Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.

ESRI, 2011. ArcGIS 10 Software.

Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.

Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.

Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.

Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.

GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.

Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.

Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.

Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.

Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.

Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.

Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.

Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.

Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.

Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.

Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.

Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.

LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.

Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.

Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.

Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.

Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.

MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.

Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.

Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.

Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.

Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.

Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.

Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.

Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.

Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.

Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.

Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.

Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center, June.



that controls other servers (ie compute nodes) which read and

write data from a data server A cluster is the top-level unit which

Fig 7 Data structure programs and 1047297les associated with training by the neural network Item 1 represents an XML fora region (state) with the status (item 2) Core resources are

shown in item 3 Item 4 displays the status of each task (item 5) within a job

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 259

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119

is composed of nodes or single physical or logical computers with

one or more cores that include one or more processors All

modeling data was read and written to a storage machine located in

another building and transferred across an intranet with a

maximum of 1 Gigabit bandwidth

The data storage server was composed of 24 two terabyte

7200 RPM drives in a RAID 6 con1047297guration This server also had

Windows 2008 Server R2 installed Spot checks of resource moni-

toring showed that the HPC was not limited by network or disk

access and typically ran in bursts of 100 CPU utilization ArcGIS

100 with the Spatial Analyst extension was installed on all servers

Based on the results of the 1047297le number per folder and the use of

unique unit IDs as part of the 1047297le and directory-naming scheme we

used a hierarchical directory structure as shown in Fig 9 The upper

branches of the directory separate 1047297les into input and output di-

rectories and subfolders store data by type (ASC or PAT 1047297les)

location unit scale (national state) and for forecasts years and

scenarios

Fig 9 Directory structure for the LTM-HPC simulation

Fig 8 Computer systems involved in the LTM-HPC national simulations

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268260

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219

42 Preliminary tests

The primary limitation in 1047297le size comes from SNNS This limit

was reached in the probability map creation phase in several

western US counties when the RES1047297lewhichcontainsthe values

for all of the drivers (eg distance to urban etc) crashed To

overcome this issue we divided the country into grids that pro-

duced 1047297les that SNNS was capable of handling for the steps up to

and including pattern 1047297le creation which is done on a pixel-by-

pixel basis and is not spatially dependent For organization and

performance reasons1047297leswere grouped into folders by stateAs the

SNNS only uses the probability values in the projection phase we

were able to project at the county level

Early tests with mosaicking the entire country at once were

unsuccessful and led to mosaicking by state The number of states

and years of projection for each state made populating the tool

1047297elds in ArcGIS 100 Desktop a time intensive process We used

python scripts to overcome this issue and the HPC to process

multiple years and multiple states at the same time Although it is

possible to run one mosaic operation for each core we found that

running 24 operations on a machine led to corrupted mosaics We

attribute this to the large 1047297le sizes and limited scratch space

(approximately 200 GB) and to overcome this problem we limitedthe number of operations per server by specifying each task to 6

cores for most states and 12 cores for very large states such as CA

and TX

43 Data preparation for national simulation

We used ArcGIS 100 and Spatial Analyst to prepare 1047297ve inputs

for use in training and testing of the neural network Details of the

data preparation can be found elsewhere (Tayyebi et al 2012)

although a brief description of processing and the 1047297les that were

created follow We used the US Census 2000 road network line

work to create two road shape 1047297les highways and main arterials

We used ArcGIS 100 Spatial Analyst to calculate the distance that

each pixel was away from the nearest road Other inputs includeddistance to previous urban (circa 1990) distance to rivers and

streams distance to primary roads (highways) distance to sec-

ondary roads (roads) and slope

Preparing data for neural net training required the following

steps Land use data from approximately 1990 and 2000 were

collected from 18 different municipalities and 3 states These data

were derived from aerial photography by local government and

were thus deemed to be of high quality Original data were vector

and they were converted to raster using the simulation dimensions

described above Data from states were used to select regions in

rural areas using a random site selection procedure (described in

Tayyebi et al 2012)

Maps of public lands were obtained from ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city, or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with its FIPS-designated census place value.

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county, and places (cities, villages, and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
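A schematic version of this per-place clipping and conversion step is sketched below; the raster names, output paths, and the createpat command-line interface are placeholders rather than the actual LTM-clippy code.

    # Hypothetical sketch of the clip-and-convert loop: for each census
    # place, export the clipped drivers to ASCII and build an SNNS
    # pattern (PAT) file. The createpat interface shown here is assumed.
    import subprocess
    import arcpy

    DRIVERS = ["dist_urban", "dist_streams", "dist_highway",
               "dist_road", "slope"]

    def build_pat(place_fips, extent):
        arcpy.env.extent = extent    # limit processing to this place
        for d in DRIVERS:
            arcpy.RasterToASCII_conversion(
                d, "asc/%s_%s.asc" % (place_fips, d))
        subprocess.check_call(["createpat.exe", "asc", place_fips])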

4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
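The bookkeeping behind this choice of stopping cycle is straightforward; the sketch below is ours, not the SNNS tooling, and assumes a log with one "cycle mse" pair per line, written every 100 cycles.

    # Sketch of the MSE monitoring described above; the log format is
    # an assumption.
    def best_cycle(log_path):
        best_cycle_no, best_mse = None, float("inf")
        with open(log_path) as fh:
            for line in fh:
                cycle, mse = line.split()
                if float(mse) < best_mse:
                    best_cycle_no, best_mse = int(cycle), float(mse)
        return best_cycle_no, best_mse

    # For the national run reported here, this procedure selects
    # cycle 49,500.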

Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where closer to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as integers.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.
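The scaling performed by convert2ascii amounts to one line of arithmetic; a minimal Python equivalent is shown below for clarity.

    # Equivalent of the convert2ascii value scaling: map a probability
    # in [0.0, 1.0] to an integer in [0, 100000] so the national mosaic
    # can be stored with an integer raster type.
    def scale_probability(p):
        return int(round(p * 100000))   # e.g. 0.73542 -> 73542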

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that each contained four input variables, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of the MSE from the full input variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration (1) that all five variables contribute in a positive way toward the goodness of fit, and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.
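The diagnostic plotted in Fig. 11 is the percent difference in MSE between each reduced (drop-one-out) model and the full five-variable model, computed cycle by cycle; a minimal sketch follows.

    # Percent difference in MSE relative to the full model at each
    # logged cycle; positive values mean the reduced model fits worse.
    def percent_difference(full_mse, reduced_mse):
        return [100.0 * (r - f) / f for f, r in zip(full_mse, reduced_mse)]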

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so this assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles. Only a few along coastal areas contained errors in quantity of urban greater than 5%.

The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric (PCM) was saved at the 101 × 101 window size (i.e., 3 km by 3 km), and PCM values were merged with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
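Our reading of the scaleable window routine is sketched below. It is a simplified, assumed implementation, not the authors' code: a window is scored as correct when the number of observed-change cells matches the number of predicted-change cells within it, assuming a coding where 1 = observed change missed, 2 = change predicted in error, and 3 = change correctly predicted.

    # Simplified sketch (assumed coding: 1 = miss, 2 = false alarm,
    # 3 = hit) of a percent-correct metric over n x n windows.
    import numpy as np

    def pcm(coded, n):
        rows, cols = coded.shape
        correct = total = 0
        for i in range(0, rows - n + 1, n):
            for j in range(0, cols - n + 1, n):
                w = coded[i:i + n, j:j + n]
                observed = np.count_nonzero((w == 1) | (w == 3))
                predicted = np.count_nonzero((w == 2) | (w == 3))
                if observed or predicted:
                    total += 1
                    correct += int(observed == predicted)
        return 100.0 * correct / total if total else 0.0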

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We considered each state as a job, including all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
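Because the XML jobs files were generated by script, embedding these paths was mechanical. The generator below is a schematic reconstruction: the element and attribute names are illustrative stand-ins, not the exact Windows HPC job schema.

    # Schematic per-state job generator: one task per polygon, with
    # stdout and stderr paths embedded in each task. Attribute names
    # are assumed, not the exact HPC schema.
    def write_pred_job(state_fips, polygon_ids, out_path):
        lines = ['<Job Name="pred_%s">' % state_fips, '  <Tasks>']
        for pid in polygon_ids:
            lines.append(
                '    <Task CommandLine="pred.exe %s" '
                'StdOutFilePath="out/%s.txt" StdErrFilePath="err/%s.txt"/>'
                % (pid, pid, pid))
        lines += ['  </Tasks>', '</Job>']
        with open(out_path, "w") as fh:
            fh.write("\n".join(lines))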

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic limitations of that large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.

5. Discussion

We presented an overview of a single-workstation land change model that has been converted to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger sized land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
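The same streaming pattern is shown here in Python for brevity (the production programs were written in C): one row of an ESRI ASCII grid is processed at a time instead of holding the national raster in memory.

    # Stream an ASCII grid row by row; the six-line header (ncols,
    # nrows, xllcorner, yllcorner, cellsize, NODATA_value) is copied
    # through unchanged.
    def stream_rows(in_path, out_path, transform):
        with open(in_path) as src, open(out_path, "w") as dst:
            for _ in range(6):
                dst.write(src.readline())
            for row in src:
                values = (str(transform(int(v))) for v in row.split())
                dst.write(" ".join(values) + "\n")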

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.
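As a concrete, hypothetical illustration of this coding system: because FIPS codes are hierarchical, a single unit ID determines its place in the folder tree. The folder names below are illustrative only.

    # Hypothetical path builder: two-digit state prefix, full unit
    # code, optional forecast year.
    def path_for(unit_code, kind="output", year=None):
        parts = ["D:/ltm", kind, unit_code[:2], unit_code]
        if year is not None:
            parts.append(str(year))
        return "/".join(parts)

    # path_for("1882862", "output", 2050)
    #   -> "D:/ltm/output/18/1882862/2050"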

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed for steps that interfaced with each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job failed: one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas, as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance around this ideal scaling is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include the reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
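The speed-up we observed can be stated as a simple scaling relation (an idealization inferred from this observation, not a measured benchmark): T(n) ≈ T(1)/n, where T(1) is the serial LTM run time and n is the number of cores; uneven file sizes across spatial units push real runs somewhat away from this ideal.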

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds, and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous: several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, while patterns of location error differed from those of quantity and were lower in the east (Fig. 12). The location of errors could be important, too, if they affect the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array in the C environment.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.

Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.

Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.

Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.

Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.

Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.

Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.

Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.

Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.

Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.

Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.

Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.

Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.

Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.

Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.

Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.

Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.

Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.

Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.

Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.

Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.

ESRI, 2011. ArcGIS 10 Software.

Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.

Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.

Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.

Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.

GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.

Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.

Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.

Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.

Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.

Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.

Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.

Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.

Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry? Trends Ecol. Evol. 21 (4), 192–199.

Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.

Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.

Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.

LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.

Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.

Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.

Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.

Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.

MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.

Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.

Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.

Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.

Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.

Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate–land interactions in East Africa. GeoForum 39 (2), 898–911.

Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.

Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.

Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.

Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.

Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.

Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919

Page 9: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 919

(note that the number of 3s is reset to the number of 3s is the

entire calibration map) and the number of 3s saved for that

window size Window sizes that we often plot are between 3 and

101 Fig 6D gives an example PCM across scaleable window sizes

Note in this plot that the PCM begins to exceed 50 around a

window size of 9 9 which for this simulation conducted at

100x 100m means that PCM reaches 50 at 900m 900m The

scaleable window plots are also made for each reduced input

model as well in order to determine the behavior of the training of

the neural network against the goodness of 1047297t of the calibration

maps by input

The 1047297nal step for calibration is the selection of the network 1047297le

(Fig 4 items 16e19) with inputs that best represent land use

change and an assessment of how well the model predicts across

different spatial scales The network 1047297le with the weights bias

activation values are saved for the model with the inputs consid-

ered the best for the model application If the model does not

perform adequately (Fig 4 item 19) the user may consider other

input drivers or dropping drivers which reduce model goodness of

1047297t However if the drivers selected provide a positive contribution

to the goodness of 1047297t and the overall model is deemed adequate

then this network 1047297le is saved and used in the next step model

validation

35 Validation

We follow the recommended procedures of Pontius et al

(2004) and Pontius and Spencer (2005) to validate our model

Brie1047298y we use an independent data set across time to conduct an

historical forecast to compare a simulated map (Fig 4 15) with an

observed historical land use map that was not used to build the

ANN model (Fig 4 20) For example below (Section 46) we

describe how we use a 2006 land use map that was not used to

build the model to compare to a simulated map Validation metrics

(Fig 4 21) include the same as that used for calibration namely

PCM of the entire map or spatial unit sensitivity speci1047297city PCM

across window sizes and error of quantity It should be noted thatbecause we 1047297x the quantity of the land use class that changes be-

tween time 1 and time 2 for calibration we do so for validation as

well (eg between time 2 and time 3 the number of cells that

changed in the observed maps are used to1047297x the quantity of cells to

change in the simulation that forecasts time 3)

36 Forecasting

We designed the LTM-HPC so that the quantity model (Fig 4

24) of the forecasting component can be executed for any spatial

unit category like government units watersheds or ecoregions or

any spatial unit scale such as states counties or places The

quantity model is developed of 1047298ine using Exceland algorithms that

relate a principle index driver (PID see Pijanowski et al 2002a)that scales the amount of land use change (eg urban or crops) per

person In theapplication described below we execute the model at

several spatial unit scales e cities states and the lower 48 states

Using a combination of unique unit IDs (eg federal information

processing systems (FIPS) codes are used for government unit IDs)

a 1047297le and directory-naming system XML 1047297les and python scripts

the HPC was used to manage jobs and tasks organized by the

unique unit IDs

We next use a program written in C to convert probability

values to binarychange values (0 are cells without change and 1 are

locations of change in prediction map) using input from the

quantity change model (Fig 4 24) The quantity change model

produces a table of the number of cells to grow for each time step

for each spatial unit froma CSV 1047297

le Rowsin the CSV 1047297

le contain the

unique unit IDS and the number of cells to transition for each time

step The program reads the probability map for the spatial unit

(ie a particular city) being simulated counts the number of cells

for each probability value and then sorts the values and counts by

rank The original order is maintained using an index for each re-

cord The probability values with high rank are then converted to

urban (code 1) until the numbers of new urban cells for each unit is

satis1047297ed while other cells (code 0) remain without change A

separate GIS map (Fig 4 25) may be created that would apply

additional exclusionary rules to create an alternative scenario

Output from the model (Fig 4 item 26) is used for planning or

natural resource management (Skole et al 2002 Olson et al 2008)

(Fig 4 item 27) as input to other environmental models (eg Ray

et al 2012 Wiley et al 2010 Mishra et al 2010 or Yang et al

2010) (Fig 4 item 28) or the production of multimedia products

that can be ported to the internet (Fig 4 item 29)

37 HPC job con 1047297 guration

We developed a coding schema for the purposes of running the

simulation across multiple locations We used a standard

numbering system from the Federal Information Processing Sys-

tems (FIPS) that is associated with states counties and places FIPSis a hierarchical numbering system that assigns states a two-digit

code and a county in those states a three-digit code A speci1047297c

county is thus given a 1047297ve-digit integer value (eg 18157 for Tip-

pecanoe County Indiana) and places are given a seven-digit code

two digits for the state and 1047297ve digitsfor the place (eg1882862 for

the city of West Lafayette Indiana)

Con1047297guring HPC jobs and constructing the associated XML 1047297les

can be approached in different ways The 1047297rst is to develop one job

and one XML 1047297le per model simulation component (eg mosaick-

ing individual census place spatial maps into a national map) For

our LTM-HPC application where we would need to mosaic over

20000 census places a job failure for any of the places would result

in the one large job stopping and then addressing the need to

resume the execution at the point of failure A second approachused here is to group tasks into numerous jobs where the number

of jobs and associated XML 1047297les is still manageable A failure of one

census place would require less re-execution and trouble shooting

of that job We often grouped the execution of census place tasks by

state using the FIPS designator for both to assign names for input

and output 1047297les

Five different jobs are part of the LTM-HPC (Fig 7) those for

clipping a large 1047297le into smaller subsets another for mosaicking

smaller 1047297les into one large 1047297le one for controlling the calibration

programs another job for creating forecast maps and a 1047297fth for

controlling data transposing between ASCII 1047298at 1047297les and SNNS

pattern 1047297les XML 1047297les are used by the HPC job manager to subdi-

vide the job into tasks for example our national simulation

described below at county and places levels is organized by stateand thus the job contains 48 tasks one for each state Fig 7 is a

sample Windows jobs manager interface for mosaicking over

20000 places Each topline Fig 7 (item 1) represents an XML for a

region (state) with the status (item 2) Core resources are shown

(Fig 7 item 3) A tab (Fig 7 item 4) displays the status of each

task (Fig 7 item 5) within a job We used a python script to create

each of the xml 1047297les although any programming or scripting lan-

guage can be used

We then used an ArcGIS python script to mosaic the ASCII

maps an XML 1047297le that lists 1047297le and path names was used as input

to the python script Mosaicking and clipping are conducted in

ArcGIS using python scripts polygon_clippy and poly-

gon_mosaicpy Both ArcGIS python scripts read the digital spatial

unit codes from a variable in the shape 1047297

le attribute table and

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268258

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1019

names 1047297les based on the unit code The resultant mosaicked

suitability map produced from training and data transposing

constitutes a map of the entire study domain Creating such a

suitability map of the entire simulation domain allows us to (1)

import the ASCII 1047297le into ArcGIS in order to inspect and visualize

the suitability map (2) allow the researcher to use different

subsetting and mosaicking spatial units (as we did below) and (3)

allow the researcher to forecast at different spatial units (we also

illustrate this below as well)

4 Execution of LTM-HPC

41 Hardware and software description

We executed the LTM-HPC on three computer systems (Fig 8)

One computer a high-end workstation was used to process inputs

for the modeling using GIS A windows cluster was used to

con1047297gure the LTM-HPC and all of the processing of about a dozen

steps occurred on this computer system A third computer system

stored all of the data for the simulations Speci1047297c con1047297guration of

each computer system follows

Data preparation was performed on a high-end Windows 7

Enterprise 64-bit computer workstation equipped with 24 GB of

RAM a 256 GB solid state hard drive a 2 TB local hard drive and

ArcGIS 100 with Spatial Analyst extension Speci1047297c procedures

used to create each of the data layers for input to the LTM can be

found elsewhere (Pijanowski et al 1997 Tayyebi et al 2012)

Brie1047298y data were processed for the entire contiguous United States

at 30m resolution and distance to key features like roads and

streams were processed using the Euclidean Distance tool in Arc-

GIS setting all output to double precision integer given the large

size of each dataset we limited the distance to 250 km Once thedata were processed on the workstation 1047297les were moved to the

storage server

The hardware platform on which the parallelization was carried

out was a cluster of HPC consisting of 1047297ve nodes containing a total

of 20 cores Windows Server HPC Edition 2008 was installed on the

HPCC Each node was powered by a pair of dual core AMD Opteron

285 processors and 8 GB of RAM Each machine had two 1 GBs

network adapters with one used for cluster communication and the

other for external cluster communication Each node had 74 GB of

hard drive space that was used for the operating system and soft-

ware but was not used for modeling The HPC cluster used for our

national LTM application consisted of one server (ie head node)

that controls other servers (ie compute nodes) which read and

write data from a data server A cluster is the top-level unit which

Fig 7 Data structure programs and 1047297les associated with training by the neural network Item 1 represents an XML fora region (state) with the status (item 2) Core resources are

shown in item 3 Item 4 displays the status of each task (item 5) within a job

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 259

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119

is composed of nodes or single physical or logical computers with

one or more cores that include one or more processors All

modeling data was read and written to a storage machine located in

another building and transferred across an intranet with a

maximum of 1 Gigabit bandwidth

The data storage server was composed of 24 two terabyte

7200 RPM drives in a RAID 6 con1047297guration This server also had

Windows 2008 Server R2 installed Spot checks of resource moni-

toring showed that the HPC was not limited by network or disk

access and typically ran in bursts of 100 CPU utilization ArcGIS

100 with the Spatial Analyst extension was installed on all servers

Based on the results of the 1047297le number per folder and the use of

unique unit IDs as part of the 1047297le and directory-naming scheme we

used a hierarchical directory structure as shown in Fig 9 The upper

branches of the directory separate 1047297les into input and output di-

rectories and subfolders store data by type (ASC or PAT 1047297les)

location unit scale (national state) and for forecasts years and

scenarios

Fig 9 Directory structure for the LTM-HPC simulation

Fig 8 Computer systems involved in the LTM-HPC national simulations

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268260

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219

42 Preliminary tests

The primary limitation in 1047297le size comes from SNNS This limit

was reached in the probability map creation phase in several

western US counties when the RES1047297lewhichcontainsthe values

for all of the drivers (eg distance to urban etc) crashed To

overcome this issue we divided the country into grids that pro-

duced 1047297les that SNNS was capable of handling for the steps up to

and including pattern 1047297le creation which is done on a pixel-by-

pixel basis and is not spatially dependent For organization and

performance reasons1047297leswere grouped into folders by stateAs the

SNNS only uses the probability values in the projection phase we

were able to project at the county level

Early tests with mosaicking the entire country at once were

unsuccessful and led to mosaicking by state The number of states

and years of projection for each state made populating the tool

1047297elds in ArcGIS 100 Desktop a time intensive process We used

python scripts to overcome this issue and the HPC to process

multiple years and multiple states at the same time Although it is

possible to run one mosaic operation for each core we found that

running 24 operations on a machine led to corrupted mosaics We

attribute this to the large 1047297le sizes and limited scratch space

(approximately 200 GB) and to overcome this problem we limitedthe number of operations per server by specifying each task to 6

cores for most states and 12 cores for very large states such as CA

and TX

43 Data preparation for national simulation

We used ArcGIS 100 and Spatial Analyst to prepare 1047297ve inputs

for use in training and testing of the neural network Details of the

data preparation can be found elsewhere (Tayyebi et al 2012)

although a brief description of processing and the 1047297les that were

created follow We used the US Census 2000 road network line

work to create two road shape 1047297les highways and main arterials

We used ArcGIS 100 Spatial Analyst to calculate the distance that

each pixel was away from the nearest road Other inputs includeddistance to previous urban (circa 1990) distance to rivers and

streams distance to primary roads (highways) distance to sec-

ondary roads (roads) and slope

Preparing data for neural net training required the following

steps Land use data from approximately 1990 and 2000 were

collected from 18 different municipalities and 3 states These data

were derived from aerial photography by local government and

were thus deemed to be of high quality Original data were vector

and they were converted to raster using the simulation dimensions

described above Data from states were used to select regions in

rural areas using a random site selection procedure (described in

Tayyebi et al 2012)

Maps of public lands were obtained from ESRI Data Pack 2011

(ESRI 2011) Public land shape 1047297les were merged with locations of urban and open water in 1990 (using data from the USGS national

land cover database) and used to create the exclusionary layer for

the simulation Areas that were not located within the training area

were set to ldquono datardquo in ArcGIS Data from the US census bureau for

places is distributed as point location data We used the point lo-

cations (the centroid of a town city or village) to construct Thiessen

polygons representing the area closest to a particular urban center

(Fig 10) Each place was labeled with the FIPS designated census

place value

We executed the national LTM-HPC at three different spatial

scales and using two different kinds of spatial units ( Tayyebi et al

2012) government and 1047297xed-size tiles The three scales for our

government unit simulations were national county and places

(cities villages and towns)

All input maps were created at a national scale at 30m cell

resolution For training data were subset using ArcGIS on the local

computer workstation and pattern 1047297les created for training and

1047297rst phase testing (ie calibration) We also used the LTM-clippy

Python script to create subsamples for second phase testing In-

puts and the exclusionary maps were clipped by census place and

then written out as ASC 1047297les The createpat 1047297le was executed per

census place to convert the 1047297les from ASC to PAT

44 Pattern recognition simulations for national model

We presented a training 1047297le with 284477 cases (ie records or

locations) to the neural network using a feedforward back propa-

gation algorithm We followed the MSE during training saving this

value every 100 cycles We found that the minimum MSE stabilized

globally at 49500 cycles The SNNS network 1047297le (NET 1047297le) was

produced every 100 cycles so that we could analyze the training

later but the network 1047297le for 49500 cycles was saved and used to

estimate output (iepotential fora land usechange to occur at each

location) for testing

Testing occurred at the scale of tiles The LTM-clippy script was

used to create testing pattern 1047297les for each of the 634 tiles The

ltm49500nNET 1047297le was applied to each tile PAT 1047297le to create an

RES 1047297le for each tile RES 1047297les contain estimates of the potential

for each location to change land use (values 00 to 10 where closer

Fig 10 Spatial units involved in the LTM-HPC national simulation

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 261

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1319

to 10 means higher chance of changing) The RES 1047297les are con-

verted to ASC 1047297les using a C program called convert2asciiexe The

ASC probability maps for all tiles were mosaicked to a national

raster 1047297le using an ArcGIS python script All original values which

range from 00 to 10 are multiplied by 100000 by convert2ascii so

that they may be stored as double precision integer

We used three-digit codes as unique numbers for naming tile

1047297les and tracking them as tasks within states on HPC (634 grids

in conterminous of USA) Each tile contained a maximum of

4000 rows and 4000 columns of 30m pixels We were able to do

this because the steps leading up to prediction work on a per

pixel basis and thus the processing unit did not affect the output

value

45 Calibration of the national simulation

We trained on six neural network versions of the model one

that contained 1047297ve input variables and 1047297ve that contained four

input variables each where we dropped out one input variable from

the full input model We saved the MSE at each 100 cycles through

100000 cycles and then calculated the percent difference of MSE

from the full input variable model (Fig 11) Note that all of thevariables have a positive contribution to model goodness of 1047297t

during training distance to highways provides the neural network

with the most information necessary for it to 1047297t input and output

data This plot also illustrates how the neural network behaves

between 0 cycles and approximately cycle 23000 the neural

network makes large adjustments in weights and values for acti-

vation function and biases At one point around 7000 cycles the

model does better (ie percentage difference in MSE is negative)

without distance to streams as an input to the training data

Eventually all drop one out models stabilize near 50000 which is

where the full 1047297ve-variable model also stabilizes At this number of

training cycles distance to highway contributes about 2 of the

goodness of 1047297t distance to urban about 15 slope about 12 and

distance to road and distance to streams each about 07 Weconclude from this drop one out calibration that (1) all 1047297ve

variables contribute in a positive way toward the goodness of 1047297t

and (2) that 49500 cycles provide enough learning of the full 1047297ve-

variable model to use for validation

The second step of calibration is to examine how well the model

produces spatial maps of change compared to the observed data

(eg Fig 5A) We use the locations of observed change from the

training map that are outside the training locations to create a

01234-coded calibration map The XML_Clip_BASE HPC jobs 1047297le

was modi1047297ed to receive the 01234-coded calibration map and

general statistics (eg percentage of each value) are created for the

entire simulation domain and for smaller subunits (eg spatial

units)

46 Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so it was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.
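In the ArcGIS 10.x Python environment, this per-tile tabulation looks roughly like the following sketch (the workspace, paths, and field names are illustrative assumptions):

    import arcpy
    from arcpy.sa import TabulateArea

    arcpy.CheckOutExtension("Spatial")
    arcpy.env.workspace = r"D:\ltm\validation"  # hypothetical workspace

    # Cross-tabulate the area of each validation code (0-3) within each
    # simulation tile; 'TILE_ID' and the file names are illustrative.
    TabulateArea("tiles.shp", "TILE_ID",
                 "valid_codes_2006.tif", "VALUE",
                 "tile_code_areas.dbf", 30)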

The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric (PCM) was saved at the 101 × 101 window size (i.e., 3 km by 3 km) and the PCM values were merged with the shapefile for tiles.
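One common formulation of such a scaleable-window percent correct metric is sketched below with NumPy; the paper does not give its exact routine, so this window-level counting rule is an assumption:

    import numpy as np

    def window_pcm(observed, predicted, w):
        # Percent correct within non-overlapping w-by-w windows: predicted
        # change cells in a window count as correct up to the number of
        # observed change cells there (one common scalable-window rule).
        rows = (observed.shape[0] // w) * w
        cols = (observed.shape[1] // w) * w
        obs = observed[:rows, :cols].reshape(rows // w, w, cols // w, w)
        pred = predicted[:rows, :cols].reshape(rows // w, w, cols // w, w)
        obs_n = obs.sum(axis=(1, 3))
        pred_n = pred.sum(axis=(1, 3))
        correct = np.minimum(obs_n, pred_n).sum()
        return 100.0 * correct / max(obs_n.sum(), 1)

    # Tiny synthetic demo at the 101 x 101 (about 3 km) window size
    rng = np.random.default_rng(0)
    o = rng.random((400, 400)) < 0.05
    p = rng.random((400, 400)) < 0.05
    print(window_pcm(o, p, 101))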

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.

Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. Our quantity model (Tayyebi et al., 2012) contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files; if any task in a job fails, the standard out and standard error files provide a record of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
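A sketch of how such per-state jobs files with per-task output paths can be generated follows; the element and attribute names are illustrative rather than the exact Windows HPC Server 2008 job schema, and the executable name is hypothetical:

    import xml.etree.ElementTree as ET

    def make_job_xml(state, polygon_ids, out_path):
        job = ET.Element("Job", Name="LTM_Pred_%s" % state)
        tasks = ET.SubElement(job, "Tasks")
        for pid in polygon_ids:
            ET.SubElement(
                tasks, "Task",
                CommandLine="predict.exe %s %s" % (state, pid),  # hypothetical
                WorkDirectory=r"D:\ltm\pred\%s" % state,
                StdOutFilePath=r"D:\ltm\logs\%s_%s.out" % (state, pid),
                StdErrFilePath=r"D:\ltm\logs\%s_%s.err" % (state, pid))
        ET.ElementTree(job).write(out_path)

    make_job_xml("IN", ["001", "002", "003"], "XML_Pred_IN.xml")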

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic constraints of that large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5. Discussion

We presented an overview of a single-workstation land change model that has been converted to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger-sized land change models, especially ones that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that, to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time, rather than reading large files into a large array. This was needed despite the large amount of memory contained in the HPC.

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004, 2008). These include steps for (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed: this indicates that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executables fail during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.
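As a sketch, the standard-error files named in the job description can be scanned to triage failures (the directory layout and file suffixes are assumptions):

    import glob, os

    def failed_tasks(log_dir):
        # Non-empty stderr files flag suspect tasks for closer inspection
        failures = []
        for err in glob.glob(os.path.join(log_dir, "*.err")):
            if os.path.getsize(err) > 0:
                with open(err) as fh:
                    failures.append((os.path.basename(err),
                                     fh.readline().strip()))
        return failures

    for name, first_line in failed_tasks(r"D:\ltm\logs"):
        print(name, "->", first_line)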

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
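The heart of that process can be sketched as follows in the ArcGIS Python environment, using the standard mosaic geoprocessing tool (paths and raster names are illustrative):

    import arcpy

    state = "IN"
    arcpy.env.workspace = r"D:\ltm\pred\%s" % state
    tiles = arcpy.ListRasters("pred_2050_*")  # per-polygon prediction pieces

    # Mosaic the pieces for one state and year into a single raster
    arcpy.MosaicToNewRaster_management(
        tiles, r"D:\ltm\mosaics", "%s_2050.tif" % state,
        pixel_type="32_BIT_SIGNED", number_of_bands=1)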

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance around this scaling is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage; data processing; and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output at multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse-resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses such as urban and agriculture increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and to the Long-Term Hydrologic Impact Assessment (L-THIA) model, to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outlined above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; these might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that, because such a large area may be heterogeneous, several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule (see the sketch after this paragraph). The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
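One illustrative way to organize such rules, purely as a sketch of the idea rather than an implemented LTM-HPC feature (all names are hypothetical):

    # Each transition is governed by its own rule; a higher-level rule
    # selects which transition set applies in a given region.
    TRANSITIONS = {
        ("forest", "agriculture"): "suitability_forest_loss",
        ("agriculture", "urban"):  "suitability_ag_to_urban",
        ("forest", "urban"):       "suitability_forest_to_urban",
        ("shrubland", "forest"):   "succession_rule",
    }

    def rules_for_region(region):
        # Hypothetical higher-level rule for a heterogeneous domain
        if region == "midwest":
            return {k: v for k, v in TRANSITIONS.items() if "urban" in k}
        return TRANSITIONS

    print(sorted(rules_for_region("midwest")))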

The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem Conference (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on that continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and (5) measures of data input quality.

We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States; patterns of location error differed from those of quantity, being lower in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in the computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial–temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. In: HPDC 1992, pp. 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stringer, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate–land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN) 2010, pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: conceptual elements, spatial object class hierarchies, GIS command syntax and an application for Michigan's Saginaw Bay watershed. In: Land Use Modeling Workshop, USGS EROS Data Center, June 1997.


spatially heterogeneous needs to be taken into consideration

Current model goodness of 1047297t metrics have not been designed to

consider large big data simulationssuch as the one presented herethus more research in this area is needed to make a full assessment

of how well a model like this performs

6 Conclusions

This paper presents the application of the LTM-HPC at multi-

scale using quantity drivers (a 1047297ne-scale urban land use change

model applied across the conterminous of USA) and introduces a

new version of LTM with substantially augmented functionality We

described a parallel implementation of the data and modeling

process on a cluster of multi-core processors using HPC as a data-

parallel programming framework We focus on ef 1047297ciently

handling the challenges raised by the nature of large datasets and

show how they can be addressed effectively within the computa-tion framework by optimizing the computation to adapt to the

nature of the data We signi1047297cantly enhance the training and

testing run of the LTM and enable application of the model for

region scale such as continent Future research will also be able to

use the new information generated by the LTM-HPC to address

questions related to how urban patterns relate to the process of

urban land use change Because we were able to preserve the high-

resolution of the land usedata (30m resolution) LTM-HPC provided

the capability of visualizing alternative future scenarios at a

detailed scale which helped to engage urban planner in the sce-

nario development process We believe this project represents an

important advancement in computational modeling of urban

growth patterns In terms of simulation modeling we have pre-

sented several new advancements in the LTM modelrsquos performance

and capabilities More importantly however this project repre-

sents a successful broad-scale modeling framework that has direct

applications to land use management

Finally we found that the LTM-HPC has some signi1047297cant ad-

vantages over the single workstation version of the LTM These

include

(1) Automated data preparation data can now be clipped and

converted to ASCII format automatically at the state county

or any other division using unique identity for the unit in

Python environment

(2) Better memory usage The source code for the model in C

environment has been changed making calculations per-

formedby LTM-HPCcompletelyindependent from the size of

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268266

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1819

the ASCII 1047297les by reading each code line separately into an

array using the C environment

(3) Ability to conduct simultaneous analyses LTM was not

designed to be used for different regions at the same time

LTM-HPC now uses a unique code for different regions in

XML format and can repeat all the processes simultaneously

for different regions

(4) Increased processing speed The previous version of LTM had

many disconnected steps (Pijanowski et al 2002a) which

were carried out sequentially using different DOS-level

commands All XML 1047297les are now uploaded into an HPC

environment and all modeling steps are automatically

processed

References

Adeloye AJ Rustum R Kariyama ID 2012 Neural computing modeling of thereference crop evapotranspiration Environ Model Softw 29 61e73

Anselme B Bousquet F Lyet A Etienne M Fady B Le Page C 2010 Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France)Environ Model Softw 25 (11) 1385e1398

Bennett ND Croke BFW Guariso G Guillaume JHA Hamilton SH Jakeman AJ Marsili-Libelli S Newhama LTH Norton JP Perrin CPierce SA Robson B Seppelt R Voinov AA Fath BD Andreassian V 2013Environ Model Softw 40 1e20

Bertram P Stadler-Salt N Horvatin P Shear H 2003 Bi-national assessment of the Great Lakes SOLEC partnerships Environ Monit Assess 81 (1e3) 27e33

Bishop CM 1995 Neural Networks for Pattern Recognition Oxford UniversityPress Oxford

Bishop CM 2005 Neural Networks for Pattern Recognition Oxford UniversityPress ISBN 0-19-853864-2

Bonan GB 2008 Forests and climate change forcings feedbacks and the climatebene1047297ts of forests Science 320 (5882) 1444e1449

Boutt DF Hyndman DW Pijanowski BC Long DT 2001 Identifying potentialland use-derived solute sources to stream base1047298ow using ground water modelsand GIS Ground Water 39 (1) 24e34

Burton A Kilsby C Fowler H Cowpertwait P OrsquoConnell P 2008 RainSim aspatial-temporal stochastic rainfall modeling system Environ Model Softw 23(12) 1356e1369

Buyya R (Ed) 1999 High Performance Cluster Computing Architectures andSystems vol 1 Prentice Hall Englewood Cliffs NJ

Carpani M Bergez JE Monod H 2012 Sensitivity analysis of a hierarchicalqualitative model for sustainability assessment of cropping systems EnvironModel Softw 27e28 15e22

Chapman T 1998 Stochastic modelling of daily rainfall the impact of adjoiningwet days on the distribution of rainfall amounts Environ Model Softw 13 (3e4) 317e324

Cheung AL Reeves Anthony P 1992 High performance computing on a cluster of workstations HPDC 1992 152e160

Clarke KC Gazulis N Dietzel C Goldstein NC 2007 A decade of SLEUTHing lessons learned from applications of a cellular automatonland use change model In Classics from IJGIS Twenty Years of theInternational Journal of Geographical Information Systems and Sciencepp 413e425

Cox PM Betts RA Jones CD Spall SA Totterdell IJ 2000 Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate modelNature 408 (6809) 184e187

Dale VH 1997 The relationship between land-use change and climate changeEcol Appl 7 (3) 753e769

DavisAYPijanowskiBCRobinson KD KidwellPB 2010aEstimating parkinglot

footprintsin theUpperGreatLakes regionof theUSA Landsc UrbanPlan 96 (2)68e77

Davis AY Pijanowski BC Robinson K Engel B 2010b The environmental andeconomic costs of sprawling parking lots in the United States Land Use Policy27 (2) 255e261

Denoeux T Lengelleacute1993 Initializing back propagation networks with prototypesNeural Netw 6 351e363

Dietzel C Clarke K 2006 The effect of disaggregating land use categories incellular automata during model calibration and forecasting Comput EnvironUrban Syst 30 (1) 78e101

Dietzel C Clarke KC 2007 Toward optimal calibration of the SLEUTH land usechange model Trans GIS 11 (1) 29e45

Dixon RK Winjum JK Andrasko KJ Lee JJ Schroeder PE 1994 Integratedland-use systems assessment of promising agroforest and alternative land-usepractices to enhance carbon conservation and sequestration Clim Change 27(1) 71e92

Dlamini W 2008 A Bayesian belief network analysis of factors in1047298uencing wild1047297reoccurrence in Swaziland Environ Model Softw 25 (2) 199e208

ESRI 2011 ArcGIS 10 Software

Fei S Kong N Stinger J Bowker D 2008 In Ravinder K Jose S Singh HBatish D (Eds) Invasion Pattern of Exotic Plants in Forest Ecosystems InvasivePlants and Forest Ecosystems CRC Press Boca Raton FL pp 59e70

Fitzpatrick M Long D Pijanowski B 2007 Biogeochemical 1047297ngerprints of landuse in a regional watershed Appl Biogeochem 22 1825e1840

Foster I Kesselman C 1997 Globus a metacomputing infrastructure toolkit Int JSupercomput Appl 11 (2) 115e128

Foster DR Hall B Barry S Clayden S Parshall T 2002 Cultural environmentaland historical controls of vegetation patterns and the modern conservationsetting on the island of Martha rsquos Vineyard USA J Biogeogr 29 1381e1400

GLP 2005 Science Plan and Implementation Strategy IGBP Report No 53IHDPReport No 19 IGBP Secretariat Stockholm 64 pp

Grimm NB Redman CL 2004 Approaches to the study of urban ecosystems thecase of Central ArizonadPhoenix Urban Ecosyst 7 (3) 199e213

Guo LB Gifford RM 2002 Soil carbon stocks and land use change a metaanalysis Glob Change Biol 8 (4) 345e360

Herold M Goldstein NC Clarke KC 2003 The spatiotemporal form of urbangrowth measurement analysis and modeling Remote Sens Environ 86 (3)286e302

Herold M Couclelis H Clarke KC 2005 The role of spatial metrics in the analysisand modeling of urban land use change Comput Environ Urban Syst 29 (4)369e399

Hey AJ 2009 The Fourth Paradigm Data-intensive Scienti1047297c Discovery Jacobs A 2009 The pathologies of big data Commun ACM 52 (8) 36e44Kampe TU Johnson BR Kuester M Keller M 2010 NEON the 1047297rst continental-

scale ecological observatory with airborne remote sensing of vegetation canopybiochemistry and structure J Appl Remote Sens 4 (1) 043510e043510

Kaye JP Groffman PM Grimm NB Baker LA Pouyat RV 2006 A distincturban biogeochemistry Trends Ecol Evol 21 (4) 192e199

Kilsby C Jones P Burton A Ford A Fowler H Harpham C James P Smith AWilby R 2007 A daily weather generator for use in climate change studiesEnviron Model Softw 22 (12) 1705e1719

Lagabrielle E Botta A Dareacute W David D Aubert S Fabricius C 2010 Modelingwith stakeholders to integrate biodiversity into land-use planning lessonslearned in Reacuteunion Island (Western Indian Ocean) Environ Model Softw 25(11) 1413e1427

Lambin EF Geist HJ (Eds) 2006 Land Use and Land Cover Change Local Pro-cesses and Global Impacts Springer

LaValle S Lesser E Shockley R Hopkins MS Kruschwitz N 2011 Big data analytics and the path from insights to value MIT Sloan Manag Rev 52 (2)21e32

Lei Z Pijanowski BC Alexandridis KT Olson J 2005 Distributed modelingarchitecture of a multi-agent-based behavioral economic landscape (MABEL)model Simulation 81 (7) 503e515

Loepfe L Martiacutenez-Vilalta J Pintildeol J 2011 An integrative model of humanin1047298uenced 1047297re regimes and landscape dynamics Environ Model Softw 26 (8)1028e1040

Lynch C 2008 Big data how do your data grow Nature 455 (7209) 28e

29Mas JF Puig H Palacio JL Sosa AA 2004 Modeling deforestation using GIS andarti1047297cial neural networks Environ Model Softw 19 (5) 461e471

MEA Millennium Ecosystem Assessment 2005 Ecosystems and Human Well-being Current State and Trends Island Press Washington DC

Merritt WS Letcher RA Jakeman AJ 2003 A review of erosion and sedimenttransport models Environ Model Softw 18 (8e9) 761e799

Mishra V Cherkauer K Niyogi D Ming L Pijanowski B Ray D Bowling L2010 Regional scale assessment of land useland cover and climatic changes onsurface hydrologic processes Int J Climatol 30 2025e2044

Moore N Torbick N Lofgren B Wang J Pijanowski B Andresen J Kim DOlson J 2010 Adapting MODIS-derived LAI and fractional cover into theRegional Atmospheric Modeling System (RAMS) in East Africa Int J Climatol30 (3) 1954e1969

Moore N Alargaswamy G Pijanowski B Thornton P Lofgren B Olson JAndresen J Yanda P Qi J 2011 East African food security as in1047298uenced byfuture climate change and land use change at local to regional scales ClimChange httpdxdoiorg101007s10584-011-0116-7

Olson J Alagarswamy G Andresen J Campbell D Davis A Ge J Huebner M

Lofgren B Lusch D Moore N Pijanowski B Qi J Thornton P Torbick NWang J 2008 Integrating diverse methods to understand climate-land in-teractions in east Africa GeoForum 39 (2) 898e911

Pekin BK Pijanowski BC 2012 Global land use intensity and the endangermentstatus of mammal species Divers Distrib 18 (9) 909e918

Peralta J Li X Gutierrez G Sanchis A 2010 July Time series forecasting byevolving arti1047297cial neural networks using genetic algorithms and differentialevolution Neural Netw e IJCNN 1e8

Peacuterez-Vega A Mas JF Ligmann A 2012 Comparing two approaches to land usecover change modeling and their implications for the assessment of biodiver-sity loss in a deciduous tropical forest Environ Model Softw 29 11e23

Pickett ST Burch Jr WR Dalton SE Foresman TW Grove JM Rowntree R1997 A conceptual framework for the study of human ecosystems in urbanareas Urban Ecosyst 1 (4) 185e199

Pielke RA 2005 Land use and climate change Science 310 (5754) 1625e1626Pijanowski BC Long DT Gage SH Cooper WE 1997 June A Land Trans-

formation Model Conceptual Elements Spatial Object Class Hierarchies GISCommand Syntax and an Application for Michiganrsquos Saginaw Bay Watershed InSubmitted to the Land Use Modeling Workshop USGS EROS Data Center

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919

Page 11: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119

is composed of nodes, or single physical or logical computers, each with one or more processors containing one or more cores. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum bandwidth of 1 Gigabit.

The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.

Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file- and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.

Fig. 8. Computer systems involved in the LTM-HPC national simulations.

Fig. 9. Directory structure for the LTM-HPC simulation.
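Such a scheme lends itself to deterministic path construction in scripting. The sketch below is a minimal illustration assuming the Fig. 9 layout; the root share and folder names are hypothetical, not taken from our storage server.

```python
# Minimal sketch of composing file paths from unique unit IDs under a
# Fig. 9-style hierarchy. Root share and folder names are hypothetical.
from pathlib import Path

ROOT = Path(r"\\storage-server\ltm")  # hypothetical storage share

def unit_path(direction, file_type, scale, unit_id, year=None, scenario=None):
    """direction: 'input'/'output'; file_type: 'ASC'/'PAT';
    scale: 'national'/'state'; forecasts add a year and scenario."""
    p = ROOT / direction / file_type / scale / str(unit_id)
    if year is not None:
        p = p / str(year) / (scenario or "baseline")
    return p

# e.g. the output ASC folder for state unit 'IN' in a 2050 forecast
print(unit_path("output", "ASC", "state", "IN", year=2050, scenario="base"))
```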


4.2. Preliminary tests

The primary limitation on file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and used the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
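This throttling rule is straightforward to encode in a driver script. The sketch below is hypothetical (the mosaic_raster.py worker and its flags are invented for illustration), but it shows how capping concurrent tasks keeps the total requested cores within a 24-core server.

```python
# Hypothetical throttle: at most four 6-core mosaic tasks (or two
# 12-core tasks for large states) run at once on a 24-core server.
from concurrent.futures import ThreadPoolExecutor
import subprocess

CORES_PER_SERVER = 24
BIG_STATES = {"CA", "TX"}  # states given 12 cores instead of 6

def mosaic_state(state, cores):
    """Run one state-level mosaic as an external process (invented CLI)."""
    subprocess.run(["python", "mosaic_raster.py", "--state", state,
                    "--cores", str(cores)], check=True)

def run_all(states):
    small = [s for s in states if s not in BIG_STATES]
    big = [s for s in states if s in BIG_STATES]
    # worker caps chosen so requested cores never exceed the server total
    with ThreadPoolExecutor(max_workers=CORES_PER_SERVER // 6) as pool:
        list(pool.map(lambda s: mosaic_state(s, 6), small))
    with ThreadPoolExecutor(max_workers=CORES_PER_SERVER // 12) as pool:
        list(pool.map(lambda s: mosaic_state(s, 12), big))
```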

4.3. Data preparation for national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shapefiles: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance from each pixel to the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
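For readers reproducing such drivers without ArcGIS, a distance-to-feature grid can be approximated with an off-the-shelf Euclidean distance transform; the sketch below uses SciPy and is not the workflow used in the paper.

```python
# Sketch: distance-to-road driver from a rasterized road mask, using
# SciPy's Euclidean distance transform instead of Spatial Analyst.
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_to_feature(feature_mask, cell_size=30.0):
    """Meters from each cell to the nearest True cell (30 m pixels)."""
    # distance_transform_edt measures distance to the nearest zero,
    # so invert the mask; multiply pixel distances by the cell size.
    return distance_transform_edt(~feature_mask) * cell_size

roads = np.zeros((5, 5), dtype=bool)
roads[2, 2] = True                     # one road cell in a toy grid
print(distance_to_feature(roads))      # 0.0 at the road, 30.0 adjacent, ...
```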

Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with its FIPS-designated census place value.
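The Thiessen construction itself is standard computational geometry; a small sketch using SciPy with toy coordinates follows (the labels are FIPS-like placeholders, not real census codes, and unbounded cells would be clipped to the national boundary in practice).

```python
# Sketch: Thiessen (Voronoi) polygons around place centroids with SciPy.
import numpy as np
from scipy.spatial import Voronoi

places = np.array([[0.0, 0.0], [4.0, 1.0], [1.0, 5.0], [6.0, 6.0]])
labels = ["0000001", "0000002", "0000003", "0000004"]  # placeholder codes
vor = Voronoi(places)

# Input point i owns the region vor.regions[vor.point_region[i]];
# a -1 vertex index marks an unbounded edge (clip it in practice).
for i, fips in enumerate(labels):
    region = vor.regions[vor.point_region[i]]
    verts = [vor.vertices[v].round(2).tolist() for v in region if v != -1]
    print(fips, verts)
```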

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat program was executed per census place to convert the files from ASC to PAT.

4.4. Pattern recognition simulations for national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
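The checkpointing scheme can be illustrated with a self-contained stand-in for the SNNS run: record the MSE every 100 cycles and keep the cycle with the lowest error, which is the network later used for testing. The one-layer model below is a toy, not the LTM network.

```python
# Toy stand-in for the SNNS training loop: checkpoint MSE every 100
# cycles and keep the best cycle (49,500 in the national run).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(284, 5))                       # stand-in driver values
y = (X @ np.array([2.0, 1.5, 1.2, 0.7, 0.7]) > 0).astype(float)

w = np.zeros(5)
history, best_cycle, best_mse = [], None, np.inf
for cycle in range(1, 5001):
    pred = 1.0 / (1.0 + np.exp(-(X @ w)))           # sigmoid output
    w -= 0.01 * X.T @ (pred - y) / len(y)           # gradient step
    if cycle % 100 == 0:                            # checkpoint interval
        mse = float(np.mean((pred - y) ** 2))
        history.append((cycle, mse))                # analyzed after the run
        # a real run would also write a network snapshot here, e.g. ltm{cycle}.NET
        if mse < best_mse:
            best_cycle, best_mse = cycle, mse
print(best_cycle, round(best_mse, 4))
```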

Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile's PAT file to create a RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 indicate a higher chance of changing). The RES files are converted to ASC files using a C# program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as integers.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.
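A line-oriented re-implementation of this scaling step might look as follows; the file layout is illustrative, and this is not the actual convert2ascii source.

```python
# Sketch of the RES -> ASC scaling: stream one row at a time and write
# each probability multiplied by 100,000 as an integer.
def convert_to_ascii(res_path, asc_path, scale=100_000):
    with open(res_path) as src, open(asc_path, "w") as dst:
        for line in src:                              # one row in memory
            ints = (str(int(round(float(v) * scale))) for v in line.split())
            dst.write(" ".join(ints) + "\n")
```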

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
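The tiling and coding scheme can be sketched as follows, with assumed grid dimensions standing in for the national raster.

```python
# Sketch: cut an n_rows x n_cols raster into <= 4000 x 4000 tiles and
# assign each a three-digit code for task tracking on the HPC.
def tile_codes(n_rows, n_cols, tile=4000):
    tiles, code = {}, 0
    for r0 in range(0, n_rows, tile):
        for c0 in range(0, n_cols, tile):
            code += 1
            # window: (row offset, col offset, height, width)
            tiles[f"{code:03d}"] = (r0, c0, min(tile, n_rows - r0),
                                    min(tile, n_cols - c0))
    return tiles

print(list(tile_codes(10_000, 9_000)))   # toy grid -> codes '001'..'009'
```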

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, in which we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full-input-variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 and approximately 23,000 cycles, the neural network makes large adjustments to the weights and to the values of the activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute in a positive way toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
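The drop-one-out comparison can be expressed compactly. In the sketch below, a trivial least-squares fit stands in for a full SNNS training run; the driver names follow the five inputs listed above, and the synthetic data are for illustration only.

```python
# Sketch of drop-one-out calibration: percent difference in MSE of each
# four-driver model relative to the full five-driver model.
import numpy as np

DRIVERS = ["dist_highway", "dist_urban", "slope", "dist_road", "dist_stream"]

def train_mse(X, y):
    """Stand-in trainer: MSE of a least-squares linear fit."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, len(DRIVERS)))
y = X @ np.array([0.8, 0.5, 0.4, 0.2, 0.2]) + rng.normal(scale=0.1, size=500)

full = train_mse(X, y)
for i, name in enumerate(DRIVERS):
    reduced = train_mse(np.delete(X, i, axis=1), y)
    print(f"without {name}: {100 * (reduced - full) / full:+.1f}% MSE vs full")
```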

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
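A coded agreement map of this kind can be derived from boolean observed and predicted change rasters, as sketched below; the specific code assignments are our assumption for illustration, since the text does not enumerate them here.

```python
# Sketch of a 0-4 coded calibration/validation map (code meanings are
# assumed for illustration: 0 none, 1 miss, 2 false alarm, 3 hit,
# 4 excluded).
import numpy as np

def coded_map(observed, predicted, excluded):
    code = np.zeros(observed.shape, dtype=np.uint8)  # 0: no change either
    code[observed & ~predicted] = 1                  # observed only
    code[~observed & predicted] = 2                  # predicted only
    code[observed & predicted] = 3                   # both
    code[excluded] = 4                               # exclusionary zone
    return code

obs = np.array([[0, 1], [1, 0]], dtype=bool)
pred = np.array([[0, 1], [0, 1]], dtype=bool)
m = coded_map(obs, pred, np.zeros((2, 2), dtype=bool))
# per-code percentages, as reported per simulation unit
print({c: 100 * float((m == c).mean()) for c in range(5)})
```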

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness-of-fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so it was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles, along coastal areas, contained errors in the quantity of urban greater than 5%.

The second goodness-of-fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric (PCM) was saved at the 101 × 101 window size (i.e., 3 km by 3 km) and the PCM values were merged with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.
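The scaleable-window idea can be illustrated with a simplified, self-contained score; the windowing and agreement formula below are assumptions for illustration, not the paper's XML_Scaleable routine.

```python
# Simplified scaleable-window percent-correct: compare urban counts in
# non-overlapping w x w windows rather than cell by cell (assumed form).
import numpy as np

def windowed_percent_correct(observed, predicted, w):
    rows, cols = observed.shape
    scores = []
    for r in range(0, rows - w + 1, w):
        for c in range(0, cols - w + 1, w):
            obs = int(observed[r:r + w, c:c + w].sum())
            pred = int(predicted[r:r + w, c:c + w].sum())
            scores.append(1.0 - abs(obs - pred) / max(obs, pred, 1))
    return 100.0 * float(np.mean(scores))

rng = np.random.default_rng(2)
obs = rng.random((303, 303)) < 0.05
pred = rng.random((303, 303)) < 0.05
for w in (3, 33, 101):          # agreement typically rises with window size
    print(w, round(windowed_percent_correct(obs, pred, w), 1))
```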

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon at 10-year time steps from 2010 to 2060. We considered each state as a job, with all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
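Such per-state jobs files can be generated programmatically. The following is a hypothetical sketch of such a generator; the element and attribute names are illustrative only (they do not reproduce the Windows HPC job schema), and the executable and UNC paths are invented.

```python
# Hypothetical generator for a per-state jobs file: one job per state,
# one task per polygon, with stdout/stderr paths embedded per task.
import xml.etree.ElementTree as ET

def build_state_job(state, polygon_ids):
    job = ET.Element("Job", Name=f"LTM_Pred_{state}")
    for poly in polygon_ids:
        ET.SubElement(job, "Task",
                      CommandLine=f"ltm_predict.exe {state} {poly}",
                      StdOutFilePath=rf"\\storage\ltm\out\{state}\{poly}.out",
                      StdErrFilePath=rf"\\storage\ltm\out\{state}\{poly}.err")
    return ET.ElementTree(job)

build_state_job("IN", ["unit001", "unit002"]).write("XML_Pred_IN.xml")
```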

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic limitations of that large metropolitan area, whereas dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models, such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that, to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C# had to read and write lines of data one at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
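The same streaming pattern, shown here in Python rather than C#, keeps memory use independent of grid size:

```python
# Sketch: process an ASCII grid one line (row) at a time so memory use
# does not depend on file size.
def threshold_probabilities(in_path, out_path, cutoff):
    """Write 1 where the scaled probability exceeds cutoff, else 0."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:                    # one row in memory at a time
            row = ("1" if int(v) > cutoff else "0" for v in line.split())
            dst.write(" ".join(row) + "\n")
```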

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004, 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job failed: one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is the inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
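A post-run sweep of the standard-error files, as described above, is enough to flag failed tasks; the directory layout here is hypothetical.

```python
# Sketch: report tasks whose standard-error files are non-empty.
from pathlib import Path

def failed_tasks(err_dir):
    return [p.stem for p in Path(err_dir).glob("*.err") if p.stat().st_size > 0]

print(failed_tasks(r"\\storage\ltm\out\IN"))   # e.g. ['unit002']
```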

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the time required relative to the LTM is approximately halved for every doubling of cores, with variance in processing time caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output at multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at coarse resolutions, or alternatively at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outlined above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that, because such a large area may be heterogeneous, several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents in aggregate but are individually small areas (Davis et al., 2010a,b); such an application would require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.

of a threatened category (Pekin and Pijanowski 2012) and soglobal scale simulations are likely to need models like LTM-HPC

The LTM-HPC could also support national or regional scale

environmental programmatic assessments that are becoming more

common supported by national government agencies These

include the 2013 United States National Climate Assessment Pro-

gram (USGCRP 2013) National Ecological Observation Network

(NEON) supported in the United States by the National Science

Foundation (Schimel et al 2007 Kampe et al 2010) and the Great

Lakes RestorationInitiative which seeks to develop State of the Lake

Ecosystem metrics of ecosystem services (SOLEC Bertram et al

2003 WHCEC 2010) In Europe an EU15 set of land use forecasts

have been used extensively to study the impacts of land use and

climate change on this continentrsquos ecosystem services (cf

Rounsevell et al 2006)

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine-resolution accuracy; (2) a quantification of the variability of fine-resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, whereas patterns of location error differed: location errors were lower in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness-of-fit metrics have not been designed with large big data simulations such as the one presented here in mind; thus more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, up to that of a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in the computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for each unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C# environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry? Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate–land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., July 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., June 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919

Page 12: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219

4.2. Preliminary tests

The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban), grew large enough that SNNS crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue, and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
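To illustrate this throttling strategy, the sketch below assigns per-task core counts so that a 24-core node runs at most four concurrent mosaics (or two for the largest states); the helper names and submission mechanics are illustrative assumptions, as the actual jobs were described in XML for the Windows HPC scheduler.

# Sketch: throttle concurrent mosaic operations by requesting more cores
# per task. Helper names are hypothetical; the actual workflow expressed
# these tasks as XML job files for the Windows HPC 2008 R2 scheduler.
LARGE_STATES = {"CA", "TX"}  # very large states receive 12 cores per task

def cores_for_state(state):
    # 6 cores/task caps a 24-core node at 4 concurrent mosaics;
    # 12 cores/task caps it at 2 for the largest states.
    return 12 if state in LARGE_STATES else 6

def build_mosaic_tasks(states, years):
    tasks = []
    for state in states:
        for year in years:
            tasks.append({
                "name": f"mosaic_{state}_{year}",
                "cores": cores_for_state(state),
                "command": f"python mosaic_state.py {state} {year}",
            })
    return tasks

if __name__ == "__main__":
    for task in build_mosaic_tasks(["IN", "CA"], [2010, 2020]):
        print(task)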

4.3. Data preparation for national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network linework to create two road shapefiles: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance from each pixel to the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
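A minimal arcpy sketch of this distance-driver preparation is shown below; the paths and dataset names are illustrative assumptions rather than the project's actual layout.

# Sketch: derive a distance-to-roads driver with ArcGIS 10.x Spatial
# Analyst. Paths and dataset names are illustrative assumptions.
import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")              # Spatial Analyst license
arcpy.env.cellSize = 30                         # 30 m simulation resolution
arcpy.env.extent = r"D:\ltm\national_mask.tif"  # national analysis extent

# Euclidean distance from every cell to the nearest highway feature.
dist_highways = EucDistance(r"D:\ltm\roads\highways.shp")
dist_highways.save(r"D:\ltm\drivers\dist_highways.tif")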

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).

Maps of public lands were obtained from ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.

We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
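The per-place clip-and-convert step can be pictured as the following loop; the script and executable names are hypothetical stand-ins for LTM-clippy and createpat.

# Sketch: clip the national drivers by census place and convert the
# resulting ASC grids to an SNNS pattern (PAT) file. Script and
# executable names are stand-ins for LTM-clippy and createpat.
import subprocess
from pathlib import Path

DRIVERS = ["dist_highways", "dist_roads", "dist_urban", "dist_streams", "slope"]

def prepare_place(place_fips, out_dir):
    place_dir = Path(out_dir) / place_fips
    place_dir.mkdir(parents=True, exist_ok=True)
    for driver in DRIVERS:
        # Clip one national driver to this place's Thiessen polygon and
        # write it out as an ESRI ASCII grid.
        subprocess.run(
            ["python", "ltm_clippy.py", driver, place_fips, str(place_dir)],
            check=True,
        )
    # Convert the clipped ASC grids into a single PAT file for SNNS.
    subprocess.run(["createpat.exe", str(place_dir)], check=True)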

4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate the output (i.e., the potential for a land use change to occur at each location) for testing.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.

Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile PAT file to create a RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 indicate a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double-precision integers.
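The scaling performed by convert2ascii amounts to a one-pass transformation of the RES values; the Python stand-in below assumes, for illustration, one probability value per RES line.

# Sketch: scale RES probabilities (0.0-1.0) to integers, as
# convert2ascii.exe does, so they can be stored in an integer raster.
# Python stand-in for the C utility; the RES line format is simplified
# to one probability value per line.
def res_to_asc(res_path, asc_path, scale=100000):
    with open(res_path) as res, open(asc_path, "w") as asc:
        for line in res:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # skip header/comment lines
            probability = float(line)         # value in [0.0, 1.0]
            asc.write("%d\n" % int(round(probability * scale)))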

We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.

4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full-input-variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values for activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highways contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop-one-out calibration (1) that all five variables contribute in a positive way toward the goodness of fit and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.
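The percent-difference curves of Fig. 11 reduce to a simple computation once the per-checkpoint MSE logs exist; the sketch below assumes a log layout of one MSE value per 100-cycle checkpoint.

# Sketch: percent difference in MSE between a drop-one-out model and the
# full five-variable model at each checkpoint (cf. Fig. 11). The log
# layout, one MSE value per 100-cycle checkpoint, is an assumption.
def read_mse_log(path):
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def percent_difference(full_log, dropped_log):
    full = read_mse_log(full_log)
    dropped = read_mse_log(dropped_log)
    # Positive values mean the dropped driver was helping the fit;
    # negative values (e.g., streams near cycle 7000) mean the model
    # briefly did better without that driver.
    return [100.0 * (d - f) / f for f, d in zip(full, dropped)]

diffs = percent_difference("mse_full.log", "mse_drop_streams.log")
print(diffs[494])  # checkpoint for cycle 49,500 if logging starts at cycle 100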

The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).

4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness-of-fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles along coastal areas contained errors in quantity of urban greater than 5%.

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.

The second goodness-of-fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric (PCM) was saved at the 101 x 101 window size (i.e., 3 km by 3 km), and the PCM values were merged with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.
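A scaleable-window percent correct computation can be sketched with numpy as follows; the code meanings assumed here (1 = observed change only, 2 = predicted change only, 3 = both observed and predicted) are illustrative, since the paper does not spell out the 01234 coding.

# Sketch: percent correct metric (PCM) within n-by-n windows over a
# coded validation map. The code meanings assumed here (1 = observed
# change only, 2 = predicted only, 3 = observed and predicted) are
# illustrative.
import numpy as np

def window_pcm(coded, n):
    # Trim to a whole number of windows, then block the array.
    rows = (coded.shape[0] // n) * n
    cols = (coded.shape[1] // n) * n
    blocks = coded[:rows, :cols].reshape(rows // n, n, cols // n, n)
    hits = (blocks == 3).sum(axis=(1, 3)).astype(float)
    observed = hits + (blocks == 1).sum(axis=(1, 3))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(observed > 0, 100.0 * hits / observed, np.nan)

# A 101-cell window on a 30 m grid spans roughly 3 km by 3 km.
coded = np.random.randint(0, 4, size=(4000, 4000))
pcm = window_pcm(coded, 101)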

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed a quantity model (Tayyebi et al., 2012) that contained the number of urban cells to grow for each polygon, at 10-year time steps from 2010 to 2060. We treated each state as a job, with all of the polygons within the state as separate tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level and create a national forecast map.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.

The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what each program did during execution. We embedded the paths of the standard out and standard error files in the tasks of the XML jobs files.
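The structure of these per-state job files can be illustrated with a small generator; the element and attribute names below are simplified stand-ins rather than the exact Windows HPC 2008 R2 job schema.

# Sketch: generate a simplified per-state XML jobs file with one task per
# polygon and stdout/stderr paths embedded in each task. Element and
# attribute names are illustrative, not the exact HPC job schema.
import xml.etree.ElementTree as ET

def build_state_job(state, cells_to_grow):
    job = ET.Element("Job", Name="LTM_Pred_%s" % state)
    tasks = ET.SubElement(job, "Tasks")
    for polygon_id, n_cells in sorted(cells_to_grow.items()):
        ET.SubElement(
            tasks, "Task",
            CommandLine="ltm_predict.exe %s %d %d" % (state, polygon_id, n_cells),
            StdOutFilePath=r"\\storage\ltm\logs\%s\%d.out" % (state, polygon_id),
            StdErrFilePath=r"\\storage\ltm\logs\%s\%d.err" % (state, polygon_id),
        )
    return ET.ElementTree(job)

build_state_job("IN", {101: 5400, 102: 870}).write("IN_pred_job.xml")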

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to the topographic limitations of this large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate on a high-performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate change models (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that, to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time rather than read large files into a large array; this was necessary despite the large amount of memory available on the HPC (see the sketch below).
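A Python stand-in for this line-at-a-time pattern is shown below; the C routines themselves are not reproduced here, and the ESRI ASCII header handling is simplified.

# Sketch: process a large ESRI ASCII grid one row at a time instead of
# loading it into a single array, mirroring the line-at-a-time C
# routines. Header handling is simplified to the standard six lines.
def threshold_asc(in_path, out_path, cutoff):
    with open(in_path) as src, open(out_path, "w") as dst:
        for _ in range(6):                 # copy the six header lines
            dst.write(src.readline())
        for row in src:                    # one raster row in memory at a time
            cells = ["1" if int(v) >= cutoff else "0" for v in row.split()]
            dst.write(" ".join(cells) + "\n")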

The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.

The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004, 2008). These include steps for: (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We found that jobs or tasks can fail for one of the following reasons. (1) One or more tasks in the job have failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to mosaic a few areas manually because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were mosaicked manually to create the national probability map for the USA (Fig. 13).

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores, with the remaining variance in processing time caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions, and produce output with multiple time steps, require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often exist at large extents, and coarse-resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. Because such a large area is likely to be heterogeneous, several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are individually relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.

The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine-resolution accuracy; (2) a quantification of the variability of fine-resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States; patterns of location error were different from those of quantity, being lower in the east (Fig. 12). The location of errors could also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness-of-fit metrics have not been designed with large big data simulations such as the one presented here in mind; thus, more research in this area is needed to make a full assessment of how well a model like this performs.

6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in the computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.

Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.

Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.

Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.

Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.

Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.

Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.

Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.

Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.

Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.

Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.

Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.

Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.

Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.

Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.

Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.

Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.

Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.

Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.

Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.

Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.

ESRI, 2011. ArcGIS 10 Software.

Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.

Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.

Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.

Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.

GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.

Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199–213.

Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.

Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.

Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.

Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.

Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.

Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.

Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.

Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.

Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.

Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.

LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.

Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.

Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.

Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.

Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.

MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.

Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.

Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.

Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.

Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.

Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.

Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.

Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.

Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.

Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.

Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.

Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.


and capabilities More importantly however this project repre-

sents a successful broad-scale modeling framework that has direct

applications to land use management

Finally we found that the LTM-HPC has some signi1047297cant ad-

vantages over the single workstation version of the LTM These

include

(1) Automated data preparation data can now be clipped and

converted to ASCII format automatically at the state county

or any other division using unique identity for the unit in

Python environment

(2) Better memory usage The source code for the model in C

environment has been changed making calculations per-

formedby LTM-HPCcompletelyindependent from the size of

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268266

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1819

the ASCII 1047297les by reading each code line separately into an

array using the C environment

(3) Ability to conduct simultaneous analyses LTM was not

designed to be used for different regions at the same time

LTM-HPC now uses a unique code for different regions in

XML format and can repeat all the processes simultaneously

for different regions

(4) Increased processing speed The previous version of LTM had

many disconnected steps (Pijanowski et al 2002a) which

were carried out sequentially using different DOS-level

commands All XML 1047297les are now uploaded into an HPC

environment and all modeling steps are automatically

processed

References

Adeloye AJ Rustum R Kariyama ID 2012 Neural computing modeling of thereference crop evapotranspiration Environ Model Softw 29 61e73

Anselme B Bousquet F Lyet A Etienne M Fady B Le Page C 2010 Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France)Environ Model Softw 25 (11) 1385e1398

Bennett ND Croke BFW Guariso G Guillaume JHA Hamilton SH Jakeman AJ Marsili-Libelli S Newhama LTH Norton JP Perrin CPierce SA Robson B Seppelt R Voinov AA Fath BD Andreassian V 2013Environ Model Softw 40 1e20

Bertram P Stadler-Salt N Horvatin P Shear H 2003 Bi-national assessment of the Great Lakes SOLEC partnerships Environ Monit Assess 81 (1e3) 27e33

Bishop CM 1995 Neural Networks for Pattern Recognition Oxford UniversityPress Oxford

Bishop CM 2005 Neural Networks for Pattern Recognition Oxford UniversityPress ISBN 0-19-853864-2

Bonan GB 2008 Forests and climate change forcings feedbacks and the climatebene1047297ts of forests Science 320 (5882) 1444e1449

Boutt DF Hyndman DW Pijanowski BC Long DT 2001 Identifying potentialland use-derived solute sources to stream base1047298ow using ground water modelsand GIS Ground Water 39 (1) 24e34

Burton A Kilsby C Fowler H Cowpertwait P OrsquoConnell P 2008 RainSim aspatial-temporal stochastic rainfall modeling system Environ Model Softw 23(12) 1356e1369

Buyya R (Ed) 1999 High Performance Cluster Computing Architectures andSystems vol 1 Prentice Hall Englewood Cliffs NJ

Carpani M Bergez JE Monod H 2012 Sensitivity analysis of a hierarchicalqualitative model for sustainability assessment of cropping systems EnvironModel Softw 27e28 15e22

Chapman T 1998 Stochastic modelling of daily rainfall the impact of adjoiningwet days on the distribution of rainfall amounts Environ Model Softw 13 (3e4) 317e324

Cheung AL Reeves Anthony P 1992 High performance computing on a cluster of workstations HPDC 1992 152e160

Clarke KC Gazulis N Dietzel C Goldstein NC 2007 A decade of SLEUTHing lessons learned from applications of a cellular automatonland use change model In Classics from IJGIS Twenty Years of theInternational Journal of Geographical Information Systems and Sciencepp 413e425

Cox PM Betts RA Jones CD Spall SA Totterdell IJ 2000 Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate modelNature 408 (6809) 184e187

Dale VH 1997 The relationship between land-use change and climate changeEcol Appl 7 (3) 753e769

DavisAYPijanowskiBCRobinson KD KidwellPB 2010aEstimating parkinglot

footprintsin theUpperGreatLakes regionof theUSA Landsc UrbanPlan 96 (2)68e77

Davis AY Pijanowski BC Robinson K Engel B 2010b The environmental andeconomic costs of sprawling parking lots in the United States Land Use Policy27 (2) 255e261

Denoeux T Lengelleacute1993 Initializing back propagation networks with prototypesNeural Netw 6 351e363

Dietzel C Clarke K 2006 The effect of disaggregating land use categories incellular automata during model calibration and forecasting Comput EnvironUrban Syst 30 (1) 78e101

Dietzel C Clarke KC 2007 Toward optimal calibration of the SLEUTH land usechange model Trans GIS 11 (1) 29e45

Dixon RK Winjum JK Andrasko KJ Lee JJ Schroeder PE 1994 Integratedland-use systems assessment of promising agroforest and alternative land-usepractices to enhance carbon conservation and sequestration Clim Change 27(1) 71e92

Dlamini W 2008 A Bayesian belief network analysis of factors in1047298uencing wild1047297reoccurrence in Swaziland Environ Model Softw 25 (2) 199e208

ESRI 2011 ArcGIS 10 Software

Fei S Kong N Stinger J Bowker D 2008 In Ravinder K Jose S Singh HBatish D (Eds) Invasion Pattern of Exotic Plants in Forest Ecosystems InvasivePlants and Forest Ecosystems CRC Press Boca Raton FL pp 59e70

Fitzpatrick M Long D Pijanowski B 2007 Biogeochemical 1047297ngerprints of landuse in a regional watershed Appl Biogeochem 22 1825e1840

Foster I Kesselman C 1997 Globus a metacomputing infrastructure toolkit Int JSupercomput Appl 11 (2) 115e128

Foster DR Hall B Barry S Clayden S Parshall T 2002 Cultural environmentaland historical controls of vegetation patterns and the modern conservationsetting on the island of Martha rsquos Vineyard USA J Biogeogr 29 1381e1400

GLP 2005 Science Plan and Implementation Strategy IGBP Report No 53IHDPReport No 19 IGBP Secretariat Stockholm 64 pp

Grimm NB Redman CL 2004 Approaches to the study of urban ecosystems thecase of Central ArizonadPhoenix Urban Ecosyst 7 (3) 199e213

Guo LB Gifford RM 2002 Soil carbon stocks and land use change a metaanalysis Glob Change Biol 8 (4) 345e360

Herold M Goldstein NC Clarke KC 2003 The spatiotemporal form of urbangrowth measurement analysis and modeling Remote Sens Environ 86 (3)286e302

Herold M Couclelis H Clarke KC 2005 The role of spatial metrics in the analysisand modeling of urban land use change Comput Environ Urban Syst 29 (4)369e399

Hey AJ 2009 The Fourth Paradigm Data-intensive Scienti1047297c Discovery Jacobs A 2009 The pathologies of big data Commun ACM 52 (8) 36e44Kampe TU Johnson BR Kuester M Keller M 2010 NEON the 1047297rst continental-

scale ecological observatory with airborne remote sensing of vegetation canopybiochemistry and structure J Appl Remote Sens 4 (1) 043510e043510

Kaye JP Groffman PM Grimm NB Baker LA Pouyat RV 2006 A distincturban biogeochemistry Trends Ecol Evol 21 (4) 192e199

Kilsby C Jones P Burton A Ford A Fowler H Harpham C James P Smith AWilby R 2007 A daily weather generator for use in climate change studiesEnviron Model Softw 22 (12) 1705e1719

Lagabrielle E Botta A Dareacute W David D Aubert S Fabricius C 2010 Modelingwith stakeholders to integrate biodiversity into land-use planning lessonslearned in Reacuteunion Island (Western Indian Ocean) Environ Model Softw 25(11) 1413e1427

Lambin EF Geist HJ (Eds) 2006 Land Use and Land Cover Change Local Pro-cesses and Global Impacts Springer

LaValle S Lesser E Shockley R Hopkins MS Kruschwitz N 2011 Big data analytics and the path from insights to value MIT Sloan Manag Rev 52 (2)21e32

Lei Z Pijanowski BC Alexandridis KT Olson J 2005 Distributed modelingarchitecture of a multi-agent-based behavioral economic landscape (MABEL)model Simulation 81 (7) 503e515

Loepfe L Martiacutenez-Vilalta J Pintildeol J 2011 An integrative model of humanin1047298uenced 1047297re regimes and landscape dynamics Environ Model Softw 26 (8)1028e1040

Lynch C 2008 Big data how do your data grow Nature 455 (7209) 28e

29Mas JF Puig H Palacio JL Sosa AA 2004 Modeling deforestation using GIS andarti1047297cial neural networks Environ Model Softw 19 (5) 461e471

MEA Millennium Ecosystem Assessment 2005 Ecosystems and Human Well-being Current State and Trends Island Press Washington DC

Merritt WS Letcher RA Jakeman AJ 2003 A review of erosion and sedimenttransport models Environ Model Softw 18 (8e9) 761e799

Mishra V Cherkauer K Niyogi D Ming L Pijanowski B Ray D Bowling L2010 Regional scale assessment of land useland cover and climatic changes onsurface hydrologic processes Int J Climatol 30 2025e2044

Moore N Torbick N Lofgren B Wang J Pijanowski B Andresen J Kim DOlson J 2010 Adapting MODIS-derived LAI and fractional cover into theRegional Atmospheric Modeling System (RAMS) in East Africa Int J Climatol30 (3) 1954e1969

Moore N Alargaswamy G Pijanowski B Thornton P Lofgren B Olson JAndresen J Yanda P Qi J 2011 East African food security as in1047298uenced byfuture climate change and land use change at local to regional scales ClimChange httpdxdoiorg101007s10584-011-0116-7

Olson J Alagarswamy G Andresen J Campbell D Davis A Ge J Huebner M

Lofgren B Lusch D Moore N Pijanowski B Qi J Thornton P Torbick NWang J 2008 Integrating diverse methods to understand climate-land in-teractions in east Africa GeoForum 39 (2) 898e911

Pekin BK Pijanowski BC 2012 Global land use intensity and the endangermentstatus of mammal species Divers Distrib 18 (9) 909e918

Peralta J Li X Gutierrez G Sanchis A 2010 July Time series forecasting byevolving arti1047297cial neural networks using genetic algorithms and differentialevolution Neural Netw e IJCNN 1e8

Peacuterez-Vega A Mas JF Ligmann A 2012 Comparing two approaches to land usecover change modeling and their implications for the assessment of biodiver-sity loss in a deciduous tropical forest Environ Model Softw 29 11e23

Pickett ST Burch Jr WR Dalton SE Foresman TW Grove JM Rowntree R1997 A conceptual framework for the study of human ecosystems in urbanareas Urban Ecosyst 1 (4) 185e199

Pielke RA 2005 Land use and climate change Science 310 (5754) 1625e1626Pijanowski BC Long DT Gage SH Cooper WE 1997 June A Land Trans-

formation Model Conceptual Elements Spatial Object Class Hierarchies GISCommand Syntax and an Application for Michiganrsquos Saginaw Bay Watershed InSubmitted to the Land Use Modeling Workshop USGS EROS Data Center

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919

Page 14: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1419

with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States with large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML job files to construct the forecast maps at the national scale. Our quantity model (Tayyebi et al., 2012) contained the number of urban cells to grow for each polygon at 10-year time steps from 2010 to 2060. We treated each state as a job, and all the polygons within the state as separate tasks, to create a forecast map for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file, and we ran this job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard output and standard error files provided a record of what the program did during execution. We embedded the paths of the standard output and standard error files in the tasks of the XML job file.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scalable window size of 3 × 3 km.
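To make the structure of these job files concrete, the following is a minimal sketch, in Python, of how a per-state job description with one task per polygon might be generated. The element and attribute names follow the general shape of Windows HPC Server 2008 R2 job descriptions but are simplified here, and all paths, executable names and polygon identifiers are illustrative rather than the exact convention used in this study.

    import xml.etree.ElementTree as ET

    def build_state_job(state_code, polygons, out_path):
        # polygons: list of (polygon_id, cells_to_grow) pairs from the quantity model
        job = ET.Element("Job", Name="LTM_Pred_%s" % state_code)
        tasks = ET.SubElement(job, "Tasks")
        for poly_id, cells in polygons:
            ET.SubElement(
                tasks, "Task",
                Name="pred_%s_%s" % (state_code, poly_id),
                CommandLine=r"D:\LTM\bin\predict.exe %s %d" % (poly_id, cells),
                StdOutFilePath=r"D:\LTM\logs\%s_%s.out" % (state_code, poly_id),
                StdErrFilePath=r"D:\LTM\logs\%s_%s.err" % (state_code, poly_id))
        ET.ElementTree(job).write(out_path, xml_declaration=True, encoding="utf-8")

    # Example: a state job with two polygon tasks (identifiers are invented)
    build_state_job("IN", [("18001", 5200), ("18003", 11800)], "XML_Pred_IN.xml")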

We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic constraints on that large metropolitan area, whereas dispersed urbanization is characteristic of flat areas such as Florida, Atlanta and the Northeast.

5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas, such as watersheds (Pijanowski et al., 2000, 2002b). However, there is a need for larger land change models, especially ones that can be coupled to large-scale process models such as climate (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome: (1) processing of large databases; (2) the management of large numbers of files; (3) the need for a high-level architecture that integrates model components; (4) error checking; and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.

5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas, we had to manage the state as separate counties. Programs written in C# had to read and write lines of data one at a time rather than read large files into a single large array. This was necessary despite the large amount of memory available on the HPC.
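The streaming readers themselves are written in C#, but the pattern is simple enough to sketch in a few lines of Python; memory use stays constant because only one row of the ESRI ASCII grid is held at a time (the file names and the per-cell reclassification below are illustrative only):

    def stream_ascii_grid(src_path, dst_path, transform):
        with open(src_path) as src, open(dst_path, "w") as dst:
            for _ in range(6):             # copy the six standard header lines:
                dst.write(src.readline())  # ncols, nrows, xllcorner, yllcorner, cellsize, NODATA
            for line in src:               # process one row of cells at a time
                dst.write(" ".join(transform(v) for v in line.split()) + "\n")

    # Example: threshold a probability grid to urban (1) / non-urban (0)
    stream_ascii_grid("prob.asc", "urban.asc",
                      lambda v: "0" if v == "-9999" else ("1" if float(v) >= 0.5 else "0"))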

The large number of files was managed using a standard file-naming code system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
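As an illustration of how such a coding system supports mechanical path and XML construction, the sketch below encodes the modeling step, unit type, unit code and year into both folder and file names; the scheme is hypothetical, not our exact convention:

    import os

    def ltm_path(root, step, unit_type, unit_code, year, ext="asc"):
        # e.g. step="pred", unit_type="state", unit_code="IN"
        name = "%s_%s_%s_%d.%s" % (step, unit_type, unit_code, year, ext)
        return os.path.join(root, step, unit_type, unit_code, name)

    print(ltm_path(r"D:\LTM", "pred", "state", "IN", 2050))
    # D:\LTM\pred\state\IN\pred_state_IN_2050.asc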

The high-level architecture was designed following the steps outlined by prominent land change modeling scientists (Pontius et al., 2004, 2008): (1) data sampling from input files; (2) training; (3) calibration; (4) validation; and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.

We found that jobs or tasks can fail for one of the following reasons. (1) One or more tasks in the job failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executables fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory.
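Because failure cause (1) surfaces through the standard output and standard error files, a short post-run scan of the log folder can flag tasks to re-queue. The sketch below assumes the hypothetical log layout used in the earlier examples, where each task writes a .out and a .err file:

    import glob, os

    def find_failed_tasks(log_dir):
        failed = []
        for err in glob.glob(os.path.join(log_dir, "*.err")):
            task = os.path.splitext(os.path.basename(err))[0]
            out = os.path.join(log_dir, task + ".out")
            # a non-empty error file, or a missing output file, marks a failure
            if os.path.getsize(err) > 0 or not os.path.exists(out):
                failed.append(task)
        return failed

    print(find_failed_tasks(r"D:\LTM\logs"))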

5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to mosaic a few areas manually because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
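A minimal sketch of such a script follows, assuming the arcpy Mosaic To New Raster geoprocessing tool available in ArcGIS 10 and the hypothetical folder layout from the examples above; the output name and pixel depth are illustrative:

    import os, glob
    import arcpy

    def mosaic_state(state_dir, out_dir, state_code):
        pieces = glob.glob(os.path.join(state_dir, "*.asc"))
        # stitch the per-polygon prediction rasters into one state raster
        arcpy.MosaicToNewRaster_management(
            ";".join(pieces),              # input rasters
            out_dir,                       # output workspace
            state_code + "_pred.tif",      # output raster name
            pixel_type="8_BIT_UNSIGNED",
            number_of_bands=1)

    mosaic_state(r"D:\LTM\pred\state\IN", r"D:\LTM\mosaic", "IN")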

A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the time required relative to the LTM is approximately halved for every doubling of cores; deviations from this ideal scaling are caused by variance in file size across spatial units. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
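A back-of-the-envelope version of this scaling behavior, under the idealizing assumption of perfectly even task sizes (the 400-hour serial runtime is invented for illustration):

    def estimated_runtime(serial_hours, cores):
        # ideal data-parallel scaling: runtime halves per doubling of cores
        return serial_hours / float(cores)

    for cores in (1, 2, 4, 8, 16):
        print(cores, "cores:", estimated_runtime(400.0, cores), "hours")
    # 400.0, 200.0, 100.0, 50.0 and 25.0 hours respectively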

We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.

5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions, and produce output at multiple time steps, require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at coarse resolutions, or alternatively at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how the patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Impact Calculator (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.

The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that, because such a large area may be heterogeneous, several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
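One way such a higher-level rule could be organized is sketched below: a region-level attribute selects which set of transition rules applies within that area. The region classes, threshold and rule names are invented for illustration:

    TRANSITION_RULES = {
        "rapidly_urbanizing": ["forest->urban", "agriculture->urban"],
        "agricultural_frontier": ["forest->agriculture", "shrubland->forest"],
    }

    def rules_for(region):
        # the higher-level rule: classify the region, then select its rule set
        kind = ("rapidly_urbanizing" if region["urban_growth_rate"] > 0.02
                else "agricultural_frontier")
        return TRANSITION_RULES[kind]

    print(rules_for({"urban_growth_rate": 0.05}))
    # ['forest->urban', 'agriculture->urban']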

The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).

5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors were greater in the eastern portion of the United States for quantity; patterns of location error differed from those of quantity, with goodness of fit better in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed for big data simulations such as the one presented here; thus, more research in this area is needed to make a full assessment of how well a model like this performs.
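To make the windowed assessment concrete, the sketch below computes a percent-correct-match style score in non-overlapping 3 × 3 km windows (100 × 100 cells at 30 m resolution); the inputs are assumed to be 0/1 arrays of observed and predicted urban cells, and the array sizes are illustrative:

    import numpy as np

    def windowed_pcm(observed, predicted, win=100):
        out = np.zeros((observed.shape[0] // win, observed.shape[1] // win))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                o = observed[i*win:(i+1)*win, j*win:(j+1)*win]
                p = predicted[i*win:(i+1)*win, j*win:(j+1)*win]
                out[i, j] = (o == p).mean()  # fraction of matching cells
        return out

    obs = np.random.randint(0, 2, (300, 300))
    pred = np.random.randint(0, 2, (300, 300))
    print(windowed_pcm(obs, pred))  # a 3 x 3 map of window-level agreement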

6. Conclusions

This paper presents an application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computation framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in the computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for each unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C# environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for each region in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) that were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN) 2010, pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: conceptual elements, spatial object class hierarchies, GIS command syntax and an application for Michigan's Saginaw Bay watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919

Page 15: A Big Data Urban Growth big dataSimulation at a National Scale

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1519

polygons within the state as different tasks to create forecast maps

of each polygon We embedded the path of prediction code and

number of urban cells to grow for each polygon within

XML_Pred_BASE job 1047297le We ran XML_Pred_BASE job 1047297le for each

state on HPC to convert the probability map to forecast map for

each polygon Then we ran the Mosaic_Python script on the HPC

using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at

the polygon level to create forecast maps at state levelSimilarly we

ran Mosaic_Python script on HPC using XML_Pred_Mosaic_Na-

tional to mosaic prediction pieces at state level to create a national

forecast map HPC also enabled us to export error messages in error

1047297les so that if any of tasks fail in a job using standard out and

standard error 1047297les to have records of program did during execu-

tion We also embedded the path of standard out and standard

error 1047297les in the tasks of the XML jobs 1047297le

Wegenerated decadal maps of land use from 2010 through 2050

from this simulation Maps of new urban (red) superimposed on

2006 land usecover from the USGS National Land Cover Maps for

eight regions are shown in Fig 13 Note that the model produces

different spatial patterns of urbanization depending on the loca-

tion urbanization in the Los AngeleseSan Diego region are more

clumped likely to due to topographic limitations of the area in the

large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast

5 Discussion

We presented an overview of the conversion of a single work-

station land change model that has been converted to operate using

a high performance computer cluster The Land Transformation

Model was originally developedto simulatesmall areas (Pijanowski

et al 2000 2002b) such as watersheds However there is a need

for larger sized land change models especially those that can be

coupled to large-scale process models such as climate change (cf

Olson et al 2008 Pijanowski et al 2011) and dynamic hydrologic

models (Yang et al 2010 Mishra et al 2010) We have argued that

to accomplish the goal of increasing the size of the simulationseveral challenges had to be overcome These included (1) pro-

cessing of large databases (2) the management of large numbers of

1047297les (3) the need for a high-level architecture that integrates

model components (4) error checking and (5) the management of

multiple job executions Here we brie1047298y discuss how we addressed

these challenges as well as lessons learned in porting the original

LTM to an HPC environment

51 Challenges of executing large-scale models

We found that the large datasets used for input and output were

dif 1047297cult to manage successfully within ArcGIS 100 The 1047297les had to

be managed as smaller subsets either as states or regions (ie

multiple states) and in the case of Texas we had to manage thisstate as separate counties Programs written in C had to read and

write lines of data at a time rather than read large 1047297les into a large

array This is needed despite a large amount of memory contained

in the HPC

The large number of 1047297les were managed using a standard 1047297le

naming coding system and hierarchical arrangement of folders on

our storage server The coding system also helped us to construct

the xml 1047297le content used by the job manager in Windows 2008

Server R2

The high-level architecture was designed after the proper steps

that have been outlined by prominent land change modeling sci-

entists (Pontius et al 2004 Pontius et al 2008) These include

steps for (1) data sampling from input 1047297les (2) training (3) cali-

bration (4) validation and (5) application Job 1047297

les were

constructed for steps the interfaced each of these modeling steps

In fact we found quickly discovered that the most logical directory

structure mirrored the high-level architecture of the model

We experienced that jobs or tasks can fail because of one of the

following errors (1) one or more tasks in the job have failed This

indicates that one or more tasks could not be run or did not com-

plete successfully We speci1047297ed standard output and standard error

1047297les in the job description to determine which executable 1047297les fail

during execution (2) A nodeassignedto the job ortaskcouldnot be

contacted Jobs or tasks that fail because of a node falling out of

contact are automatically retried a certain numberof times but will

eventually failif the problem continues (3) The run time for a job or

task run expired The job scheduler service cancels jobs or tasks

that reach the end of their run time (4) A 1047297le location required by

the job or task could not be accessed A frequent cause of task

failures is inaccessibility of required 1047297le locations including the

standard input output and error 1047297les and the working directory

locations

52 Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simula-

tion meant that simple folder structure were only needed whichmade it easy to mosaic manually However the prediction output

was stored by state and by year which made mosaicking a time

consuming and an error prone process in some cases we needed

to manually mosaic a few areas as the job manager would crash

The HPC was employed to speed up the mosaicking process but this

was not a fail-safe process A short python script that ran the ArcGIS

mosaic raster tool was the heart of the process The 9000 network1047297les of LTM-HPC generated from training run were applied to each

pattern 1047297le derived from boxes that contained all of the cells in the

USA except those within the exclusionary zone Finally states were

manually mosaicked to create the national probability map for USA

(Fig 13)

A windows HPC cluster was used to decrease the time required

to process the data by running the model on multiple spatial unitssimultaneously The time required to run LTM can be thought of as

the time it would take to run the LTM-HPC serially When running

the LTM-HPC the amount of time required relative to LTM is

approximately halved for every doubling of cores This variance in

processing time is caused by variance in 1047297le size The HPC also

provides additional bene1047297ts to researchers who are interested in

running large-scale models These include the reduction in the

need for human control of various steps which thereby reduces the

changes of human error It also allows researchers to execute the

model in a variety of con1047297gurations (eg here we were able to run

the model using different spatial units testing issues related to

scale) allowing for researchers to run ensembles

We also found that developing and executing the model across

three computer systems (data storage data processing and codingand simulation) worked well Delegating tasks to each of these

helped to manage work 1047298ow and optimize the purpose of each

computer system

53 Needs for land change model forecasts at large extents and 1047297ne

resolution

Models that must simulate large areas at 1047297ne resolutions and

produce output that has multiple time steps require the handling

and management of big data Environmental simulations have

traditionally focused on small spatial extents at 1047297ne resolutions to

produce the required output However environmental problems

are often at large extents and coarse resolution simulations or

alternatively at small extents and 1047297

ne resolutions may hinder the

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268264

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1619

ability to assess impacts at the necessary scale Land change models

are often used to assess how human use of the land may impact

ecosystem health It is well known that land use cover change

impacts ecosystem processes at a variety of spatial scales ( Reid

et al 2010 GLP 2005 Lambin and Geist 2006) Some of the

most frequently cited ecosystem impacts include how land use

change at large extents affect the total amount of carbon

sequestered in aboveground plants and soils in a region (eg Dixon

et al 1994 Post and Kwon 2000 Cox et al 2000 Vleeshouwers

and Verhagen 2002 Guo and Gifford 2002) how patterns and

amounts of certain land covers (eg forests urban) affect invasive

species spread and distributions (eg Sharov et al 1999 Fei et al

2008) how land surface properties feedback to the atmosphere

through alterations of water and energy 1047298

uxes (eg Dale 1997

Fig 13 LTM 2050 urban change forecasts for different regions (For interpretation of the references to color in this 1047297gure legend the reader is referred to the web version of this

article)

BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 265

8122019 A Big Data Urban Growth big dataSimulation at a National Scale

httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1719

Pielke 2005 Bonan 2008 Pijanowski et al 2011) how certain

land uses such as urban and agriculture increase nutrients and

pollutants to surface and ground water bodies (Pijanowski et al

2002b Tang et al 2005ab) and how land use patterns affect

biodiversity of terrestrial (Pekin and Pijanowski 2012) and aquatic

ecosystems such as freshwater 1047297sh organisms (Wiley et al 2010)

In all cases more urban decreases ecosystem health (cf Pickett

et al 1997 Reid et al 2010 Grimm and Redman 2004 Kaye

et al 2006)

Assessment of land use change impacts has often occurred by

coupling land change models to other environmental models For

example the LTM has been coupled to the Regional Atmospheric

Modeling Systems (RAMS) to assess how land use change might

impact precipitation patterns at subcontinental scales in East Africa

(Moore et al 2010) coupled to the Variable Impact Calculator (VIC)

model in the Great Lakes basin (Yang 2011) to the Long-Term Hy-

drologic Impact Assessment (L-THIA) model assess how land use

change might impact overland 1047298ow patterns in large regional wa-

tersheds and how nutrient1047298uxes and pollutants from urban change

would impact stream ecosystem health in large watersheds (Tang

et al 2005ab) The next step in our development will be to

couple the output of this model to a variety of environmental

impact models that are spatially explicit We intend to conduct thatwork in the HPC environment using the principles that we outline

above

The LTM-HPC model presented here can also be modi1047297ed to

address other land change transitions For example the LTM-

HPC can be con1047297gured to simulate multiple transitions at a

time this might include the loss of urban along with urban gain

(which is simulated here) or include the numerous land tran-

sitions common to many areas of the world namely the loss of

natural lands like forests to agriculture the shift of agriculture

to urban the loss of forests to urban and the transition of

recently disturbance areas (eg shrubland) to more mature

natural lands like forests To accomplish multiple transitions a

variety of rules need to be explored further to determine how

they would be applied to the model It is also quite possible thatsuch a large may be heterogeneous several transition rules may

need to be applied in the same simulation with rules applied to

areas based on another higher-level rule The LTM-HPC could

also be con1047297gured to simulate subclasses of land use following

Dietzel and Clarke (2006) For example within the urban class

parking lots in the United States cover large extents but are

relatively small areas (Davis et al 2010ab) such an application

could require the LTM-HPC because 1047297ne resolutions would be

needed Likewise simulating crop cover types annually at a

national scale (cf Plourde et al 2013) could product a consid-

erable amount of temporal information At a global scale we

have found that subclasses of land usecover in1047298uence species

diversity patterns especially those vertebrates that are rare and

of a threatened category (Pekin and Pijanowski 2012) and soglobal scale simulations are likely to need models like LTM-HPC

The LTM-HPC could also support national or regional scale

environmental programmatic assessments that are becoming more

common supported by national government agencies These

include the 2013 United States National Climate Assessment Pro-

gram (USGCRP 2013) National Ecological Observation Network

(NEON) supported in the United States by the National Science

Foundation (Schimel et al 2007 Kampe et al 2010) and the Great

Lakes RestorationInitiative which seeks to develop State of the Lake

Ecosystem metrics of ecosystem services (SOLEC Bertram et al

2003 WHCEC 2010) In Europe an EU15 set of land use forecasts

have been used extensively to study the impacts of land use and

climate change on this continentrsquos ecosystem services (cf

Rounsevell et al 2006)

54 Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States. Patterns of location error were different from those of quantity error; they were lower in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
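To make concrete what separating quantity from location error at an aggregated scale involves, the sketch below is a simplified, hypothetical calculation in the spirit of Pontius-style disagreement measures, not the paper's actual metric. It aggregates 30 m binary change maps into 3 km blocks and splits each block's disagreement into a quantity component and a location component.

import numpy as np

def block_disagreement(obs, sim, block=100):
    # obs, sim: boolean 30 m grids of observed and simulated urban change;
    # block=100 aggregates 30 m cells into 3 km blocks.
    h = obs.shape[0] - obs.shape[0] % block
    w = obs.shape[1] - obs.shape[1] % block
    o = obs[:h, :w].reshape(h // block, block, w // block, block)
    s = sim[:h, :w].reshape(h // block, block, w // block, block)
    obs_n = o.sum(axis=(1, 3))          # observed change cells per block
    sim_n = s.sum(axis=(1, 3))          # simulated change cells per block
    total = (o != s).sum(axis=(1, 3))   # all disagreeing cells per block
    quantity = np.abs(sim_n - obs_n)    # error in the amount of change
    location = total - quantity         # disagreement left once quantity is removed
    return quantity, location

For binary maps, the location component equals twice the smaller of the per-block omission and commission counts, so it is zero wherever disagreement is purely a matter of how much change was predicted.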

6. Conclusions

This paper presents an application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, up to an entire continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for each unit, in the Python environment (see the sketch following this list).

(2) Better memory usage: the source code for the model in the C# environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files, by reading each line of codes separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for each region, in XML format, and can repeat all of the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
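As a concrete illustration of advantages (1) and (2), the minimal Python sketch below assumes an ArcGIS 10.x (arcpy) environment for the per-unit clip-and-convert step, and pairs it with a pure-Python reader that streams an ASCII grid one row at a time so that memory use does not depend on file size; the streaming reader stands in for the model's C# reader. The feature class, field, and path names are hypothetical.

import arcpy  # assumes an ArcGIS 10.x (Python 2.7) environment

def clip_units_to_ascii(national_raster, units_fc, id_field, out_dir):
    # units_fc: hypothetical polygon feature class of states or counties;
    # id_field: its unique-identifier field.
    with arcpy.da.SearchCursor(units_fc, [id_field]) as cursor:
        for (uid,) in cursor:
            layer = arcpy.MakeFeatureLayer_management(
                units_fc, "unit_%s" % uid, "%s = %s" % (id_field, uid))
            clipped = "%s/clip_%s.tif" % (out_dir, uid)
            # Clip to the unit's geometry, then export to ASCII grid format.
            arcpy.Clip_management(national_raster, "#", clipped, layer,
                                  "-9999", "ClippingGeometry")
            arcpy.RasterToASCII_conversion(clipped, "%s/unit_%s.asc" % (out_dir, uid))

def stream_ascii_rows(path):
    # Yield one row of grid codes at a time; memory use is independent
    # of the total file size.
    with open(path) as f:
        for _ in range(6):   # skip ncols, nrows, xllcorner, yllcorner,
            next(f)          # cellsize, NODATA_value header lines
        for line in f:
            yield [int(float(v)) for v in line.split()]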

References

Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.

Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.

Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.

Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.

Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.

Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.

Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial–temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.

Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.

Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.

Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.

Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.

Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.

Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.

Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.

Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.

Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.

Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.

Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.

Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.

Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.

Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.

ESRI, 2011. ArcGIS 10 Software.

Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.

Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.

Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.

Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.

GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.

Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.

Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.

Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.

Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.

Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.

Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.

Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.

Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.

Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.

Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.

Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.

LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.

Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.

Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.

Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.

Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.

MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.

Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.

Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.

Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.

Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.

Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate–land interactions in East Africa. GeoForum 39 (2), 898–911.

Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.

Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.

Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.

Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.

Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.

Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997, June. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
