A big data urban growth simulation at a national scale: Configuring the GIS and neural network based Land Transformation Model to run in a High Performance Computing (HPC) environment
Bryan C. Pijanowski a, Amin Tayyebi a,b, Jarrod Doucette a, Burak K. Pekin a,c, David Braun d,e, James Plourde a,f
a Department of Forestry and Natural Resources, Purdue University, 195 Marsteller Street, West Lafayette, IN 47907, USA
b Department of Entomology, University of Wisconsin, Madison, WI 53706, USA
c Institute for Conservation Research, San Diego Zoo Global, 15600 San Pasqual Valley Road, Escondido, CA 92027, USA
d Rosen Center for Advanced Computing, Information Technology Division, Purdue University, West Lafayette, IN 47907, USA
e Thavron Solutions, Kokomo, IN 46906, USA
f Worldwide Construction and Forestry Division, John Deere, 1515 5th Avenue, Moline, IL 61265, USA
Corresponding author: B.C. Pijanowski. Tel.: +1 765 496 2215. E-mail address: bpijanow@purdue.edu
Article info
Article history:
Received 5 April 2013
Received in revised form 18 September 2013
Accepted 23 September 2013
Available online 7 November 2013
Keywords:
Land use land cover change
Big data simulation
Land Transformation Model
High Performance Computing
Extensible Markup Language
Python environment
Visual Studio 10 (C++)
Continental scale
Abstract
The Land Transformation Model (LTM) is a Land Use Land Cover Change (LUCC) model which was originally developed to simulate local scale LUCC patterns. The model uses a commercial Windows-based GIS program to process and manage spatial data and an artificial neural network (ANN) program within a series of batch routines to learn about spatial patterns in data. In this paper, we provide an overview of a redesigned LTM capable of running at continental scales and at a fine (30 m) resolution using a new architecture that employs a Windows-based High Performance Computing (HPC) cluster. This paper provides an overview of the new architecture, which we discuss within the context of modeling LUCC that requires: (1) using an HPC to run a modified version of our LTM; (2) managing large datasets in terms of size and quantity of files; (3) integration of tools that are executed using different scripting languages; and (4) a large number of steps necessitating several aspects of job management.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction
The Land Transformation Model was developed over fifteen years ago (Pijanowski et al., 1997, 2000, 2002a,b) to simulate spatial patterns of land use land cover change (LUCC) over time. The model uses geographic information systems (GIS) to process and manage spatial data layers and artificial neural network (ANN) tools to learn about patterns in input (i.e. drivers) and output (e.g. historical land use change) data. The model has been used to forecast LUCC patterns in a variety of places around the world, such as the Midwest USA (Pijanowski et al., 2005), central Europe (Pijanowski et al., 2006), East Africa (Olson et al., 2008; Washington-Ottombre et al., 2010; Pijanowski et al., 2011) and Asia (Pijanowski et al., 2009). Forecasts are often linked to climate (Moore et al., 2010, 2011), hydrologic (Tang et al., 2005a,b; Yang et al., 2010) or biological (Wiley et al., 2010) models to examine how what-if LUCC scenarios impact the environment (e.g. Ray et al., 2011) and/or economics (Skole et al., 2002). The LTM has even been engineered to run "backwards" (Ray and Pijanowski, 2010) in order to examine environmental impacts of historical LUCC or the effects of land use legacies on slow environmental processes, such as groundwater transport through watersheds (Wayland et al., 2002; Pijanowski et al., 2007; Ray et al., 2012). The LTM has recently been extended to simulate and predict urban boundary change (Pijanowski et al., 2009; Tayyebi and Perry, 2013), which can be used by urban planners and managers interested in the control of urban growth.
Modeling, especially in a spatially explicit way, allows for conducting experiments that quantify the importance of various LUCC drivers, contributing to a better understanding of key LUCC processes (Veldkamp and Lambin, 2001; Burton et al., 2008; Pontius
Contents lists available at ScienceDirect: Environmental Modelling & Software, journal homepage: www.elsevier.com/locate/envsoft
Environmental Modelling & Software 51 (2014) 250-268
1364-8152/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.envsoft.2013.09.015
and Petrova, 2010; Anselme et al., 2010; Pérez-Vega et al., 2012). Large-scale LUCC models are needed to understand regional, continental to global scale problems like climate change (Chapman, 1998; Kilsby et al., 2007; Merritt et al., 2003), human impacts to ecosystem services (MEA, 2005), alterations to carbon sequestration (Post and Kwon, 2000) and dynamics of biogeochemical cycling (Boutt et al., 2001; Pijanowski et al., 2002b; Wayland et al., 2002; Turner et al., 2003; GLP, 2005; Fitzpatrick et al., 2007; Anselme et al., 2010; Carpani et al., 2012). One of the characteristics of all LTM applications to date is that the size of the simulation has been small enough to run on a single advanced workstation. However, as models originally designed for local to regional simulations are needed at continental scales or larger, a redesign of single workstation models such as the LTM becomes necessary. Indeed, recent calls by the scientific and policy community for the development and use of large-scale models (e.g. the Earth Systems Science community; e.g. Randerson et al., 2009; Xue et al., 2010; Lagabrielle et al., 2010) underscore the importance of focusing attention on large-scale environmental modeling.
Increasing the size of any computer simulation like the LTM has several challenges (Herold et al., 2003, 2005; Lei et al., 2005; Dietzel and Clarke, 2007; Clarke et al., 2007; Adeloye et al., 2012). First is the need to manage large datasets that are used as inputs and are output by the model (Yang, 2011; Loepfe et al., 2011). The national-scale application of the LTM that we present here simulates LUCC for the lower 48 states of the USA at 30 m resolution. This represents a simulation domain of 1.54 × 10^5 by 9.75 × 10^4 cells (i.e. over 15 billion cells). Additionally, as many as 10 spatial drivers are used per simulation and up to 10 forecast maps (a 50-year projection with 5-year time steps) are created. Forecasts may also involve multiple forecast scenarios with multiple time steps; a recent LTM application (Ray et al., 2011) compared 36 different LTM forecast scenarios for a single regional watershed for 2000 through 2070 at five-year time steps. Thus the number of cells within each simulation can be very large and can easily exceed one trillion. A second challenge presented by modeling large regions is the need to create and manage a large number of files in a variety of formats. For the LTM, this requires managing input programs and tools and output files for GIS and neural network software. For the national-scale LTM described below, we used the GIS to split the simulation into over 20,000 census-designated places (e.g. towns, villages and cities), which were then stored in folders in a hierarchical structure. Thus, standards for file naming and use in automated routines are necessary to properly manage numerous files. Third, since we are using a variety of tools, such as ESRI's ArcGIS Desktop and the Stuttgart Neural Network Simulator (SNNS), each with their own scripting language, a higher-level architecture that automates the control of multiple programs is needed with such models. Fourth, since many executions occur during the simulation, knowing when failure occurs is necessary and thus the status and progress of the simulation needs to be tracked. Finally, given that the LTM contains numerous programs and scripts, a way to manage the processing of jobs becomes necessary. All of these challenges are inherent in what some scientific communities call the big data problem (Lynch, 2008; Hey, 2009; Jacobs, 2009; LaValle et al., 2011). These challenges require solutions that are different from simulations that are executed on single workstations.
In this paper, we describe how we have configured a single workstation version of the LTM to run in a Windows-based High Performance Computing (HPC) environment, a version of the Land Transformation Model we call the LTM-HPC. We summarize the important architectural features of this version of the model, providing flow diagrams of the processing steps and maps of data layers used in the simulation, as well as pseudo-code that illustrates how files and routines are handled. This paper will assist others who are interested in (1) using artificial neural networks to learn about patterns in spatial data where data are large and/or (2) using HPC tools to reconfigure an environmental model composed of a series of programs not linked to a graphical user interface.
We organize the remainder of this paper as follows. Section 2 provides an overview of the original LTM, introducing basic modeling terms and summarizing important features of the High Performance Computing (HPC) environment. Section 3 describes the architecture of the current LTM as it is configured for an HPC. Section 4 describes a specific application of the LTM-HPC run at a national scale for urban change for the conterminous USA. The paper concludes by discussing the potential of the LTM-HPC for simulating fine resolution urban growth patterns at large regional scales, as well as the usefulness of such projections.
2. Brief background
2.1. Overview of the Land Transformation Model (LTM) and Artificial Neural Networks
The LTM (Pijanowski et al., 2000, 2002a, 2009) simulates land use/cover change based on socio-economic and bio-physical factors using an Artificial Neural Network (ANN) and a raster GIS modeling environment. Its previous as well as its current architecture, summarized here, is based on scripts and a sequential series of executable programs that provide considerable flexibility in running the model. There are no graphical user interfaces for the model. At the highest level of organization (Fig. 1), the LTM contains six major components: (1) a data preparation set of routines and procedures, many of which are conducted in a GIS; (2) a series of steps, called pattern recognition, that allow an artificial neural network to learn about patterns in input (drivers of land use change) and output (historical change or no change in land use) data, which are then applied to an independent set of data for which output values are estimated; (3) a sequence of C and GIS based programs for model calibration; (4) an independent assessment of model performance, or model validation, also written in C and GIS; (5) routines used for creating future scenarios of land use; and (6) model products and applications conducted within a GIS.
Fig. 1. Main components of the Land Transformation Model.
In the LTM, we use a multi-layer perceptron (MLP) ANN within the Stuttgart Neural Network Simulator (SNNS) software to approximate an unknown relation between input (e.g. drivers of change) and output (e.g. locations of change and no change). Typical inputs include distance to roads, slope and distance to
previous urban (Fig. 2). Outputs are binary values of change (1) and no change (0) in observed land use maps. Input values are fed through a hidden layer with the number of nodes equal to that of inputs (see Pijanowski et al., 2002a; Mas et al., 2004). The ANN uses learning rules to determine the weights, values for bias and activation function to fit input and output values (Fig. 2) of a dataset. Delta rules are used to adjust all of these values across successive passes of the data; each pass is called a cycle. A mean square error (MSE) is calculated for each cycle from a back propagation algorithm (Bishop, 1995; Dlamini, 2008); values for weights, bias and activation function are then adjusted, and the training is stopped after a global minimum MSE is reached. The process of cycling through is called training. In the LTM, we use a small randomly selected subsample (between 5% and 10%) of the data to train. Applying the weights, values for bias and the activation functions from a training run to another dataset that contains inputs only, in order to estimate output, is referred to as testing.
Fig. 2. Structure of an artificial neural network, with input nodes (e.g. slope, distance to stream, distance to urban, distance to primary road, distance to secondary road), hidden nodes, and output nodes (presence = 1 or absence = 0 of a land use transition).
We conduct a double phase testing with the LTM (Tayyebi et al., 2012) at large scales (e.g. the conterminous USA). The first phase of testing is to use the weights, bias and activation values saved from the training of the subsample and apply the values to the entire dataset. A set of goodness of fit statistics is generated between the predicted and observed maps; we also refer to this testing phase as model calibration. Model calibration also involves a "hold one out" procedure, where each input data layer is held back from the same testing dataset and the goodness of fit of the reduced input models is compared against the full complement model (see Pijanowski et al., 2002a). Thus, for the LTM simulation below, we used five input maps to predict one map of urban change; we held one input out at a time to produce five reduced input models with the same urban change map. These reduced input models are compared with the full complement model of five input maps.
We follow the recommendation of Pontius et al. (2004) and Pontius and Spencer (2005) of validating the model, our second phase of testing, which is done with a different dataset than what is used for the first phase of testing. This independent dataset can be another land use map that was derived from a different source (i.e. a test of generalization) or another year (i.e. a test of predictive ability). It is typical practice (Bishop, 1995) to use different data for training and testing, which is done here.
Forecasting is accomplished using a quantity model developed using per capita land use growth rates and a population growth estimate model (cf. Pijanowski et al., 2002a; Tayyebi et al., 2012). The quantity model can be applied across the entire simulation domain with one quantity estimate per time step, or the quantity model can be applied to smaller spatial units across the simulation domain, as done here.
2.2. High Performance Computing (HPC)
High Performance Computing (HPC) integrates computer architecture design principles, operating system software, heterogeneous hardware components, programs, algorithms and specialized computational approaches to address the handling of tasks not possible or practical with a single computer workstation (Foster and Kesselman, 1997; Foster et al., 2002). A self-contained HPC (i.e. a group of computers) is often referred to as a high performance compute cluster (HPCC) (cf. Cheung and Reeves, 1992; Buyya, 1999; Reinefeld and Lindenstruth, 2001). A main feature of HPCs is the integration of hardware and software systems that are configured to parse large processing jobs into smaller parallel tasks. Hardware resources can be managed at the level of cores (a single processing unit capable of performing work), sockets (a group of cores that have direct access to memory) and nodes (individual servers or computers that contain one or more sockets). The HPCC employed here is specifically configured to control the execution of several batch files, executable programs and scripts for thousands of input and output data files. An HPCC is managed by an administrator, with hardware and software services accessible to many users. HPCCs are systems smaller than supercomputers, although the terms HPC and supercomputer are often used interchangeably.
We controlled all our LTM-HPC programs on the HPCC using the Windows HPC 2008 Server R2 job manager, which has features common to all job managers. The server job manager contains (1) a job scheduler service that is responsible for queuing jobs and tasks, allocating resources, dispatching the tasks to the compute nodes and monitoring the status of the jobs, tasks and nodes; (2) a job description file, configured as an Extensible Markup Language (XML) file, listing both job and task specifications; (3) a job, which is a resource request that is submitted to the job scheduler service that
assigns hardware resources to all tasks; and (4) a task, which is a command (i.e. a program or script) with path names for input and output files and software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs are also submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.
Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; parameters are passed to the HPC Server for project name, user name, job type, types and level of hardware resources for the job, etc. Tasks are listed after, as dependencies to the higher-level job; tasks here contain several parameters (e.g. how the hardware resources are used) and commands (e.g. the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).
Fig. 3. XML job file illustrating the syntax for job parameters, task parameters and task commands.
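As noted in Section 3.7, we generated these XML job description files with Python scripts. The short sketch below illustrates the idea for a hypothetical clip-by-state job; the element and attribute names and the file paths are simplified placeholders and do not reproduce the exact Windows HPC Server 2008 schema.

# generate_job_xml.py -- illustrative sketch only; element and attribute names
# and file paths are simplified placeholders, not the exact HPC Server schema.
import xml.etree.ElementTree as ET

STATE_FIPS = ["01", "04", "05"]  # two-digit state FIPS codes (truncated list)

job = ET.Element("Job", Name="LTM_clip_by_state", MinCores="6", MaxCores="12")
tasks = ET.SubElement(job, "Tasks")
for fips in STATE_FIPS:
    ET.SubElement(
        tasks, "Task",
        Name="clip_" + fips,
        CommandLine="python polygon_clip.py D:\\ltm\\national.asc D:\\ltm\\out\\state_" + fips,
        MinCores="6")

ET.ElementTree(job).write("ltm_clip_job.xml", xml_declaration=True)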
3. Architecture of the LTM and LTM-HPC
3.1. Main components
Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e. larger datasets) and with fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single workstation LTM) and the features that were necessary to reconfigure it for an HPCC. There are several different kinds of programming environments that comprise the single workstation LTM. The first is command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment that comprises the LTM is compiled programs written to accept environment variables as inputs. Programs are written in the C or C++ programming language as standalone EXE files to be executed at the command line. The environment variables for these programs are often the locations and names of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of programming environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.
Fig. 4 shows the sequence of batch routines, programs and scripts that comprise the LTM, currently organized into the six main model components: data preparation, pattern recognition, calibration, validation, forecasting and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use/cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.
Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
3.2. Data preparation
Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4,
item 1); as this is done once for each simulation, this LTM component is not automated. A C program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files, called pattern files (Fig. 4, item 3), given a PAT extension; data are transposed into the ANN structure. Data necessary to process files for the training run for the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shape files (e.g. roads) and raster files (e.g. digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g. the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing, to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g. public parks), open water, or are already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that is converted for use in pattern recognition, model calibration and model forecasting. The coding of locations with a '4' becomes more obvious below under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance each pixel is from the nearest road. All GIS data for use in the LTM are written out as an ASCII flat file (Fig. 5A).
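As an illustration of this step, the following arcpy sketch (assuming ArcGIS 10.x with the Spatial Analyst extension; the file paths are hypothetical) sets the analysis environment from a base land use raster, builds a distance-to-roads driver capped at 250 km (the limit noted in Section 4.1), and writes it out as an ASCII flat file.

# prepare_distance_input.py -- illustrative arcpy sketch; paths are hypothetical.
import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")

base = r"D:\ltm\base_landuse_1990"   # raster that defines cell size, extent and snapping
arcpy.env.snapRaster = base
arcpy.env.extent = arcpy.Describe(base).extent
arcpy.env.cellSize = base            # 30 m cells for the national run
arcpy.env.overwriteOutput = True

# Distance from every cell to the nearest primary road, capped at 250 km
dist = EucDistance(r"D:\ltm\roads_primary.shp", 250000)
dist.save(r"D:\ltm\dist_prim_rd")

# Write the driver out as an ASCII flat file for createpat.exe
arcpy.RasterToASCII_conversion(r"D:\ltm\dist_prim_rd", r"D:\ltm\asc\dist_primary_road.asc")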
Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are estimated by the neural net in this phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).
A key component of the LTM is converting data from a GIS compatible format to a neural network format, called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in the input maps for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g. wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe also checks the headers of all maps to ensure that they are of the same dimensions.
Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.
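The Python sketch below mimics this conversion in simplified form (createpat.exe itself is a C program); the file names are hypothetical, the SNNS pattern-file header is omitted, and the exclusion code of 4, the rescaling by each input's maximum, and the location subsampling follow the description above.

# createpat_sketch.py -- simplified Python analogue of createpat.exe (the real
# program is written in C). Reads ASCII driver grids, drops excluded cells,
# rescales inputs to 0.0-1.0 and writes one training pattern per retained cell;
# the SNNS pattern-file header is omitted for brevity.
import random

def read_ascii(path, header_lines=6):
    """Return a flat list of cell values from an ESRI ASCII grid."""
    with open(path) as f:
        lines = f.readlines()[header_lines:]        # skip ncols/nrows/... header
    return [float(v) for line in lines for v in line.split()]

drivers = [read_ascii(p) for p in ("slope.asc", "dist_road.asc", "dist_urban.asc")]
exclusion = read_ascii("exclusion.asc")             # 4 = transition cannot occur here
change = read_ascii("change_1990_2000.asc")         # 1 = observed change, 0 = no change
maxima = [max(vals) or 1.0 for vals in drivers]     # used to standardize to 0.0-1.0

with open("train.pat", "w") as out:
    for i in range(len(exclusion)):
        if exclusion[i] == 4:                       # excluded: water, protected, urban
            continue
        if random.random() > 0.10:                  # ~10% training subsample by location
            continue
        inputs = [d[i] / m for d, m in zip(drivers, maxima)]
        out.write(" ".join("%.6f" % v for v in inputs) + "\n")
        out.write("%d\n" % int(change[i]))          # output value: change / no change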
3.3. Pattern recognition
SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e. cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010), the values for initial weights (cf. Denoeux and Lengellé, 1993), the names of the pattern files for input and output, the filename containing the network values, and a set of start and stop conditions (e.g. a stop condition can be set if a target MSE or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with the MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias and activation function values are saved every N cycles. An MSE equal to 0.0 is a condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases. Pseudo-code for TRAIN.BAT is:
Pseudo-code for the TRAINBAT is
loadNet(ldquoltmnetrdquo)
loadPattern(ldquotrainpatrdquo)
setInitFunc (ldquoRandomize_Weightsrdquo 10 10)
setShuf 1047298e (TRUE)
initNet()
trainNet()while MSE gt 00 and CYCLES lt500000 do
if CYCLES mod 100 frac14 frac14 0 then
print (CYCLESldquo rdquoMSE)
endif
if CYCLES frac14 frac14 100 then
saveNet (ldquo100netrdquo)
endif
We used the SNNS batchman.exe program to create a suitability map (i.e. a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS compatible format (Fig. 4, item 11). The test PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a RES (or result) file (Fig. 4, item 9). The RES file contains estimates of output created by the neural network. The RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values from the RES files are between 0.0 and 1.0; convert_ascii.exe also stores values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the RES file values by 100,000, so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files through ArcGIS 10.0 are unstable).
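A simplified Python analogue of this conversion is sketched below (the production convert_ascii.exe is a C program); it assumes the RES file carries one estimate per line after a header of known length, takes the grid header from the corresponding input map, and scales the 0.0-1.0 estimates to integers between 0 and 100,000 as described above. The file names and header length are hypothetical.

# convert_res_sketch.py -- simplified analogue of the C RES-to-ASCII converter.
NCOLS = 4000          # tile width; taken from the matching input grid in practice
HEADER_LINES = 3      # number of header lines in the RES file (assumed)

with open("tile_042.res") as res, open("tile_042_prob.asc", "w") as asc:
    # Grid header copied (here, hard-coded) from the corresponding input ASCII map
    asc.write("ncols 4000\nnrows 4000\nxllcorner 0\nyllcorner 0\n"
              "cellsize 30\nNODATA_value -9999\n")
    row = []
    for line in res.readlines()[HEADER_LINES:]:
        if not line.strip():
            continue
        # scale the 0.0-1.0 estimate to an integer 0-100000 suitability value
        row.append(str(int(float(line.split()[-1]) * 100000)))
        if len(row) == NCOLS:                 # write one grid row per output line
            asc.write(" ".join(row) + "\n")
            row = []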
We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and compared to having all inputs in the training. Thus, if we start with a five input variable neural network model, we hold one out at a time, create calibration time step maps for each, and save error files over training cycles for each of the reduced input variable models.
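A minimal sketch of how these reduced-input training sets can be assembled is shown below; write_pattern_file is a hypothetical helper standing in for the createpat-style conversion sketched in Section 3.2, and the driver file names are placeholders.

# hold_one_out_sketch.py -- build the full and reduced ("drop one out") training sets.
DRIVERS = ["dist_highway.asc", "dist_road.asc", "dist_urban.asc",
           "dist_stream.asc", "slope.asc"]

def write_pattern_file(driver_files, out_name):
    """Hypothetical wrapper around the createpat-style conversion shown earlier."""
    pass

write_pattern_file(DRIVERS, "train_full.pat")                  # full five-driver model
for held_out in DRIVERS:
    reduced = [d for d in DRIVERS if d != held_out]            # drop one driver at a time
    write_pattern_file(reduced, "train_no_" + held_out.replace(".asc", "") + ".pat")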
3.4. Calibration
For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is the mean square error (MSE), which is plotted over training cycles to ensure that the
neural network settles at a global minimum value. MSE is calculated as the mean squared difference between the estimates produced by the neural network (range 0.0-1.0) and the observed values of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics is those created from a calibration map. A calibration map is also constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed, along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) generates values 0 through 4, with 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent the different errors (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric (the proportion of correctly predicted land use changes to the number of observed land use changes) or PCM (Pijanowski et al., 2002a); (2) sensitivity and specificity (which reflect the rates of false negatives and false positives, respectively); and (3) scaleable PCM values across different window sizes.
Fig. 6 shows how scaleable PCM values are calculated using a 01234-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g. 5 cells by 5 cells), a pair of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale and window; the number of 3s is incremented by one for every pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B) and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered. This moving N × N window is passed across the entire simulation area and the final number of 3s recorded (Fig. 6C). The window size is then incremented by 2 (i.e. the next window size after a 5 × 5 would be a 7 × 7) and, after all of the windows are considered in the map, the process is repeated
(note that the number of 3s is reset to the number of 3s in the entire calibration map), and the number of 3s is saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced input model, in order to determine the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.
Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
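For clarity, the following Python sketch implements the scaleable-window PCM described above under simplifying assumptions: the calibration map is assumed to have been exported as an ASCII grid, the window steps one cell at a time, and the pairing bookkeeping is handled by zeroing out paired 1s and 2s (the production routine runs per tile under the HPC job manager and differs in implementation).

# scaleable_pcm_sketch.py -- simplified moving-window PCM calculation.
# Calibration map codes: 0 correct no-change, 1 false negative, 2 false positive,
# 3 correct change, 4 excluded.
import numpy as np

def scaleable_pcm(cal_map, window):
    """Return PCM (%) for one odd window size on a 2D calibration-map array."""
    cal = cal_map.copy()                       # paired 1s/2s are consumed as we go
    observed = np.count_nonzero((cal == 1) | (cal == 3))   # all observed change cells
    hits = np.count_nonzero(cal == 3)          # start from the true positives
    rows, cols = cal.shape
    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            block = cal[r:r + window, c:c + window]
            ones = np.argwhere(block == 1)
            twos = np.argwhere(block == 2)
            pairs = min(len(ones), len(twos))
            if pairs:
                hits += pairs                  # each 1-2 pair counts as a correct prediction
                block[ones[:pairs, 0], ones[:pairs, 1]] = 0   # do not count them again
                block[twos[:pairs, 0], twos[:pairs, 1]] = 0
    return 100.0 * hits / observed if observed else 0.0

cal_map = np.loadtxt("calibration_01234.asc", skiprows=6)
for w in range(3, 12, 2):                      # 3x3, 5x5, ..., 11x11 windows
    print(w, scaleable_pcm(cal_map, w))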
The final step for calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered the best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or dropping drivers which reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5. Validation
We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model to compare to a simulated map. Validation metrics (Fig. 4, item 21) include the same as those used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that, because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g. between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).
3.6. Forecasting
We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, like government units, watersheds or ecoregions, or any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g. urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g. federal information processing systems (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e. a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with high rank are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created that would apply additional exclusionary rules to create an alternative scenario.
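The following Python sketch shows the essence of this allocation step under simplifying assumptions (the production program is written in C): the quantity table is read from a hypothetical CSV of unit IDs and cell counts, each unit's suitability map is assumed to be an ASCII grid, and ASCII header handling is omitted.

# allocate_growth_sketch.py -- simplified analogue of the C prediction program:
# the top-ranked suitability cells in each spatial unit are converted to urban
# until the quantity model's cell count for that unit is met.
import csv
import numpy as np

quantity = {}                                   # unit ID -> cells to transition
with open("quantity_2010.csv") as f:            # hypothetical quantity-model output
    for unit_id, n_cells in csv.reader(f):
        quantity[unit_id] = int(n_cells)

for unit_id, n_cells in quantity.items():
    prob = np.loadtxt("prob_%s.asc" % unit_id, skiprows=6)   # suitability, 0-100000
    out = np.zeros(prob.size, dtype=int)
    order = np.argsort(prob.ravel())[::-1]      # rank cells, highest suitability first
    out[order[:n_cells]] = 1                    # grow the n_cells highest-ranked cells
    np.savetxt("forecast_2010_%s.asc" % unit_id,
               out.reshape(prob.shape), fmt="%d")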
Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g. Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; or Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
3.7. HPC job configuration
We developed a coding schema for the purposes of running the simulation across multiple locations. We used a standard numbering system from the Federal Information Processing Systems (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns states a two-digit code and a county in those states a three-digit code. A specific county is thus given a five-digit integer value (e.g. 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code, two digits for the state and five digits for the place (e.g. 1882862 for the city of West Lafayette, Indiana).
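A small sketch of how such FIPS-based unit codes can drive file and directory naming is given below; the directory layout shown is a hypothetical simplification of the structure in Fig. 9.

# fips_paths_sketch.py -- illustrative FIPS-based file/directory naming helper.
import os

def place_path(root, state_fips, place_fips, year, kind="asc"):
    """Build a path such as <root>/output/asc/state_18/1882862/2010.asc."""
    unit_code = state_fips + place_fips            # seven-digit census place code
    return os.path.join(root, "output", kind,
                        "state_" + state_fips, unit_code,
                        "%d.%s" % (year, kind))

# West Lafayette, Indiana: state 18, place 82862 -> unit code 1882862
print(place_path(r"D:\ltm", "18", "82862", 2010))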
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g. mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any of the places would result in the one large job stopping, and we would then need to resume the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs, where the number of jobs and associated XML files is still manageable. A failure of one census place would then require less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator to assign names for input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another job for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide the job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows jobs manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML for a region (state) with its status (item 2). Core resources are shown (Fig. 7, item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.
Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with the status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and
name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
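The arcpy sketch below illustrates the mosaicking step for one state (assuming ArcGIS 10.x); the paths are hypothetical, and in the LTM-HPC the list of tile files and the state code would come from the XML jobs file rather than being hard-coded.

# mosaic_by_state_sketch.py -- simplified analogue of polygon_mosaic.py.
import glob
import arcpy

arcpy.env.overwriteOutput = True

state = "18"                                              # Indiana, as one example task
tiles = glob.glob(r"D:\ltm\asc\state_%s\*_prob.asc" % state)

rasters = []
for asc in tiles:
    tif = asc.replace(".asc", ".tif")
    arcpy.ASCIIToRaster_conversion(asc, tif, "INTEGER")   # suitability stored as integers
    rasters.append(tif)

# Mosaic the per-place suitability rasters into one state-level raster
arcpy.MosaicToNewRaster_management(rasters, r"D:\ltm\mosaics",
                                   "prob_state_%s.tif" % state,
                                   "", "32_BIT_UNSIGNED", "", 1)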
4. Execution of the LTM-HPC
4.1. Hardware and software description
We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing of about a dozen steps occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, with one used for cluster communication and the other for external cluster communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e. the head node) that controls other servers (i.e. compute nodes), which read and write data from a data server. A cluster is the top-level unit, which
is composed of nodes, or single physical or logical computers with one or more cores that include one or more processors. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 Gigabit bandwidth.
The data storage server was composed of 24 two-terabyte 7,200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.
Based on the results of the file number per folder and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
Fig. 9. Directory structure for the LTM-HPC simulation.
4.2. Preliminary tests
The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g. distance to urban, etc.), crashed the run. To overcome this issue, we divided the country into grids that produced files that SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time intensive process. We used Python scripts to overcome this issue, and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by restricting each task to 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3. Data preparation for national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance that each pixel was away from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local government and were thus deemed to be of high quality. Original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places is distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS designated census place value.
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first phase testing (i.e. calibration). We also used the LTM-clip.py Python script to create subsamples for second phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
4.4. Pattern recognition simulations for the national model
We presented a training file with 284,477 cases (i.e. records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e. the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500.net file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where closer
to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integer.
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4,000 rows and 4,000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per pixel basis, and thus the processing unit did not affect the output value.
4.5. Calibration of the national simulation
We trained on six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped out one input variable from the full input model. We saved the MSE at each 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full input variable model (Fig. 11).
Fig. 11. Drop one out percent difference in MSE from the full driver model.
Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in values for the activation function and biases. At one point, around 7,000 cycles, the model does better (i.e. the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop one out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop one out calibration that (1) all five variables contribute in a positive way toward the goodness of fit and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
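A minimal sketch of this comparison is given below; it assumes the MSE histories saved every 100 cycles during training have been exported to simple two-column text files (cycle, MSE), which is a hypothetical layout used only for illustration.

# mse_percent_diff_sketch.py -- percent difference in MSE between each reduced
# ("drop one out") model and the full five-driver model at each 100-cycle checkpoint.
import numpy as np

full = np.loadtxt("mse_full.txt")                        # columns: cycle, MSE
for name in ("no_highway", "no_road", "no_urban", "no_stream", "no_slope"):
    reduced = np.loadtxt("mse_%s.txt" % name)
    pct_diff = 100.0 * (reduced[:, 1] - full[:, 1]) / full[:, 1]
    print(name, "difference near final cycle: %.2f%%" % pct_diff[-1])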
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g. Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g. the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g. spatial units).
4.6. Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so this assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the largest window size (i.e. 3 km by 3 km) and the PCM values merged
with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g. Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
4.7. Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We considered each state as a job and included all the
polygons within the state as different tasks to create forecast maps
of each polygon We embedded the path of prediction code and
number of urban cells to grow for each polygon within
XML_Pred_BASE job 1047297le We ran XML_Pred_BASE job 1047297le for each
state on HPC to convert the probability map to forecast map for
each polygon Then we ran the Mosaic_Python script on the HPC
using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at
the polygon level to create forecast maps at state levelSimilarly we
ran Mosaic_Python script on HPC using XML_Pred_Mosaic_Na-
tional to mosaic prediction pieces at state level to create a national
forecast map HPC also enabled us to export error messages in error
1047297les so that if any of tasks fail in a job using standard out and
standard error 1047297les to have records of program did during execu-
tion We also embedded the path of standard out and standard
error 1047297les in the tasks of the XML jobs 1047297le
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to topographic limitations within the large metropolitan area. Dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.
5 Discussion
We presented an overview of a single workstation land change model that has been converted to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger sized land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.
5.1 Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array; this was needed despite the large amount of memory contained in the HPC.
The large number of files was managed using a standard file naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.
The high-level architecture was designed following the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation, and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted. Jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times but will eventually fail if the problem continues. (3) The run time for a job or task expired. The job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed. A frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.
5.2 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time consuming and error prone process; in some cases we needed to manually mosaic a few areas as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
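As a rough illustration of this mosaicking step, the sketch below runs the Mosaic To New Raster geoprocessing tool over a set of per-polygon prediction rasters. It is a minimal sketch in modern Python, not the project's script; the folder layout, file names, pixel type, and overlap rule are assumptions.

# Minimal sketch (assumed paths and raster properties) of mosaicking per-polygon
# prediction rasters into a single state-level forecast map with arcpy.
import glob
import arcpy

pieces = glob.glob(r"D:\ltm_hpc\predictions\IN\*.tif")  # hypothetical folder of polygon pieces
out_folder = r"D:\ltm_hpc\forecasts"
out_name = "IN_2050_forecast.tif"

# MosaicToNewRaster(input_rasters, output_location, raster_dataset_name_with_extension,
#                   coordinate_system, pixel_type, cellsize, number_of_bands, mosaic_method)
arcpy.MosaicToNewRaster_management(
    pieces,            # list of rasters to mosaic
    out_folder,
    out_name,
    "",                # keep the coordinate system of the inputs
    "8_BIT_UNSIGNED",  # binary change maps fit in 8 bits (assumption)
    30,                # 30 m cell size
    1,                 # single band
    "MAXIMUM",         # prefer change (1) over no change (0) where pieces overlap (assumption)
)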
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; deviations from this ideal scaling are caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include the reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
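Stated as a simple idealization (our interpretation of the observed scaling, not a measured performance model from this study), halving the run time with every doubling of cores corresponds to near-linear speedup:

\[ T_{\mathrm{HPC}}(N) \approx \frac{T_{\mathrm{serial}}}{N}, \qquad S(N) = \frac{T_{\mathrm{serial}}}{T_{\mathrm{HPC}}(N)} \approx N \]

where \(T_{\mathrm{serial}}\) is the single workstation run time and \(N\) is the number of cores; uneven per-task file sizes cause departures from this ideal.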
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3 Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output that has multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively small extent and fine resolution simulations, may hinder the
ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002), how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008), how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011), how certain land uses such as urban and agriculture increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b), and how land use patterns affect biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially those vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common, supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4 Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors were greater in the eastern portion of the United States for quantity; patterns of location error were different from those of quantity, being lower in the east (Fig. 12). Location of errors could be important too if they affect the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6 Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focus on efficiently handling the challenges raised by the nature of large datasets and show how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales such as continents. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:
(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for each unit in the Python environment.
(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.
References
Adeloye AJ, Rustum R, Kariyama ID, 2012. Neural computing modeling of the reference crop evapotranspiration. Environ Model Softw 29, 61-73.
Anselme B, Bousquet F, Lyet A, Etienne M, Fady B, Le Page C, 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ Model Softw 25 (11), 1385-1398.
Bennett ND, Croke BFW, Guariso G, Guillaume JHA, Hamilton SH, Jakeman AJ, Marsili-Libelli S, Newham LTH, Norton JP, Perrin C, Pierce SA, Robson B, Seppelt R, Voinov AA, Fath BD, Andreassian V, 2013. Environ Model Softw 40, 1-20.
Bertram P, Stadler-Salt N, Horvatin P, Shear H, 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ Monit Assess 81 (1-3), 27-33.
Bishop CM, 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop CM, 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan GB, 2008. Forests and climate change: forcings, feedbacks and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt DF, Hyndman DW, Pijanowski BC, Long DT, 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton A, Kilsby C, Fowler H, Cowpertwait P, O'Connell P, 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ Model Softw 23 (12), 1356-1369.
Buyya R (Ed), 1999. High Performance Cluster Computing: Architectures and Systems, vol 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani M, Bergez JE, Monod H, 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ Model Softw 27-28, 15-22.
Chapman T, 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ Model Softw 13 (3-4), 317-324.
Cheung AL, Reeves AP, 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke KC, Gazulis N, Dietzel C, Goldstein NC, 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp 413-425.
Cox PM, Betts RA, Jones CD, Spall SA, Totterdell IJ, 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale VH, 1997. The relationship between land-use change and climate change. Ecol Appl 7 (3), 753-769.
Davis AY, Pijanowski BC, Robinson KD, Kidwell PB, 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc Urban Plan 96 (2), 68-77.
Davis AY, Pijanowski BC, Robinson K, Engel B, 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux T, Lengellé, 1993. Initializing back propagation networks with prototypes. Neural Netw 6, 351-363.
Dietzel C, Clarke K, 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput Environ Urban Syst 30 (1), 78-101.
Dietzel C, Clarke KC, 2007. Toward optimal calibration of the SLEUTH land use change model. Trans GIS 11 (1), 29-45.
Dixon RK, Winjum JK, Andrasko KJ, Lee JJ, Schroeder PE, 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim Change 27 (1), 71-92.
Dlamini W, 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ Model Softw 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei S, Kong N, Stinger J, Bowker D, 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder K, Jose S, Singh H, Batish D (Eds), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp 59-70.
Fitzpatrick M, Long D, Pijanowski B, 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl Biogeochem 22, 1825-1840.
Foster I, Kesselman C, 1997. Globus: a metacomputing infrastructure toolkit. Int J Supercomput Appl 11 (2), 115-128.
Foster DR, Hall B, Barry S, Clayden S, Parshall T, 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J Biogeogr 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No 53 / IHDP Report No 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm NB, Redman CL, 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst 7 (3), 199-213.
Guo LB, Gifford RM, 2002. Soil carbon stocks and land use change: a meta analysis. Glob Change Biol 8 (4), 345-360.
Herold M, Goldstein NC, Clarke KC, 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens Environ 86 (3), 286-302.
Herold M, Couclelis H, Clarke KC, 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput Environ Urban Syst 29 (4), 369-399.
Hey AJ, 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs A, 2009. The pathologies of big data. Commun ACM 52 (8), 36-44.
Kampe TU, Johnson BR, Kuester M, Keller M, 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J Appl Remote Sens 4 (1), 043510.
Kaye JP, Groffman PM, Grimm NB, Baker LA, Pouyat RV, 2006. A distinct urban biogeochemistry. Trends Ecol Evol 21 (4), 192-199.
Kilsby C, Jones P, Burton A, Ford A, Fowler H, Harpham C, James P, Smith A, Wilby R, 2007. A daily weather generator for use in climate change studies. Environ Model Softw 22 (12), 1705-1719.
Lagabrielle E, Botta A, Daré W, David D, Aubert S, Fabricius C, 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ Model Softw 25 (11), 1413-1427.
Lambin EF, Geist HJ (Eds), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle S, Lesser E, Shockley R, Hopkins MS, Kruschwitz N, 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag Rev 52 (2), 21-32.
Lei Z, Pijanowski BC, Alexandridis KT, Olson J, 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe L, Martínez-Vilalta J, Piñol J, 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ Model Softw 26 (8), 1028-1040.
Lynch C, 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas JF, Puig H, Palacio JL, Sosa AA, 2004. Modeling deforestation using GIS and artificial neural networks. Environ Model Softw 19 (5), 461-471.
MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington DC.
Merritt WS, Letcher RA, Jakeman AJ, 2003. A review of erosion and sediment transport models. Environ Model Softw 18 (8-9), 761-799.
Mishra V, Cherkauer K, Niyogi D, Ming L, Pijanowski B, Ray D, Bowling L, 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int J Climatol 30, 2025-2044.
Moore N, Torbick N, Lofgren B, Wang J, Pijanowski B, Andresen J, Kim D, Olson J, 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int J Climatol 30 (3), 1954-1969.
Moore N, Alargaswamy G, Pijanowski B, Thornton P, Lofgren B, Olson J, Andresen J, Yanda P, Qi J, 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim Change, http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson J, Alagarswamy G, Andresen J, Campbell D, Davis A, Ge J, Huebner M, Lofgren B, Lusch D, Moore N, Pijanowski B, Qi J, Thornton P, Torbick N, Wang J, 2008. Integrating diverse methods to understand climate-land interactions in east Africa. GeoForum 39 (2), 898-911.
Pekin BK, Pijanowski BC, 2012. Global land use intensity and the endangerment status of mammal species. Divers Distrib 18 (9), 909-918.
Peralta J, Li X, Gutierrez G, Sanchis A, 2010, July. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. Neural Networks (IJCNN), 1-8.
Pérez-Vega A, Mas JF, Ligmann A, 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ Model Softw 29, 11-23.
Pickett ST, Burch Jr WR, Dalton SE, Foresman TW, Grove JM, Rowntree R, 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst 1 (4), 185-199.
Pielke RA, 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski BC, Long DT, Gage SH, Cooper WE, 1997, June. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
and Petrova, 2010; Anselme et al., 2010; Pérez-Vega et al., 2012). Large-scale LUCC models are needed to understand regional, continental to global scale problems like climate change (Chapman, 1998; Kilsby et al., 2007; Merritt et al., 2003), human impacts to ecosystem services (MEA, 2005), alterations to carbon sequestration (Post and Kwon, 2000), and dynamics of biogeochemical cycling (Boutt et al., 2001; Pijanowski et al., 2002b; Wayland et al., 2002; Turner et al., 2003; GLP, 2005; Fitzpatrick et al., 2007; Anselme et al., 2010; Carpani et al., 2012). One of the characteristics of all LTM applications to date is that the size of the simulation has been small enough to run on a single advanced workstation. However, as models originally designed for local to regional simulations are needed at continental scales or larger, a redesign of single workstation models such as the LTM becomes necessary. Indeed, recent calls by the scientific and policy community for the development and use of large-scale models (e.g., the Earth Systems Science community; e.g., Randerson et al., 2009; Xue et al., 2010; Lagabrielle et al., 2010) underscore the importance of focusing attention on large-scale environmental modeling.
Increasing the size of any computer simulation like the LTM has several challenges (Herold et al., 2003, 2005; Lei et al., 2005; Dietzel and Clarke, 2007; Clarke et al., 2007; Adeloye et al., 2012). First is the need to manage large datasets that are used as inputs and are output by the model (Yang, 2011; Loepfe et al., 2011). The national-scale application of the LTM that we present here simulates LUCC for the lower 48 states of the USA at 30 m resolution. This represents a simulation domain of 1.54 x 10^5 by 9.75 x 10^4 cells (i.e., over 15 billion cells). Additionally, as many as 10 spatial drivers are used per simulation and up to 10 forecast maps (a 50-year projection with 5-year time steps) are created. Forecasts may also involve multiple forecast scenarios with multiple time steps; a recent LTM application (Ray et al., 2011) compared 36 different LTM forecast scenarios for a single regional watershed for 2000 through 2070 at five-year time steps. Thus the number of cells within each simulation can be very large and can easily exceed one trillion. A second challenge presented by modeling large regions is the need to create and manage a large number of files in a variety of formats. For the LTM this requires managing input, programs and tools, and output files for GIS and neural network software. For the national-scale LTM described below, we used the GIS to split the simulation into over 20,000 census-designated places (e.g., towns, villages, and cities), which were then stored in folders in a hierarchical structure. Thus standards for file naming and use in automated routines are necessary to properly manage numerous files. Third, since we are using a variety of tools, such as ESRI's ArcGIS Desktop and the Stuttgart Neural Network Simulator (SNNS), each with their own scripting language, a higher-level architecture that automates the control of multiple programs is needed with such models. Fourth, since many executions occur during the simulation, knowing when failure occurs is necessary, and thus the status and progress of the simulation needs to be tracked. Finally, given that the LTM contains numerous programs and scripts, a way to manage the processing of jobs becomes necessary. All of these challenges are inherent in what some scientific communities call the big data problem (Lynch, 2008; Hey, 2009; Jacobs, 2009; LaValle et al., 2011). These challenges require solutions that are different from simulations that are executed on single workstations.
In this paper we describe how we have configured a single workstation version of the LTM to run in a Windows-based High Performance Computing (HPC) environment, a version of the Land Transformation Model we call the LTM-HPC. We summarize the important architectural features of this version of the model, providing flow diagrams of the processing steps, maps of data layers used in the simulation, as well as pseudo-code that illustrates how files and routines are handled. This paper will assist others who are interested in (1) using artificial neural networks to learn about patterns in spatial data where data are large and/or (2) using HPC tools to reconfigure an environmental model composed of a series of programs not linked to a graphical user interface.
We organize the remainder of this paper as follows. Section 2 provides an overview of the original LTM, introducing basic modeling terms, summarizing important features of the high performance computing (HPC) environment and the architecture of the current LTM as it is configured for an HPC. Section 3 describes a specific application of the LTM-HPC run at a national scale for urban change for the conterminous USA. The paper concludes by discussing the potential of the LTM-HPC for simulating fine resolution urban growth patterns at large regional scales as well as the usefulness of such projections.
2 Brief background

2.1 Overview of the Land Transformation Model (LTM) and Artificial Neural Networks
The LTM (Pijanowski et al., 2000, 2002a, 2009) simulates land use/cover change based on socio-economic and bio-physical factors using an Artificial Neural Network (ANN) and a raster GIS modeling environment. Its previous as well as its current architecture, summarized here, is based on scripts and a sequential series of executable programs that provide considerable flexibility in running the model. There are no graphical user interfaces for the model. At the highest level of organization (Fig. 1) the LTM contains six major components: (1) a data preparation set of routines and procedures, many of which are conducted in a GIS; (2) a series of steps called pattern recognition that allow an artificial neural network to learn about patterns in input (drivers of land use change) and output (historical change or no change in land use) data, which are then applied to an independent set of data and output values are estimated; (3) a sequence of C and GIS based programs for model calibration; (4) an independent assessment of model performance, or model validation, also written in C and GIS; (5) routines used for creating future scenarios of land use; and (6) model products and applications conducted within a GIS.
Fig. 1. Main components of the Land Transformation Model.

In the LTM we use a multi-layer perceptron (MLP) ANN within the Stuttgart Neural Network Simulator (SNNS) software to approximate an unknown relation between input (e.g., drivers of change) and output (e.g., locations of change and no change). Typical inputs include distance to roads, slope, and distance to
previous urban (Fig. 2). Outputs are binary values of change (1) and no change (0) in observed land use maps. Input values are fed through a hidden layer with the number of nodes equal to that of inputs (see Pijanowski et al., 2002a; Mas et al., 2004). The ANN uses learning rules to determine the weights, values for bias, and activation function to fit input and output values (Fig. 2) of a dataset. Delta rules are used to adjust all of these values across successive passes of the data; each pass is called a cycle. A mean square error is calculated for each cycle from a back propagation algorithm (Bishop, 1995; Dlamini, 2008); values for weights, bias, and activation function are then adjusted and the training stopped after a global minimum MSE is reached. The process of cycling through is called training. In the LTM we use a small randomly selected subsample (between 5% and 10%) of the data to train. Applying the weights, values for bias, and the activation functions from a training run to another dataset that contains inputs only, in order to estimate output, is referred to as testing. We conduct a double-phase testing with the LTM (Tayyebi et al., 2012) at large scales (e.g., the conterminous USA). The first phase of testing is to use the weights, bias, and activation values saved from the training of the subsample and apply the values to the entire dataset. A set of goodness of fit statistics is generated between the predicted and observed maps; we also refer to this testing phase as model calibration. Model calibration also involves a "hold one out" procedure where each input data layer is held back from the same testing dataset and the goodness of fit of the reduced input models is compared against the full complement model (see Pijanowski et al., 2002a). Thus, for the LTM simulation below, we used five input maps to predict one map of urban change. We held one out at a time to produce five input map models with the same urban change map. These reduced input models are compared with the full complement model of six input maps.
We follow the recommendation of Pontius et al. (2004) and Pontius and Spencer (2005) of validating the model, our second phase of testing, which is done with a different dataset than what is used for the first phase of testing. This independent dataset can be another land use map that was derived from a different source (i.e., a test of generalization) or another year (i.e., a test of predictive ability). It is typical practice (Bishop, 1995) to use different data for training and testing, which is done here.
Forecasting is accomplished using a quantity model developed using per capita land use growth rates and a population growth estimate model (cf. Pijanowski et al., 2002a; Tayyebi et al., 2012). The quantity model can be applied across the entire simulation domain with one quantity estimate per time step, or the quantity model can be applied to smaller spatial units across the simulation domain, as done here.
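As a minimal sketch of such a quantity model (an assumed form, not the exact formulation of Tayyebi et al., 2012), the number of cells to transition in a spatial unit over one time step can be written as:

\[ U_{t+\Delta t} = r \, P_{t+\Delta t}, \qquad \Delta n = \max\!\left(0, \frac{U_{t+\Delta t} - U_t}{a}\right) \]

where \(r\) is the per capita urban land area, \(P\) is the projected population of the unit, \(U\) is urban area, \(a\) is the cell area (900 m^2 for 30 m cells), and \(\Delta n\) is the number of cells the suitability map is asked to convert in that time step.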
2.2 High Performance Computing (HPC)
High Performance Computing (HPC) integrates computer architecture design principles, operating system software, heterogeneous hardware components, programs, algorithms, and specialized computational approaches to address the handling of tasks not possible or practical with a single computer workstation (Foster and Kesselman, 1997; Foster et al., 2002). A self-contained HPC (i.e., a group of computers) is often referred to as a high performance compute cluster (HPCC) (cf. Cheung and Reeves, 1992; Buyya, 1999; Reinefeld and Lindenstruth, 2001). A main feature of HPCs is the integration of hardware and software systems that are configured to parse large processing jobs into smaller parallel tasks. Hardware resources can be managed at the level of cores (a single processing unit capable of performing work), sockets (a group of cores that have direct access to memory), and nodes (individual servers or computers that contain one or more sockets). The HPCC employed here is specifically configured to control the execution of several batch files, executable programs, and scripts for thousands of input and output data files. An HPCC is managed by an administrator, with hardware and software services accessible to many users. HPCCs are systems smaller than supercomputers, although the terms HPC and supercomputer are often used interchangeably.
We controlled all our LTM-HPC programs on the HPCC using the Windows HPC 2008 Server R2 job manager, which has features common to all job managers. The server job manager contains (1) a job scheduler service that is responsible for queuing jobs and tasks, allocating resources, dispatching the tasks to the compute nodes, and monitoring the status of the jobs, tasks, and nodes; (2) a job description file configured as an Extensible Markup Language (XML) file listing both job and task specifications; (3) a job, which is a resource request that is submitted to the job scheduler service and that assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs are also submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.

Fig. 2. Structure of an artificial neural network. Inputs (slope, distance to stream, distance to urban, distance to primary road, distance to secondary road) pass through hidden nodes to an output node indicating presence (1) or absence (0) of a land use transition; weights, bias, and activation function values are assigned to estimate output, errors are estimated from observed data (back propagation), and a pass forward and back is called a cycle or epoch.
Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; parameters are passed to the HPC Server for project name, user name, job type, types and level of hardware resources for the job, etc. Tasks are listed after, as dependencies of the higher-level job; tasks here contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).
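To give a concrete flavor of how such a job description can be assembled programmatically, the short Python sketch below emits one task per spatial unit. The element and attribute names are illustrative assumptions rather than the exact Windows HPC Server 2008 R2 schema, and the script and folder names are hypothetical.

# Minimal sketch (assumed element and attribute names) of generating a per-state job
# description with one task per polygon, including stdout/stderr file paths.
import xml.etree.ElementTree as ET

def build_job(state_fips, polygon_ids):
    job = ET.Element("Job", Name="LTM_Pred_%s" % state_fips, Project="LTM-HPC",
                     MinCores="1", MaxCores="64")
    tasks = ET.SubElement(job, "Tasks")
    for pid in polygon_ids:
        workdir = r"D:\ltm_hpc\pred\%s\%s" % (state_fips, pid)  # hypothetical layout
        ET.SubElement(
            tasks, "Task",
            Name="pred_%s" % pid,
            CommandLine="python pred_convert.py %s quantity.csv" % pid,  # hypothetical script
            WorkDirectory=workdir,
            StdOutFilePath=workdir + r"\stdout.txt",
            StdErrFilePath=workdir + r"\stderr.txt",
        )
    return ET.ElementTree(job)

# Example: write the job file for one state (FIPS 18) with three polygon tasks.
build_job("18", ["1801", "1802", "1803"]).write("XML_Pred_18.xml",
                                                xml_declaration=True,
                                                encoding="utf-8")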
3 Architecture of the LTM and LTM-HPC

3.1 Main components
Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single workstation LTM) and the features that were necessary to reconfigure it for an HPCC. There are several different kinds of programming environments that comprise the single workstation LTM. The first are command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment that comprises the LTM are compiled programs written to accept environment variables as inputs. Programs are written in the C or C++ programming language as standalone EXE files to be executed at the command line. The environment variables for these programs are often the location and name of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of program environment is the script environment, written to execute application-specific tools. Application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs, and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.
Fig. 4 shows the sequence of batch routines, programs, and scripts that comprise the LTM, currently organized into the six main model components: data preparation, pattern recognition, calibration, validation, forecasting, and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use/cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.
3.2 Data preparation

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4, item 1); as this is done once for each simulation, this LTM component is not automated.

Fig. 3. XML job file illustrating the syntax for job parameters, task parameters, and task commands.

Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

A C program called createpat.exe
(Fig. 4, item 2) is used to convert spatial data to neural net files, called pattern files (Fig. 4, item 3), given a PAT extension; data are transposed into the ANN structure. Data necessary to process files for the training run for the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shapefiles (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water, or are already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that are converted for use in pattern recognition, model calibration, and model forecasting. The coding of locations with a '4' becomes more obvious below under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance each pixel is from the nearest road. All GIS data for use in the LTM are written out as an ASCII flat file (Fig. 5A).
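This per-unit clip-and-convert step can be pictured with a short Python fragment. The sketch below is illustrative only (modern Python, hypothetical paths, an assumed FIPS field, and an example FIPS code), not the project's actual ArcGIS 10.0 preparation scripts.

# Minimal sketch (assumed paths, field names, and naming convention) of clipping a
# national driver raster to one spatial unit and exporting it as an ASCII grid.
import os
import arcpy

national_raster = r"D:\ltm_hpc\drivers\dist_to_roads.tif"  # hypothetical input
units = r"D:\ltm_hpc\boundaries\counties.shp"              # hypothetical unit polygons
out_root = r"D:\ltm_hpc\prepared"

fips = "18157"  # unique unit identifier (example FIPS code)
arcpy.MakeFeatureLayer_management(units, "unit_lyr", "FIPS = '%s'" % fips)

out_dir = os.path.join(out_root, fips)
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

clipped = os.path.join(out_dir, "dist_to_roads_clip.tif")
# Clip to the unit polygon; "ClippingGeometry" keeps only cells inside the boundary.
arcpy.Clip_management(national_raster, "#", clipped, "unit_lyr", "-9999",
                      "ClippingGeometry")

# Convert the clipped raster to the flat ASCII format expected by createpat.exe.
arcpy.RasterToASCII_conversion(clipped, os.path.join(out_dir, "dist_to_roads.asc"))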
Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input, and the output values are estimated by the neural net in this phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).
A key component of the LTM is converting data from a GIS compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in the input maps for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe also checks the headers of all maps to ensure that they are of the same dimensions.
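The standardization step can be pictured with a few lines of array code. This is a schematic re-expression of what createpat.exe does, using NumPy and assumed file names rather than the C program itself.

# Schematic sketch (not the actual createpat.exe code) of standardizing an input
# grid to 0.0-1.0 and masking out exclusionary-zone cells before writing patterns.
import numpy as np

dist_to_roads = np.loadtxt("dist_to_roads.asc", skiprows=6)  # ASCII grid (assumed path)
exclusion = np.loadtxt("exclusion.asc", skiprows=6)          # 4 = excluded location

max_value = dist_to_roads.max()          # stored in the pattern input file
scaled = dist_to_roads / max_value       # ANN inputs must lie between 0.0 and 1.0

usable = exclusion != 4                  # drop cells that cannot transition
pattern_inputs = scaled[usable].ravel()  # one input value per usable cell
print(pattern_inputs.min(), pattern_inputs.max(), pattern_inputs.size)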
3.3 Pattern recognition
SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010), the values for initial weights (cf. Denoeux and Lengellé, 1993), the names of the pattern files for input and output, the filename containing the network values, and a set of start and stop conditions (e.g., a stop condition can be set if an MSE or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias, and activation function values are saved every N number of cycles. An MSE equal to 0.0 is a condition where the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases.
Pseudo-code for TRAIN.BAT is:

loadNet("ltm.net")
loadPattern("train.pat")
setInitFunc("Randomize_Weights", 1.0, 1.0)
setShuffle(TRUE)
initNet()
trainNet()
while MSE > 0.0 and CYCLES < 500000 do
  if CYCLES mod 100 == 0 then
    print(CYCLES, " ", MSE)
  endif
  if CYCLES == 100 then
    saveNet("100.net")
  endif
  trainNet()
endwhile
We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS compatible format (Fig. 4, item 11). The test.PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias, and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a RES (or result) file (Fig. 4, item 9). The RES file contains estimates of output created by the neural network. The RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values from the RES files are between 0.0 and 1.0; convert_ascii.exe also stores values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the RES file values by 100,000, so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files in ArcGIS 10.0 are unstable).
We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and compared to having all inputs in the training. Thus, if we start with a five input variable neural network model, we hold one out at a time, create calibration time step maps for each, and save error files over training cycles for each of the reduced input variable models.
3.4 Calibration

For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations) we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value.

Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.

MSE is calcu-
lated as the difference between the estimate produced by the neural network (range 0.0-1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics is created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) generates values 0 through 4, with 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent different errors (omission and commission, or false negative and false positive). The proportion of each type of error and of correctly predicted locations are used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric (the proportion of correctly predicted land use changes to the number of observed land use changes) or PCM (Pijanowski et al., 2002a), (2) sensitivity (the proportion of false positives) and specificity (the proportion of false negatives), and (3) scaleable PCM values across different window sizes.
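Expressed as array arithmetic (a schematic sketch with toy data and assumed array names, not the project's GIS scripts), the calibration map coding and the PCM follow directly:

# Schematic sketch of building the 0-4 coded calibration map and computing PCM.
import numpy as np

observed = np.array([[1, 0], [0, 1]])   # 1 = observed change, 0 = no change (toy data)
predicted = np.array([[1, 1], [0, 0]])  # 1 = predicted change, 0 = no change (toy data)
excluded = np.zeros_like(observed)      # 4 = excluded location, 0 = can change

# 0 = correct no change, 1 = missed change (omission), 2 = false alarm (commission),
# 3 = correct change, 4 = excluded.
calibration = observed * 1 + predicted * 2 + excluded

pcm = 100.0 * np.sum(calibration == 3) / np.sum(observed == 1)
print(calibration)
print("PCM =", pcm)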
Fig. 6 shows how scaleable PCM values are calculated using a 0-4 coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale and window; the number of 3s is incremented by one for every pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B) and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered. This moving N x N window is passed across the entire simulation area and the final number of 3s recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 x 5 would be a 7 x 7), and after all of the windows are considered in the map, the process is repeated
(note that the number of 3s is reset to the number of 3s in the entire calibration map) and the number of 3s is saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 x 9, which for this simulation, conducted at 100 m x 100 m, means that PCM reaches 50% at 900 m x 900 m. The scaleable window plots are also made for each reduced input model in order to determine the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.

Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
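The scaleable window procedure can be sketched in a few lines of array code. The fragment below is an illustrative re-implementation under the pairing rule described above (assumed function and variable names, toy data), not the LTM-HPC's own routine:

# Illustrative sketch of the scaleable-window PCM: within each N x N window,
# unmatched omission (1) and commission (2) cells are paired and counted as
# correct predictions at that scale.
import numpy as np

def scaleable_pcm(calibration, window):
    rows, cols = calibration.shape
    used = np.zeros_like(calibration, dtype=bool)   # 1s/2s already paired
    hits = int(np.sum(calibration == 3))            # true positives at the cell scale
    observed_change = int(np.sum((calibration == 1) | (calibration == 3)))

    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            block = calibration[r:r + window, c:c + window]
            free = ~used[r:r + window, c:c + window]
            ones = np.argwhere((block == 1) & free)
            twos = np.argwhere((block == 2) & free)
            for (r1, c1), (r2, c2) in zip(ones, twos):   # pair as many 1s with 2s as possible
                used[r + r1, c + c1] = True
                used[r + r2, c + c2] = True
                hits += 1
    return 100.0 * hits / observed_change if observed_change else 0.0

calib = np.random.choice([0, 1, 2, 3], size=(60, 60))  # toy calibration map
for w in (3, 5, 7, 9):
    print(w, round(scaleable_pcm(calib, w), 1))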
The final step for calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias, and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or dropping drivers that reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5 Validation
We follow the recommended procedures of Pontius et al
(2004) and Pontius and Spencer (2005) to validate our model
Brie1047298y we use an independent data set across time to conduct an
historical forecast to compare a simulated map (Fig 4 15) with an
observed historical land use map that was not used to build the
ANN model (Fig 4 20) For example below (Section 46) we
describe how we use a 2006 land use map that was not used to
build the model to compare to a simulated map Validation metrics
(Fig 4 21) include the same as that used for calibration namely
PCM of the entire map or spatial unit sensitivity speci1047297city PCM
across window sizes and error of quantity It should be noted thatbecause we 1047297x the quantity of the land use class that changes be-
tween time 1 and time 2 for calibration we do so for validation as
well (eg between time 2 and time 3 the number of cells that
changed in the observed maps are used to1047297x the quantity of cells to
change in the simulation that forecasts time 3)
3.6. Forecasting

We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds, or ecoregions, or any spatial unit scale, such as states, counties, or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states, and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes for government units), a file and directory-naming system, XML files, and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table, stored as a CSV file, of the number of cells to grow for each time step for each spatial unit. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with high rank are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while the other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created that applies additional exclusionary rules to create an alternative scenario.
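The rank-based allocation step can be sketched in Python rather than C as follows (function and variable names are ours; the actual program reads the per-unit probability maps and the quantity CSV described above):

import numpy as np

def allocate_change(prob, n_cells, exclusion=None):
    # Flip the n_cells highest-suitability cells in the unit to urban (1);
    # all other cells remain 0. exclusion is an optional boolean mask of
    # cells that may never change.
    flat = prob.astype(float).ravel().copy()
    if exclusion is not None:
        flat[exclusion.ravel()] = -1.0          # excluded cells can never be selected
    order = np.argsort(flat)[::-1]              # cell indices, highest suitability first
    change = np.zeros(flat.size, dtype=np.uint8)
    change[order[:n_cells]] = 1                 # grow until the quantity target is met
    return change.reshape(prob.shape)

# n_cells for each unit and time step would be read from the quantity CSV
# (rows: unit ID followed by the number of cells to transition per time step).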
Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; or Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
3.7. HPC job configuration

We developed a coding schema for the purpose of running the simulation across multiple locations. We used the standard numbering system from the Federal Information Processing Standards (FIPS) associated with states, counties, and places. FIPS is a hierarchical numbering system that assigns each state a two-digit code and each county within a state a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code, two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
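A small illustration of how these hierarchical FIPS codes translate into the unit identifiers used for file and task names (the helper function is hypothetical; the numeric values are the examples from the text):

def fips_ids(state=18, county=157, place=82862):
    # Two-digit state code, three-digit county code (five digits combined),
    # and five-digit place code (seven digits combined).
    county_fips = "%02d%03d" % (state, county)    # "18157"  (Tippecanoe County, IN)
    place_fips = "%02d%05d" % (state, place)      # "1882862" (West Lafayette, IN)
    return county_fips, place_fips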
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we needed to mosaic over 20,000 census places, a job failure for any one place would stop the one large job and require resuming the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs such that the number of jobs and associated XML files is still manageable. A failure of one census place then requires less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator to assign names for both input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide each job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML file for a region (state) with its status (item 2). Core resources are shown (Fig. 7, item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.
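A minimal sketch of such a script, using Python's standard xml.etree.ElementTree; the element and attribute names are illustrative rather than the exact Windows HPC 2008 job schema:

import xml.etree.ElementTree as ET

def write_mosaic_job(state_fips, place_ids, out_path):
    # One job per state, one task per census place; each task command calls the
    # polygon_mosaic.py script with unit-coded input and output paths (paths assumed).
    job = ET.Element("Job", Name="mosaic_%s" % state_fips)
    tasks = ET.SubElement(job, "Tasks")
    for pid in place_ids:
        ET.SubElement(tasks, "Task", Name="place_%s" % pid,
                      CommandLine="python polygon_mosaic.py "
                                  "input\\%s\\%s.asc output\\%s\\%s.asc"
                                  % (state_fips, pid, state_fips, pid))
    ET.ElementTree(job).write(out_path, encoding="utf-8", xml_declaration=True)

# write_mosaic_job("18", ["1882862", "1836003"], "jobs/mosaic_18.xml")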
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shapefile attribute table and
name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we do below), and (3) allow the researcher to forecast at different spatial units (also illustrated below).
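A sketch of the per-state mosaicking call, assuming the ArcGIS 10.x arcpy site package and its MosaicToNewRaster_management tool; file and folder names are illustrative:

import arcpy

def mosaic_state(asc_rasters, workspace, out_name):
    # Mosaic a list of per-unit ASC rasters (named by unit code) into one
    # state-level raster in the given workspace.
    arcpy.env.workspace = workspace
    arcpy.MosaicToNewRaster_management(";".join(asc_rasters), workspace,
                                       out_name, number_of_bands=1)

# mosaic_state(["18157.asc", "18005.asc"], r"D:\ltm\output\state\18", "18_prob.tif")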
4. Execution of LTM-HPC

4.1. Hardware and software description

We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, about a dozen steps, occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double-precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual-core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, one used for cluster communication and the other for external communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server.
Fig. 7. Data structure, programs, and files associated with training by the neural network. Item 1 represents an XML file for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
A cluster is the top-level unit; it is composed of nodes, i.e., single physical or logical computers with one or more processors that contain one or more cores. All modeling data were read from and written to a storage machine located in another building and transferred across an intranet with a maximum bandwidth of 1 gigabit per second.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers. Based on the results of our tests of file numbers per folder and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state), and, for forecasts, years and scenarios.
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
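A short sketch of how such a directory tree can be generated from the unit codes; the folder names below are illustrative, following the input/output and ASC/PAT split in Fig. 9:

import os

def make_unit_dirs(root, scale, unit_id, years=(2010, 2020, 2030, 2040, 2050)):
    # <root>/<input|output>/<asc|pat>/<scale>/<unit_id>, plus per-year
    # forecast folders under output (names are assumptions, not the exact layout).
    for io in ("input", "output"):
        for ftype in ("asc", "pat"):
            os.makedirs(os.path.join(root, io, ftype, scale, str(unit_id)), exist_ok=True)
    for year in years:
        os.makedirs(os.path.join(root, "output", "forecast", scale, str(unit_id), str(year)),
                    exist_ok=True)

# make_unit_dirs(r"D:\ltm", "state", 18)   # e.g., Indiana at the state scale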
4.2. Preliminary tests

The primary limitation on file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when processing of the RES file, which contains the values for all of the drivers (e.g., distance to urban), crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests of mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation per core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3. Data preparation for national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shapefiles: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance from each pixel to the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from the states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city, or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with its FIPS-designated census place value.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county, and places (cities, villages, and towns).
All input maps were created at the national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat.exe program was executed per census place to convert the files from ASC to PAT.
4.4. Pattern recognition simulations for national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile's PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where a value closer
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked into a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double-precision integers.
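The scaling performed by convert2ascii can be summarized as follows (a sketch of the arithmetic only, not the C program itself):

def scale_probability(res_values):
    # Each SNNS output in [0.0, 1.0] is multiplied by 100,000 so it can be
    # stored as an integer in the ASC grid.
    return [int(round(v * 100000)) for v in res_values]

# scale_probability([0.0, 0.73219, 1.0])  ->  [0, 73219, 100000]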
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full-input-variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute in a positive way toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
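The drop-one-out comparison in Fig. 11 amounts to the following calculation at each saved cycle (a sketch; mse_full and mse_reduced are assumed to be dictionaries keyed by cycle number):

def pct_diff_from_full(mse_full, mse_reduced):
    # Percent difference of a drop-one-out model's MSE from the full five-variable
    # model; positive values mean the dropped driver was contributing positively
    # to goodness of fit, negative values the opposite.
    return {cycle: 100.0 * (mse_reduced[cycle] - mse_full[cycle]) / mse_full[cycle]
            for cycle in sorted(mse_full)}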
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so this assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 101 × 101 window size (i.e., approximately 3 km by 3 km) and the PCM values merged
Fig. 11. Drop-one-out percent difference in MSE from the full-driver model.
with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.
4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, including all the
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
polygons within the state as different tasks to create forecast maps for each polygon. We embedded the path to the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map into a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task failed in a job, the standard output and standard error files provided a record of what each program did during execution. We also embedded the paths of the standard output and standard error files in the tasks of the XML jobs file.
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to the topographic limitations of that large metropolitan area. Dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.
5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.
5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.
The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation, and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail for one of the following reasons. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted. Jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times but will eventually fail if the problem continues. (3) The run time for a job or task expired. The job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed. A frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.
5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; the remaining variance in processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include the reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse-resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the
ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997;
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants delivered to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of spatially explicit environmental impact models. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied in the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules assigned to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents in aggregate but are individually small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on that continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine-resolution accuracy, (2) a quantification of the variability of fine-resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States. Patterns of location error differed from those of quantity; location errors were lower in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6. Conclusions

This paper presents a multi-scale application of the LTM-HPC using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using an HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:
(1) Automated data preparation. Data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using the unique identity for the unit, within the Python environment.
(2) Better memory usage. The source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of
the ASCII files by reading each line separately into an array in the C environment.
(3) Ability to conduct simultaneous analyses. The LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for each region in XML format and can repeat all of the processes simultaneously for different regions.
(4) Increased processing speed. The previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modelling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modelling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modelling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate–land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. Neural Netw. (IJCNN), 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center, June 1997.
previous urban (Fig. 2). Outputs are binary values of change (1) and no change (0) in observed land use maps. Input values are fed through a hidden layer with the number of nodes equal to that of the inputs (see Pijanowski et al., 2002a; Mas et al., 2004). The ANN uses learning rules to determine the weights, values for bias, and activation function that fit the input and output values (Fig. 2) of a dataset. Delta rules are used to adjust all of these values across successive passes of the data; each pass is called a cycle. A mean square error (MSE) is calculated for each cycle by a back propagation algorithm (Bishop, 1995; Dlamini, 2008); values for weights, bias, and activation function are then adjusted, and the training is stopped after a global minimum MSE is reached. This process of cycling through the data is called training. In the LTM we use a small, randomly selected subsample (between 5% and 10%) of the data to train. Applying the weights, values for bias, and activation functions from a training run to another dataset that contains inputs only, in order to estimate output, is referred to as testing. We conduct a double-phase testing with the LTM (Tayyebi et al., 2012) at large scales (e.g., the conterminous USA). The first phase of testing is to use the weights, bias, and activation values saved from the training of the subsample and apply them to the entire dataset. A set of goodness of fit statistics is generated between the predicted and observed maps; we also refer to this testing phase as model calibration. Model calibration also involves a "hold one out" procedure where each input data layer is held back from the same testing dataset, and the goodness of fit of the reduced-input models is compared against the full-complement model (see Pijanowski et al., 2002a). Thus, for the LTM simulation below, we used five input maps to predict one map of urban change. We held one input out at a time to produce five reduced-input models with the same urban change map. These reduced-input models are compared with the full-complement model of five input maps.
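For readers unfamiliar with the network in Fig. 2, the forward pass being fitted can be sketched as follows (a simplified illustration assuming logistic activations; the LTM itself relies on SNNS rather than this code):

import numpy as np

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # x is an (n_cells, 5) matrix of drivers; the hidden layer has as many nodes
    # as inputs; the single output is the transition potential in [0, 1] per cell.
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hidden = sigmoid(x @ w_hidden + b_hidden)        # (n_cells, 5)
    return sigmoid(hidden @ w_out + b_out).ravel()   # (n_cells,)

# Assumed shapes: w_hidden (5, 5), b_hidden (5,), w_out (5, 1), b_out (1,)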
We follow the recommendation of Pontius et al. (2004) and Pontius and Spencer (2005) for validating the model, our second phase of testing, which is done with a different dataset than that used for the first phase of testing. This independent dataset can be another land use map that was derived from a different source (i.e., a test of generalization) or from another year (i.e., a test of predictive ability). It is typical practice (Bishop, 1995) to use different data for training and testing, which is done here.
Forecasting is accomplished using a quantity model developed from per capita land use growth rates and a population growth estimate model (cf. Pijanowski et al., 2002a; Tayyebi et al., 2012). The quantity model can be applied across the entire simulation domain with one quantity estimate per time step, or it can be applied to smaller spatial units across the simulation domain, as done here.
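A minimal per-capita sketch of such a quantity calculation is given below (our illustration only; the actual quantity model uses a principle index driver and unit-specific growth rates as described in Pijanowski et al., 2002a and Tayyebi et al., 2012):

def quantity_urban_cells(pop_projected, pop_current, urban_cells_current):
    # Hold the current urban cells per person fixed and scale by projected
    # population to estimate the number of new cells to grow in one spatial
    # unit for one time step (hypothetical simplification).
    cells_per_person = urban_cells_current / float(pop_current)
    new_cells = cells_per_person * pop_projected - urban_cells_current
    return max(0, int(round(new_cells)))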
2.2. High Performance Computing (HPC)

High Performance Computing (HPC) integrates computer architecture design principles, operating system software, heterogeneous hardware components, programs, algorithms, and specialized computational approaches to address the handling of tasks not possible or practical with a single computer workstation (Foster and Kesselman, 1997; Foster et al., 2002). A self-contained HPC (i.e., a group of computers) is often referred to as a high performance compute cluster (HPCC) (cf. Cheung and Reeves, 1992; Buyya, 1999; Reinefeld and Lindenstruth, 2001). A main feature of HPCs is the integration of hardware and software systems that are configured to parse large processing jobs into smaller parallel tasks. Hardware resources can be managed at the level of cores (a single processing unit capable of performing work), sockets (a group of cores that have direct access to memory), and nodes (individual servers or computers that contain one or more sockets). The HPCC employed here is specifically configured to control the execution of several batch files, executable programs, and scripts for thousands of input and output data files. An HPCC is managed by an administrator, with hardware and software services accessible to many users. HPCCs are systems smaller than supercomputers, although the terms HPC and supercomputer are often used interchangeably.
We controlled all of our LTM-HPC programs on the HPCC using the Windows HPC 2008 Server R2 job manager, which has features common to all job managers. The server job manager contains (1) a job scheduler service that is responsible for queuing jobs and tasks, allocating resources, dispatching the tasks to the compute nodes, and monitoring the status of the jobs, tasks, and nodes; (2) a job description file, configured as an Extensible Markup Language (XML) file, listing job or task specifications; (3) a job, which is a resource request submitted to the job scheduler service that
Fig. 2. Structure of an artificial neural network. Input nodes (slope, distance to stream, distance to urban, distance to primary road, distance to secondary road) feed hidden nodes, which feed an output node (presence = 1 or absence = 0 of a land use transition). Weights, bias, and an activation function are assigned to estimate output; error is estimated from observed data (back propagation); a forward and backward pass is called a cycle or epoch.
assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs can also be submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.
Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; parameters are passed to the HPC Server for project name, user name, job type, and the types and level of hardware resources for the job, etc. Tasks are listed afterwards as dependencies of the higher-level job; tasks here contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the input and output file names).
3. Architecture of the LTM and LTM-HPC

3.1. Main components

Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., with larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter, the single-workstation LTM) and the features that were necessary to reconfigure it for an HPCC. There are several different kinds of programming environments that comprise the single-workstation LTM. The first are command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment that comprises the LTM consists of compiled programs written to accept environment variables as inputs. These programs are written in the C or C++ programming language as standalone EXE files to be executed at the command line. The environment variables for these programs are often the location and name of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of programming environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs, and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.
Fig. 4 shows the sequence of batch routines, programs and scripts that comprise the LTM, currently organized into six main model components: data preparation, pattern recognition, calibration, validation, forecasting and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use/cover change at a national scale; the batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.
3.2. Data preparation

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4, item 1); as this is done once for each simulation, this LTM component is not automated.
Fig. 3. XML job file illustrating the syntax for job parameters, task parameters and task commands.
Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
A C program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files, called pattern files (Fig. 4, item 3) and given a .PAT extension; data are transposed into the ANN structure. The data necessary to process files for the training run of the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shape files (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set the ESRI Spatial Analyst environment properties, such as cell size and number of rows and columns, for all data processing to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water, or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that are converted for use in pattern recognition, model calibration and model forecasting. The coding of locations with a '4' becomes more obvious below in the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map, used in our LTM-HPC application example below, is distance to roads; Spatial Analyst's Euclidean Distance tool is used to calculate the distance of each pixel from the nearest road. All GIS data for use in the LTM are written out as ASCII flat files (Fig. 5A).
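As an illustration of this preparation step, a minimal arcpy sketch (assuming ArcGIS 10.x with the Spatial Analyst extension; layer names and paths are hypothetical) might read:

import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")

# Use the base land use raster to standardize cell size, extent and snapping so that
# every input layer ends up with identical dimensions.
base = "landuse_1990.img"
arcpy.env.cellSize = base
arcpy.env.extent = base
arcpy.env.snapRaster = base

# Distance from every pixel to the nearest road, capped at 250 km (the cap used
# later for the national data).
dist_roads = EucDistance("roads_2000.shp", 250000)

# Write the grid out as an ASCII flat file for the pattern-file conversion step.
arcpy.RasterToASCII_conversion(dist_roads, "inputs/dist_roads.asc")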
Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are instead estimated by the neural net during that phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).

A key component of the LTM is converting data from a GIS-compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in the input maps for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe checks the headers of all maps to ensure that they are of the same dimensions.
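createpat.exe itself is a compiled program; the Python/numpy sketch below only illustrates the transformation it performs (normalization of each driver by its maximum, removal of excluded cells, optional subsampling and a dimension check), under assumed in-memory arrays rather than the actual ASCII/PAT file handling.

import numpy as np

def make_pattern(drivers, change, exclusion, sample_frac=1.0, seed=0):
    """drivers: list of 2-D arrays (inputs); change: 2-D 0/1 observed-change array;
    exclusion: 2-D array in which 4 marks cells that cannot transition.
    Returns rows of (normalized drivers..., output) for the cells kept in the pattern file."""
    # All maps must share the same dimensions (createpat.exe checks the ASCII headers).
    assert all(a.shape == change.shape for a in drivers + [exclusion])
    rng = np.random.RandomState(seed)
    flat = [a.ravel().astype(float) for a in drivers]
    maxima = [d.max() if d.max() > 0 else 1.0 for d in flat]   # stored so values fall in 0.0-1.0
    out = change.ravel().astype(float)
    keep = exclusion.ravel() != 4                               # drop excluded locations
    rows = []
    for i in np.where(keep)[0]:
        if sample_frac < 1.0 and rng.rand() > sample_frac:
            continue                                            # subsample training locations
        rows.append([d[i] / m for d, m in zip(flat, maxima)] + [out[i]])
    return np.array(rows)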
3.3. Pattern recognition

SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows several options, which are employed in the LTM. These include a "shuffle" option, which randomly reorders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010); the values for initial weights (cf. Denoeux and Lengellé, 1993); the names of the pattern files for input and output; the filename containing the network values; and a set of start and stop conditions (e.g., a stop condition can be set when a target MSE or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias and activation function values are saved every N cycles. An MSE equal to 0.0 indicates that the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after fewer than 100,000 cycles in most cases.
Pseudo-code for train.bat is:

loadNet("ltm.net")
loadPattern("train.pat")
setInitFunc("Randomize_Weights", -1.0, 1.0)
setShuffle(TRUE)
initNet()
while MSE > 0.0 and CYCLES < 500000 do
    trainNet()
    if CYCLES mod 100 == 0 then
        print(CYCLES, " ", MSE)
    endif
    if CYCLES == 100 then
        saveNet("100.net")
    endif
endwhile
We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS-compatible format (Fig. 4, item 11). The test PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe. Output from this execution is called a RES (or result) file (Fig. 4, item 9). The RES file contains the estimates of output created by the neural network. The RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values in the RES files are between 0.0 and 1.0; convert_ascii.exe stores the values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the RES file values by 100,000, so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and large floating point files through ArcGIS 10.0 are unstable).
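A minimal sketch of that rescaling step, assuming the RES values have already been read back in the map's row and column order (names and header values are illustrative):

import numpy as np

def res_to_suitability(res_values, nrows, ncols, out_asc, cellsize=30, xll=0.0, yll=0.0):
    """Convert 0.0-1.0 neural net outputs to an integer ESRI ASCII grid scaled by 100,000."""
    grid = np.asarray(res_values, dtype=float).reshape(nrows, ncols)
    scaled = np.rint(grid * 100000).astype(int)          # integers avoid floating-point rasters
    with open(out_asc, "w") as f:
        f.write("ncols %d\nnrows %d\nxllcorner %f\nyllcorner %f\n" % (ncols, nrows, xll, yll))
        f.write("cellsize %d\nNODATA_value -9999\n" % cellsize)
        for row in scaled:
            f.write(" ".join(str(v) for v in row) + "\n")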
We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and the result compared to training with all inputs. Thus, if we start with a five-input-variable neural network model, we hold one input out at a time, create calibration time step maps for each, and save error files over the training cycles for each of the reduced input variable models.
3.4. Calibration

For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value.
Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.
MSE is calculated as the difference between the estimate produced by the neural network (range 0.0–1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics is created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that must be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, where 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent the two types of error (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric or PCM (Pijanowski et al., 2002a); (2) sensitivity (the proportion of false positives) and specificity (the proportion of false negatives); and (3) scaleable PCM values across different window sizes.
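The map coding and the whole-map metrics reduce to a few array operations; the numpy sketch below reproduces the summation and the PCM calculation described above (it assumes excluded cells carry no observed or predicted change, and the names are illustrative):

import numpy as np

def calibration_map(observed, predicted, exclusion):
    """observed: 1 = observed change, 0 = no change; predicted: 1 = predicted change, 0 = none;
    exclusion: 1 where the cell cannot transition. Codes: 0 correct no-change, 1 omission,
    2 commission, 3 correct change, 4 excluded."""
    cal = observed + 2 * predicted
    cal[exclusion == 1] = 4            # excluded cells override the other codes
    return cal

def whole_map_metrics(cal):
    counts = {v: int((cal == v).sum()) for v in (0, 1, 2, 3)}
    observed_change = counts[1] + counts[3]               # every cell that actually changed
    pcm = 100.0 * counts[3] / observed_change if observed_change else 0.0
    return {"PCM_percent": pcm, "omission_cells": counts[1], "commission_cells": counts[2]}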
Fig. 6 shows how scaleable PCM values are calculated from a 0/1/2/3/4-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (cell coded as 2) and a false negative (cell coded as 1) is considered a correct prediction at that scale and window; the count of 3s is incremented by one for every such pair. The window is then moved one position to the right (Fig. 6B) and pairs of 1s and 2s are again added to the total count of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N × N window is passed across the entire simulation area and the final count of 3s is recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 × 5 is a 7 × 7) and, after all of the windows in the map have been considered, the process is repeated (note that the count of 3s is reset to the number of 3s in the entire calibration map) and the count of 3s is saved for that window size.
Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part C). The window size is then incremented by 2 and the process is repeated. Part D gives PCM across scaleable window sizes.
Window sizes that we typically plot range between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m resolution, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced input model in order to compare the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.
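As an illustration of the idea (not the authors' exact implementation), a simplified single-window pass that greedily pairs omission and commission cells, and flags matched cells so they are not counted twice, could be written as:

import numpy as np

def scaleable_pcm(cal, win):
    """cal: calibration map coded 0-4; win: odd window size (3, 5, 7, ...).
    Returns PCM (%) after pairing omission (1) and commission (2) cells inside each window."""
    nrows, ncols = cal.shape
    hits = int((cal == 3).sum())                        # true positives counted first
    observed_change = hits + int((cal == 1).sum())      # PCM denominator: all observed-change cells
    matched = np.zeros(cal.shape, dtype=bool)           # 1s and 2s already paired
    half = win // 2
    for r in range(half, nrows - half):
        for c in range(half, ncols - half):
            w = (slice(r - half, r + half + 1), slice(c - half, c + half + 1))
            ones = np.argwhere((cal[w] == 1) & ~matched[w])
            twos = np.argwhere((cal[w] == 2) & ~matched[w])
            for (r1, c1), (r2, c2) in zip(ones, twos):  # each omission-commission pair counts as a hit
                matched[r - half + r1, c - half + c1] = True
                matched[r - half + r2, c - half + c2] = True
                hits += 1
    return 100.0 * hits / observed_change if observed_change else 0.0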
The final step for calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or drop drivers that reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5. Validation

We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model as a comparison for a simulated map. Validation metrics (Fig. 4, item 21) are the same as those used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells allowed to change in the simulation that forecasts time 3).
3.6. Forecasting

We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds or ecoregions, or at any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
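The quantity model itself is developed offline (Tayyebi et al., 2012); the sketch below only illustrates the kind of per-person scaling implied by a principle index driver, using made-up numbers for a single spatial unit:

# Illustrative quantity calculation for one spatial unit (all numbers are made up).
cell_area_ha = 0.09                          # one 30 m x 30 m cell in hectares
urban_cells_2000 = 120000                    # urban cells observed in the unit in 2000
population_2000 = 250000
population_2050 = 340000                     # offline population projection for the unit

cells_per_person = urban_cells_2000 / float(population_2000)    # per-person scaling of urban land
new_cells_2050 = int(round((population_2050 - population_2000) * cells_per_person))
print(new_cells_2050, "new urban cells (", new_cells_2050 * cell_area_ha, "ha) by 2050")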
We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit, delivered as a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition at each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with the highest ranks are then converted to urban (code 1) until the number of new urban cells for the unit is satisfied, while the other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created to apply additional exclusionary rules and create an alternative scenario.

Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
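A Python sketch of that rank-based allocation, written here for clarity (the CSV column names, file layout and function names are assumptions, not the authors' C code):

import csv
import numpy as np

def allocate_change(prob_map, n_new_cells):
    """Turn on the n_new_cells highest-probability cells (code 1); everything else stays 0.
    Ties are broken by original cell order via a stable sort, mirroring the ranking step."""
    flat = prob_map.ravel()
    order = np.argsort(-flat, kind="mergesort")          # stable sort, highest values first
    change = np.zeros(flat.shape, dtype=np.uint8)
    change[order[:n_new_cells]] = 1
    return change.reshape(prob_map.shape)

# Quantity table: one row per spatial unit; column names here are assumptions.
with open("quantity_by_place.csv") as f:
    for row in csv.DictReader(f):
        prob = np.loadtxt("prob/%s.asc" % row["FIPS"], skiprows=6)   # skip the ESRI ASCII header
        forecast = allocate_change(prob, int(row["cells_2050"]))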
3.7. HPC job configuration

We developed a coding schema for the purpose of running the simulation across multiple locations. We used the standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns each state a two-digit code and each county within a state a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code: two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any one place would stop the single large job and require resuming the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs such that the number of jobs and associated XML files is still manageable. A failure of one census place then requires less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designators for both to assign names to input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide each job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML for a region (state) with its status (item 2). Core resources are shown in item 3. A tab (item 4) displays the status of each task (item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.
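As an illustration of that grouping, seven-digit census place codes can be bundled into one job per state using their two-digit state prefix; the codes, paths and the job-writing helper referenced in the comment are illustrative assumptions:

from collections import defaultdict

def group_places_by_state(place_codes):
    """Group 7-digit census place codes (2-digit state FIPS + 5-digit place) by state."""
    groups = defaultdict(list)
    for code in place_codes:
        groups[code[:2]].append(code)
    return groups

places = ["1882862", "1836003", "0644000"]               # illustrative codes only
for state_fips, codes in sorted(group_places_by_state(places).items()):
    commands = ["python polygon_mosaic.py in\\%s\\%s.asc out\\%s.asc" % (state_fips, c, c)
                for c in codes]
    # Each state's command list becomes one XML jobs file (e.g., via a helper such as the
    # write_job_xml sketch shown earlier); a failed place then only requires re-running
    # that state's job rather than a single national job.
    print(state_fips, len(commands), "tasks")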
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and
name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we do below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
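A stripped-down version of such a mosaicking script (arcpy geoprocessing calls as in ArcGIS 10.x; paths and unit codes are illustrative, and this is a sketch rather than polygon_mosaic.py itself) is:

import arcpy

arcpy.env.workspace = r"D:\ltm\suitability"

def mosaic_state(state_fips, place_codes):
    """Mosaic per-place suitability grids for one state into a single state raster."""
    rasters = []
    for code in place_codes:
        tif = "p%s.tif" % code
        # each place's ASCII suitability map is first converted back to a raster
        arcpy.ASCIIToRaster_conversion("%s.asc" % code, tif, "INTEGER")
        rasters.append(tif)
    arcpy.MosaicToNewRaster_management(rasters, arcpy.env.workspace,
                                       "suit_%s.tif" % state_fips,
                                       pixel_type="32_BIT_UNSIGNED", number_of_bands=1)

mosaic_state("18", ["1882862", "1836003"])   # illustrative unit codes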
4. Execution of the LTM-HPC

4.1. Hardware and software description

We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, about a dozen steps, occurred on this computer system. A third computer system stored all of the data for the simulations. Specific configurations of each computer system follow.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit workstation equipped with 24 GB of RAM, a 256 GB solid state drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were computed using the Euclidean Distance tool in ArcGIS, with all output set to double precision integer; given the large size of each dataset, we limited the distances to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server 2008 HPC Edition was installed on the HPCC. Each node was powered by a pair of dual-core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, one used for intra-cluster communication and the other for external communication. Each node had 74 GB of hard drive space that was used for the operating system and software but not for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit; it is composed of nodes, i.e., single physical or logical computers with one or more cores that include one or more processors.
Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum bandwidth of 1 gigabit per second.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows Server 2008 R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.
Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
4.2. Preliminary tests

The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), caused the run to crash. To overcome this issue we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and used the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation per core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3. Data preparation for the national simulation

We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. Original data were vector and were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with its FIPS-designated census place value.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county and places (cities, villages and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
4.4. Pattern recognition simulations for the national model

We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500.net file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 mean a higher chance of changing).
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
4.5. Calibration of the national simulation

We trained six neural network versions of the model: one that contained all five input variables and five that each contained four input variables, where one input variable was dropped from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full input variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. The plot also illustrates how the neural network behaves: between 0 and approximately 23,000 cycles, the neural network makes large adjustments in the weights and in the values for the activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute positively toward the goodness of fit and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
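The percent-difference curves in Fig. 11 are a simple transformation of the saved MSE traces; as a sketch (file names are illustrative):

import numpy as np

# MSE saved every 100 cycles for the full model and for one reduced ("drop one out") model.
mse_full = np.loadtxt("mse_full_model.txt")
mse_drop_highways = np.loadtxt("mse_drop_highways.txt")

# Positive values mean the reduced model fits worse, i.e., the dropped driver
# contributes positively to goodness of fit at that point in training.
pct_diff = 100.0 * (mse_drop_highways - mse_full) / mse_full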
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0/1/2/3/4-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0/1/2/3/4-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
4.6. Validation of the national model

We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0/1/2/3/4-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area in each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles along coastal areas contained errors in the quantity of urban greater than 5%.
The second goodness of fit assessment highlights the use of the HPC for a computationally demanding calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0/1/2/3/4-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the largest window size (i.e., approximately 3 km by 3 km) and the PCM values were merged with the shape file for tiles.
Fig. 11. Drop-one-out percent difference in MSE from the full driver model.
Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.
4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contains the number of urban cells to grow for each polygon in 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as separate tasks, to create forecast maps for each polygon.
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task failed in a job, the standard out and standard error files provided a record of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to topographic limitations in that large metropolitan area, whereas dispersed urbanization is characteristic of flat areas such as Florida, Atlanta and the Northeast.
5. Discussion

We have presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger land change models, especially ones that can be coupled to large-scale process models, such as climate (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that, to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.
5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states); in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time rather than read large files into a large array. This was necessary despite the large amount of memory contained in the HPC.
The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
The high-level architecture was designed following the steps outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation and (5) application. Job files were constructed for the steps that interfaced each of these modeling stages. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We experienced that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas, as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance in processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage the workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the
ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); and how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011).
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Other frequently cited impacts include how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b), and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems such as freshwater fish communities (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of spatially explicit environmental impact models. We intend to conduct that work in the HPC environment using the principles outlined above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied in the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents in aggregate but are individually small (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem Conference (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, while patterns of location error differed from those of quantity, being lower in the east (Fig. 12). The location of errors can also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tend to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed with large, big data simulations such as the one presented here in mind; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6. Conclusions

This paper has presented the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for the unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of
the ASCII files by reading each line separately into an array in the C environment.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for each region in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry? Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
assigns hardware resources to all tasks; and (4) a task, which is a command (i.e., a program or script) with path names for input and output files and the software resources assigned for each task. Many jobs and all tasks are run in parallel across multiple processors. The HPC job manager is the primary interface for submitting LTM-HPC jobs to a cluster; it uses a graphical user interface. Jobs can also be submitted from a remote desktop using a client utility in Microsoft Windows HPC Pack 2008.
Fig. 3 shows sample lines from an XML job description file used below to create forecast maps by state. Note that the highest level contains job parameters; parameters are passed to the HPC Server for project name, user name, job type, the type and level of hardware resources for the job, etc. Tasks are listed after, as dependencies of the higher-level job; tasks contain several parameters (e.g., how the hardware resources are used) and commands (e.g., the name of the Python script to execute and the parameters for that script, such as the names of the input and output files).
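A minimal sketch of how such a job description file could be generated from Python is shown below. It is illustrative only: the element and attribute names only approximate the HPC Pack 2008 job XML schema shown in Fig. 3, and the script arguments and log paths are hypothetical.

import xml.etree.ElementTree as ET

def build_job_xml(state_fips_codes, out_path="XML_Clip_BASE.xml"):
    # One job per model component; one task per state (grouped by FIPS code)
    job = ET.Element("Job", Name="LTM_clip", Project="LTM-HPC",
                     MinCores="6", MaxCores="12")
    tasks = ET.SubElement(job, "Tasks")
    for fips in state_fips_codes:
        ET.SubElement(
            tasks, "Task", Name="clip_%s" % fips,
            CommandLine="python polygon_clip.py --state %s" % fips,
            StdOutFilePath="logs\\clip_%s.out" % fips,   # hypothetical log locations
            StdErrFilePath="logs\\clip_%s.err" % fips)
    ET.ElementTree(job).write(out_path)

build_job_xml(["18", "26", "39"])   # e.g. Indiana, Michigan, Ohio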
3 Architecture of the LTM and LTM-HPC
3.1 Main components
Several modifications were made to the LTM to make the LTM-HPC run at larger spatial scales (i.e., with larger datasets) and at fine resolution. Below we describe the structure of the components that comprise the current version of the LTM (hereafter the single workstation LTM) and the features that were necessary to reconfigure it for an HPCC. Several different kinds of programming environments comprise the single workstation LTM. The first are command-line interpreter instructions configured as batch files for use in the Windows operating system; these are named using the BAT extension. Batch files control most of the processing of data for the Stuttgart Neural Network Simulator (SNNS). A second type of programming environment comprising the LTM is compiled programs written to accept environment variables as inputs. Programs are written in the C or C++ programming language as standalone EXE files to be executed at the command line. The environment variables for these programs are often the locations and names of input and output files. Compiled programs are used to transpose data structures and to calculate very specific values during model calibration. The third kind of programming environment is the script environment, written to execute application-specific tools. The application-specific scripts that we use here are ArcGIS Python (version 2.6 or higher) scripts, which call certain features and commands of ArcGIS and Spatial Analyst. A fourth type of software environment is the XML jobs file; these are used by the Windows 2008 Server R2 job manager of the LTM-HPC to execute and organize the batch routines, compiled programs and scripts in the proper order and with the necessary environment variables. This fourth kind of software environment, the XML jobs file, is only present in the LTM-HPC.
Fig. 4 shows the sequence of batch routines, programs and scripts that comprise the LTM, currently organized into six main model components: data preparation, pattern recognition, calibration, validation, forecasting and application. Here we provide an overview of the key features of the LTM and LTM-HPC, emphasizing how these features enable us to simulate land use/cover change at a national scale; those batch routines and programs that have been modified for running in the HPC environment and configured using XML job files are contained in the red boxes in Fig. 4.
3.2 Data preparation
Fig. 3. XML job file illustrating the syntax for job parameters, task parameters and task commands.

Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Data preparation for training and testing runs in the LTM and LTM-HPC is conducted using a GIS and spatial databases (Fig. 4,
item 1); as this is done once for each simulation, this LTM component is not automated. A C program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files called pattern files (Fig. 4, item 3), given a PAT extension; data are transposed into the ANN structure. Data necessary to process files for the training run for the neural net simulation are model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shape files (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that is converted for use in pattern recognition, model calibration and model forecasting. The coding of locations with a '4' becomes more obvious below under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance each pixel is from the nearest road. All GIS data for use in the LTM are written out as an ASCII flat file (Fig. 5A).
Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input, and the output values are estimated by the neural net in the testing phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).
A key component of the LTM is converting data from a GIS-compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in the input maps for training is also stored in the input file, and this is used to standardize all values from the input maps, because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe also checks the headers of all maps to ensure that they are of the same dimensions.
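The logic of this conversion can be sketched as follows. This is a minimal illustration only (createpat.exe itself is a C program); it assumes the ASCII grids have already been read into numpy arrays, and the function name is hypothetical.

import numpy as np

def make_pattern(inputs, exclusion, change=None, sample_frac=1.0, seed=0):
    # inputs: list of 2-D driver arrays; exclusion: array coded 4 where excluded;
    # change: optional 0/1 array of observed transitions (training runs only)
    keep = (exclusion != 4).ravel()
    cols = [a.ravel()[keep] / float(a.max()) for a in inputs]   # scale to 0.0-1.0
    rows = np.column_stack(cols)                                # transpose to one row per cell
    if change is not None:
        rows = np.column_stack([rows, change.ravel()[keep]])
    if sample_frac < 1.0:                                       # subsample locations for training
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(rows), int(sample_frac * len(rows)), replace=False)
        rows = rows[idx]
    return rows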
3.3 Pattern recognition
SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010); the values for initial weights (cf. Denoeux and Lengellé, 1993); the names of the pattern files for input and output; the filename containing the network values; and a set of start and stop conditions (e.g., a stop condition can be set if an MSE value or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias and activation function values are saved every N cycles. An MSE equal to 0.0 is a condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases.
Pseudo-code for TRAIN.BAT is:

loadNet("ltm.net")
loadPattern("train.pat")
setInitFunc("Randomize_Weights", 1.0, 1.0)
setShuffle(TRUE)
initNet()
trainNet()
while MSE > 0.0 and CYCLES < 500000 do
  trainNet()
  if CYCLES mod 100 == 0 then
    print(CYCLES, " ", MSE)
  endif
  if CYCLES == 100 then
    saveNet("100.net")
  endif
endwhile
We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format back to a GIS-compatible format (Fig. 4, item 11). The test PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a RES (or result) file (Fig. 4, item 9). The RES file contains the estimates of output created by the neural network. The RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values from the RES files are between 0.0 and 1.0; convert_ascii.exe also stores values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the RES file values by 100,000 so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and thus large floating point files through ArcGIS 10.0 are unstable).
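The scaling performed by the conversion step amounts to the following. This is a minimal sketch under stated assumptions (the production tool is a C program, RES header handling is omitted, and the file names are hypothetical):

import numpy as np

res = np.loadtxt("tile_001.res")                   # neural-net outputs, 0.0-1.0
suit = np.rint(res * 100000).astype(np.int32)      # integer suitability, 0-100000
np.savetxt("tile_001_suit.asc", suit, fmt="%d")    # integer ASCII grid for ArcGIS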
We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set, distance to roads is held out and compared to having all inputs in the training. Thus, if we start with a five input variable neural network model, we hold one input out at a time, create calibration time step maps for each, and save error files over training cycles for each of the reduced input variable models.
3.4 Calibration
For model calibration (see Bennett et al., 2013 for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the
Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.
neural network settles at a global minimum value. MSE is calculated as the difference between the estimate produced by the neural network (range 0.0–1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics is those created from a calibration map. A calibration map is also constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) generates values 0 through 4, with 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent different errors (omission and commission, or false negative and false positive). The proportion of each type of error and of correctly predicted locations is used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric or PCM (the proportion of correctly predicted land use changes to the number of observed land use changes; Pijanowski et al., 2002a); (2) sensitivity (the proportion of false positives) and specificity (the proportion of false negatives); and (3) scaleable PCM values across different window sizes.
Fig. 6 shows how scaleable PCM values are calculated using a 01234-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a false positive (a cell coded as 2) and a false negative (a cell coded as 1) are considered together as a correct prediction at that scale and window; the number of 3s is incremented by one for every pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B), and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N × N window is passed across the entire simulation area and the final number of 3s recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 × 5 would be a 7 × 7), and after all of the windows are considered in the map, the process is repeated
Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
(note that the count of 3s is reset to the number of 3s in the entire calibration map), and the number of 3s is saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced input model in order to determine the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.
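The moving-window pairing described above could be implemented as in the following minimal sketch. It assumes the 01234-coded calibration map is held as a numpy array and that the window advances one cell at a time; it is not the production code, which runs as a compiled routine inside the XML_Scaleable jobs.

import numpy as np

def pcm_at_window(calib, n):
    # calib codes: 0 = correct no-change, 1 = false negative (omission),
    # 2 = false positive (commission), 3 = correct change; n = window size
    observed = int(((calib == 1) | (calib == 3)).sum())   # observed change cells
    hits = int((calib == 3).sum())                        # pixel-scale true positives
    used = np.zeros(calib.shape, dtype=bool)              # 1s/2s already paired
    rows, cols = calib.shape
    for r in range(rows - n + 1):
        for c in range(cols - n + 1):
            win = calib[r:r+n, c:c+n]
            uwin = used[r:r+n, c:c+n]
            ones = np.argwhere((win == 1) & ~uwin)        # unmatched omissions
            twos = np.argwhere((win == 2) & ~uwin)        # unmatched commissions
            pairs = min(len(ones), len(twos))
            hits += pairs                                  # each 1-2 pair counts as correct
            for i, j in np.vstack([ones[:pairs], twos[:pairs]]):
                used[r + i, c + j] = True
    return 100.0 * hits / observed                         # PCM (%) at this window size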
The final step for calibration is the selection of the network file (Fig. 4, items 16–19) with the inputs that best represent land use change, and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or dropping drivers which reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5 Validation
We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model to compare to a simulated map. Validation metrics (Fig. 4, item 21) include the same ones used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).
3.6 Forecasting
We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as governmental units, watersheds or ecoregions, or any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with high rank are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created that applies additional exclusionary rules to create an alternative scenario.
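The rank-based allocation can be sketched as follows. This is a minimal illustration under stated assumptions (the production version is a C program; the function name and the use of numpy arrays are hypothetical):

import numpy as np

def allocate_urban(prob, exclusion, n_cells):
    # prob: integer suitability map (0-100000); exclusion: map coded 4 where
    # change is not allowed; n_cells: cells to transition for this spatial unit,
    # taken from the quantity model CSV
    out = np.zeros_like(prob, dtype=np.uint8)
    flat = prob.ravel().astype(float)
    flat[exclusion.ravel() == 4] = -1            # excluded cells can never be selected
    order = np.argsort(flat)[::-1]               # rank cells by suitability, highest first
    out.ravel()[order[:n_cells]] = 1             # top-ranked cells become new urban
    return out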
Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
3.7 HPC job configuration
We developed a coding schema for the purpose of running the simulation across multiple locations. We used the standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns states a two-digit code and a county within those states a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code: two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
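To make the naming scheme concrete, a short illustration follows; the file-name pattern at the end is hypothetical and stands in for the unit-coded names used throughout the directory structure.

state = 18                                     # Indiana
county = 157                                   # Tippecanoe County
place = 82862                                  # West Lafayette
county_fips = "%02d%03d" % (state, county)     # "18157"
place_fips = "%02d%05d" % (state, place)       # "1882862"
asc_name = "suit_%s.asc" % place_fips          # e.g. one ASCII file per census place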
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any of the places would result in the one large job stopping, and the execution would then need to be resumed at the point of failure. A second approach, used here, is to group tasks into numerous jobs where the number of jobs and associated XML files is still manageable. A failure of one census place would then require less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator for both to assign names for input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide the job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML for a region (state) with its status (item 2). Core resources are shown (Fig. 7, item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and
name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
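A minimal sketch of what the per-state mosaicking script could look like is given below. It assumes ArcGIS 10.x with arcpy available and that the per-tile grids have already been converted to rasters; the XML listing format, path names and function name are hypothetical, not the published polygon_mosaic.py.

import arcpy
import xml.etree.ElementTree as ET

def mosaic_state(xml_list, out_dir, state_fips):
    # read the tile raster paths listed in the XML file for this state
    tiles = [t.get("path") for t in ET.parse(xml_list).getroot().iter("file")]
    arcpy.MosaicToNewRaster_management(
        ";".join(tiles), out_dir, "suit_%s.img" % state_fips,
        "", "32_BIT_SIGNED", "", 1)            # 1 band, signed integer output

mosaic_state("state_18_files.xml", r"D:\ltm\output\state", "18")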
4 Execution of LTM-HPC
4.1 Hardware and software description
We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing of about a dozen steps occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive and ArcGIS 10.0 with the Spatial Analyst extension. The specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gbps network adapters, with one used for cluster communication and the other for external cluster communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit, which
Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
is composed of nodes, or single physical or logical computers with one or more cores that include one or more processors. All modeling data were read from and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 Gigabit of bandwidth.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.
Based on the file number per folder results and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure, as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
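A minimal sketch of such a layout is shown below; the specific folder names are assumptions drawn from this description, not a transcription of Fig. 9.

import os

root = r"D:\ltm_hpc"
for branch in ["input/asc/national", "input/asc/state", "input/pat/state",
               "output/suitability/state", "output/forecast/2020/baseline",
               "output/forecast/2050/baseline", "logs"]:
    os.makedirs(os.path.join(root, branch), exist_ok=True)   # build the tree once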
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
4.2 Preliminary tests
The primary limitation on file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files that SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As the SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3 Data preparation for national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
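The distance-grid preparation could be scripted roughly as follows. This is a minimal sketch under stated assumptions: it requires the Spatial Analyst extension, the input and output file names are hypothetical, and the 250 km cap mirrors the limit described above.

import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")
arcpy.env.snapRaster = "nlcd_1990_urban.img"   # align all outputs to the base grid
arcpy.env.cellSize = 30                        # 30 m national resolution
dist = EucDistance("highways.shp", 250000)     # distances capped at 250 km
dist.save("dist_highways.img")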
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places is distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS designated census place value.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): governmental units and fixed-size tiles. The three scales for our governmental unit simulations were national, county and places (cities, villages and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
4.4 Pattern recognition simulations for national model
We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500.NET file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
to 1.0 mean a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integer.
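The per-tile testing workflow amounts to a loop over the 634 tiles, which could be driven as in the minimal sketch below. The program names follow the text, but the command-line arguments and file-name patterns shown are assumptions.

import subprocess

tiles = ["%03d" % i for i in range(1, 635)]                 # 634 three-digit tile codes
for t in tiles:
    # apply the saved network to the tile's pattern file, then convert RES to ASCII
    subprocess.check_call(["batchman", "-f", "test_%s.bat" % t])
    subprocess.check_call(["convert2ascii.exe", "tile_%s.res" % t, "tile_%s.asc" % t])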
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per pixel basis, and thus the processing unit did not affect the output value.
4.5 Calibration of the national simulation
We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full input variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop one out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop one out calibration that (1) all five variables contribute in a positive way toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
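The drop one out comparison itself is a simple post-processing of the saved error logs, as sketched below. It assumes the MSE values saved every 100 cycles have been collected into two-column text files (cycle, MSE); the file names and driver labels are hypothetical.

import numpy as np

full = np.loadtxt("mse_full.txt")                    # columns: cycle, MSE (full model)
drop = {v: np.loadtxt("mse_drop_%s.txt" % v)
        for v in ["highway", "road", "stream", "urban", "slope"]}
for name, series in drop.items():
    # percent difference in MSE relative to the full five-variable model
    pct_diff = 100.0 * (series[:, 1] - full[:, 1]) / full[:, 1]
    print(name, "mean % difference in MSE:", pct_diff.mean())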
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
4.6 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles along coastal areas contained errors in quantity of urban greater than 5%.
The second goodness of fit assessment highlights the use of the HPC for a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 101 × 101 window size (i.e., about 3 km by 3 km), and PCM values were merged
Fig. 11. Drop one out percent difference in MSE from the full driver model.
with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
4.7 Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We considered each state as a job, including all the
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
polygons within the state as different tasks to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level to create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic prediction pieces at the state level to create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to topographic limitations in that large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.
5 Discussion
We presented an overview of the conversion of a single workstation land change model to operate using a high performance computer cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b), such as watersheds. However, there is a need for larger sized land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.
5.1 Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data one at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.
The high-level architecture was designed to follow the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation and (5) application. Job files were constructed for the steps that interfaced each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted. Jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired. The job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed. A frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
5.2 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time consuming and error prone process; in some cases we needed to manually mosaic a few areas as the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance in processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include the reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3 Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output that has multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the
are often used to assess how human use of the land may impact
ecosystem health It is well known that land use cover change
impacts ecosystem processes at a variety of spatial scales ( Reid
et al 2010 GLP 2005 Lambin and Geist 2006) Some of the
most frequently cited ecosystem impacts include how land use
change at large extents affect the total amount of carbon
sequestered in aboveground plants and soils in a region (eg Dixon
et al 1994 Post and Kwon 2000 Cox et al 2000 Vleeshouwers
and Verhagen 2002 Guo and Gifford 2002) how patterns and
amounts of certain land covers (eg forests urban) affect invasive
species spread and distributions (eg Sharov et al 1999 Fei et al
2008) how land surface properties feedback to the atmosphere
through alterations of water and energy 1047298
uxes (eg Dale 1997
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish organisms (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that, because such a large area may be heterogeneous, several transition rules may need to be applied in the same simulation, with rules applied to areas based on another higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents in aggregate but are individually relatively small (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU-15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4 Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States. Patterns of location error differed from those of quantity; location errors were lower in the east (Fig. 12). The location of errors could be important, too, if it affects the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6 Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:
(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for the unit, in the Python environment.
(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
Fig. 4. Tool and data view of the LTM-HPC (see legend for a description of model components and their meaning). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
item 1); as this is done once for each simulation, this LTM component is not automated. A C program called createpat.exe (Fig. 4, item 2) is used to convert spatial data to neural net files called pattern files (Fig. 4, item 3), given a PAT extension; data are transposed into the ANN structure. The data necessary to process files for the training run of the neural net simulation are the model inputs, two land use maps separated by approximately 10 years or more, and a map of locations that need to be excluded from the neural net simulation. Vector shape files (e.g., roads) and raster files (e.g., digital elevation models, land use/cover maps) are loaded into ArcGIS, and ESRI's Spatial Analyst is used to calculate values per pixel in the simulation domain that are used as inputs to the neural net. A raster file is selected (e.g., the base land use map) to set ESRI Spatial Analyst Environment properties, such as cell size and number of rows and columns, for all data processing to ensure that all inputs have standard dimensions. A separate file, referred to as the exclusionary zone map (Fig. 5, item 1), is created using a GIS. Exclusionary maps contain locations where a land use transition cannot occur in the future. For a model configured to simulate urban, for example, areas that are in protected areas (e.g., public parks), open water, or already urban are coded with a '4' in the exclusionary zone map. This exclusionary map is used in several steps of the LTM for excluding data that are converted for use in pattern recognition, model calibration, and model forecasting. The coding of locations with a '4' becomes more obvious below under the presentation of model calibration. Inputs (Fig. 5, item 2) are created by applying the spatial transition rules outlined in Pijanowski et al. (2000). A frequent input map is distance to roads; for our LTM-HPC application example below, Spatial Analyst's Euclidean Distance tool is used to calculate the distance of each pixel from the nearest road. All GIS data for use in the LTM are written out as ASCII flat files (Fig. 5A).
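As a rough sketch of this preprocessing step (the file names below are hypothetical; the calls are the standard Spatial Analyst Euclidean Distance and Raster to ASCII geoprocessing tools), the distance-to-roads input could be prepared in Python roughly as follows:

# Sketch of preparing the distance-to-roads driver (hypothetical file names).
import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")
arcpy.env.snapRaster = "landuse_1990.img"   # base land use map fixes cell size and extent
arcpy.env.extent = "landuse_1990.img"
arcpy.env.cellSize = 30

# Distance from each 30 m pixel to the nearest road, capped at 250 km.
dist_roads = EucDistance("roads.shp", 250000, 30)
dist_roads.save("dist_roads.img")

# LTM inputs are exchanged as ASCII flat files.
arcpy.RasterToASCII_conversion(dist_roads, "dist_roads.asc")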
Two land use maps are used to determine the locations of observed change (Fig. 5, item 3), and these are necessary for the training runs. The program createpat.exe (Fig. 5B) stores a value of '0' if no change in a land use class occurred and a '1' if change was observed (Fig. 5, item 5). The testing run does not use land use maps for input; the output values are estimated by the neural net in the testing phase of the model. The program createpat uses the same input and exclusionary maps to create a pattern file for testing (Fig. 5, item 3).
A key component of the LTM is converting data from a GIS-compatible format to a neural network format called a pattern file (Fig. 5C). Conversion of files from raster maps to data for use by the neural network requires both transposing the database structure and standardizing all values (Fig. 5, item 6). The maximum value that occurs in each input map for training is also stored in the input file, and this is used to standardize all values from the input maps because the neural network can only use values between 0.0 and 1.0 (Fig. 5C). Createpat.exe also uses the exclusionary map (Fig. 4, item 1) in ASCII format to exclude all locations that are not convertible to the land use class being simulated (e.g., wildlife parks should not convert to urban). For training runs, createpat.exe also selects subsamples of the databases (by location); the percentage of the data to be selected is specified in the input file. Finally, createpat.exe checks the headers of all maps to ensure that they are of the same dimensions.
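The following Python fragment illustrates, in simplified form, the kind of transformation createpat.exe performs; it is not the authors' C implementation, and the driver file names and flat pattern layout are assumptions:

import numpy as np

def asc_to_array(path):
    # Skip the six-line ESRI ASCII header and load the grid.
    return np.loadtxt(path, skiprows=6)

drivers = [asc_to_array(f) for f in ("dist_roads.asc", "dist_urban.asc", "slope.asc")]
maxima = [g.max() for g in drivers]                 # used to standardize to 0.0-1.0
exclude = asc_to_array("exclusion.asc")             # 4 = location cannot transition
change = asc_to_array("observed_change.asc")        # 1 = observed change, 0 = none

with open("train.pat", "w") as pat:
    for idx in np.ndindex(change.shape):
        if exclude[idx] == 4:
            continue                                # excluded cells never enter the pattern file
        inputs = [g[idx] / m for g, m in zip(drivers, maxima)]
        pat.write(" ".join(f"{v:.4f}" for v in inputs))
        pat.write(f"\n{int(change[idx])}\n")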
3.3 Pattern recognition
SNNS has several choices for training; the program that performs training and testing is called batchman.exe (Fig. 4, item 4). As this process uses a subset of data and cannot be parallelized easily, we conducted training on a single workstation. Batchman.exe allows for several options which are employed in the LTM. These include a "shuffle" option, which randomly orders the data presented to the neural network during each pass (i.e., cycle) (cf. Shellito and Pijanowski, 2003; Peralta et al., 2010), the values for initial weights (cf. Denoeux and Lengellé, 1993), the names of the pattern files for input and output, the filename containing the network values, and a set of start and stop conditions (e.g., a stop condition can be set if a given MSE or a certain number of cycles is reached). We control the specific batchman.exe execution parameters using a DOS batch file called train.bat (Fig. 4, item 5). Training is followed over the training cycles with MSE (Fig. 4, item 6), and files (called ltm.net; Fig. 4, item 7) with weights, bias, and activation function values are saved every N cycles. An MSE equal to 0.0 is a condition in which the output of the ANN matches the data perfectly (Bishop, 2005). Pijanowski et al. (2005, 2011) have shown that the LTM stabilizes after less than 100,000 cycles in most cases.
Pseudo-code for train.bat is:
loadNet("ltm.net")
loadPattern("train.pat")
setInitFunc("Randomize_Weights", 1.0, -1.0)
setShuffle(TRUE)
initNet()
while MSE > 0.0 and CYCLES < 500000 do
  trainNet()
  if CYCLES mod 100 == 0 then
    print(CYCLES, " ", MSE)
  endif
  if CYCLES == 100 then
    saveNet("100.net")
  endif
endwhile
We used the SNNS batchman.exe program to create a suitability map (i.e., a map of the "probability" of each cell undergoing urban change) used for forecasting and calibration; to do this, data have to be converted from SNNS format to a GIS-compatible format (Fig. 4, item 11). The test PAT files (Fig. 4, item 13) are converted to probability maps by applying the saved ltm.net file (the file with the weights, bias, and activation values; Fig. 4, item 8) produced from the training run using batchman.exe (Fig. 4, item 8). Output from this execution is called a RES (or result) file (Fig. 4, item 9). The RES file contains the estimates of output created by the neural network. The RES file is then transposed back to an ASCII map (Fig. 4, item 10) using a C program. All values from the RES files are between 0.0 and 1.0; convert_ascii.exe stores values in the ASCII suitability maps (Fig. 4, item 11) as integer values between 0 and 100,000 by multiplying the RES file values by 100,000 so that the raster file in ArcGIS is not floating point (floating point files in ArcGIS require a less efficient storage format, and large floating point files through ArcGIS 10.0 are thus unstable).
We also train on data models where we "hold one input out" at a time (Fig. 4, item 12; Pijanowski et al., 2002a; Tayyebi et al., 2010); for example, in one set distance to roads is held out and the result compared to training with all inputs. Thus, if we start with a five-input-variable neural network model, we hold one input out at a time, create calibration time step maps for each, and save error files over the training cycles for each of the reduced input variable models.
Fig. 5. Data processing steps for converting data from a GIS format to a pattern file format for use in SNNS.
3.4 Calibration
For model calibration (see Bennett et al., 2013, for an extensive review of the topic; our approach follows their recommendations), we consider three different sets of metrics to judge the goodness of fit of the neural network model. The first is mean square error (MSE), which is plotted over training cycles to ensure that the neural network settles at a global minimum value. MSE is calculated as the difference between the estimate produced by the neural network (range 0.0-1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics comprises those created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, where 0 = correctly predicted no change and 3 = correctly predicted change. Values of 1 and 2 represent the two types of error (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric or PCM (the proportion of correctly predicted land use changes to the number of observed land use changes; Pijanowski et al., 2002a), (2) sensitivity (the proportion of false positives) and specificity (the proportion of false negatives), and (3) scaleable PCM values across different window sizes.
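A compact sketch of how the coded calibration map and its summary counts can be derived (assuming the observed, predicted, and exclusion grids have been exported as ASCII files with the codes described above; file names are hypothetical):

import numpy as np

observed = np.loadtxt("observed_change.asc", skiprows=6)    # 1 = observed change, 0 = none
predicted = np.loadtxt("predicted_change.asc", skiprows=6)  # 2 = predicted change, 0 = none
exclusion = np.loadtxt("exclusion.asc", skiprows=6)         # 4 = excluded, 0 = can change

calibration = observed + predicted + exclusion              # coded 0-4 as in the text

tn = np.count_nonzero(calibration == 0)   # correctly predicted no change
fn = np.count_nonzero(calibration == 1)   # observed change that was missed
fp = np.count_nonzero(calibration == 2)   # predicted change that was not observed
tp = np.count_nonzero(calibration == 3)   # correctly predicted change

pcm = 100.0 * tp / (tp + fn)              # percent correct metric over observed changes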
Fig. 6 shows how scaleable PCM values are calculated using a 0/1/2/3/4-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale and window; the number of 3s is incremented by one for every such pair of false positive and false negative cells. The window is then moved one position to the right (Fig. 6B) and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N × N window is passed across the entire simulation area and the final number of 3s is recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 × 5 would be a 7 × 7), and after all of the windows are considered in the map, the process is repeated (note that the count of 3s is reset to the number of 3s in the entire calibration map) and the number of 3s is saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced input model in order to determine the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.
Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
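The scaleable window calculation can be illustrated with a short Python sketch (this is not the production HPC routine; it assumes the calibration map is a numpy array coded 0-4 and simply pairs false positives with false negatives inside each window position):

import numpy as np

def scaleable_pcm(calib, window):
    # Percent correct metric after pairing 1s (false negatives) with 2s
    # (false positives) inside a moving window of the given size.
    observed_change = np.count_nonzero((calib == 1) | (calib == 3))
    tp = np.count_nonzero(calib == 3)
    counted = np.zeros(calib.shape, dtype=bool)      # 1s and 2s already paired
    rows, cols = calib.shape
    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            block = calib[r:r + window, c:c + window]
            used = counted[r:r + window, c:c + window]   # view into counted
            ones = np.argwhere((block == 1) & ~used)
            twos = np.argwhere((block == 2) & ~used)
            pairs = min(len(ones), len(twos))
            tp += pairs                               # each pair counts as a correct prediction
            for i, j in list(ones[:pairs]) + list(twos[:pairs]):
                used[i, j] = True
    return 100.0 * tp / observed_change

calib = np.array([[3, 1, 0, 2], [2, 0, 1, 0], [0, 2, 4, 1], [0, 0, 1, 3]])
print(scaleable_pcm(calib, 3))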
The final step for calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias, and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or dropping drivers that reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5 Validation
We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct an historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model for comparison with a simulated map. Validation metrics (Fig. 4, item 21) include the same as those used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).
3.6 Forecasting
We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds, or ecoregions, or at any spatial unit scale, such as states, counties, or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states, and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files, and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with the highest ranks are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while the other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created that would apply additional exclusionary rules to create an alternative scenario.
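The allocation step can be sketched in Python as follows (this stands in for the authors' C program; the file names, CSV column names, and unit ID shown are hypothetical):

import csv
import numpy as np

# Number of new urban cells per spatial unit and time step, from the quantity model.
with open("quantity_urban.csv") as f:
    quantity = {row["unit_id"]: int(row["cells_2020"]) for row in csv.DictReader(f)}

prob = np.loadtxt("probability_1882862.asc", skiprows=6)    # suitability values, 0-100000
exclude = np.loadtxt("exclusion_1882862.asc", skiprows=6)   # 4 = cannot convert to urban

flat = np.where(exclude.ravel() == 4, -1, prob.ravel())     # excluded cells rank last
ranked = np.argsort(flat)[::-1]                             # highest suitability first

n_new = quantity["1882862"]                                  # cells to grow for this place
change = np.zeros(flat.shape, dtype=np.int8)
change[ranked[:n_new]] = 1                                   # 1 = converted to urban
change = change.reshape(prob.shape)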
Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
3.7 HPC job configuration
We developed a coding schema for the purposes of running the simulation across multiple locations. We used the standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties, and places. FIPS is a hierarchical numbering system that assigns each state a two-digit code and each county within a state a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code, two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
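The zero-padded concatenation that produces these codes is straightforward to reproduce; a minimal Python illustration (the helper names are ours):

def county_fips(state, county):
    # Two-digit state code plus three-digit county code, e.g. (18, 157) -> "18157".
    return f"{state:02d}{county:03d}"

def place_fips(state, place):
    # Two-digit state code plus five-digit place code, e.g. (18, 82862) -> "1882862".
    return f"{state:02d}{place:05d}"

print(county_fips(18, 157), place_fips(18, 82862))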
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any of the places would stop the one large job and require resuming the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs where the number of jobs and associated XML files is still manageable. A failure of one census place would then require less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator for both to assign names for input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide the job into tasks; for example, our national simulation described below at the county and place levels is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML file for a region (state) with its status (item 2). Core resources are shown (Fig. 7, item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language can be used.
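As an illustration of how such job descriptions might be generated (the XML element and attribute names below are simplified placeholders rather than the exact Windows HPC Server 2008 job schema, and the script and directory names are hypothetical):

import os
from xml.etree import ElementTree as ET

STATE_FIPS = {"17": "Illinois", "18": "Indiana"}          # truncated example mapping
os.makedirs("jobs", exist_ok=True)

for fips, name in STATE_FIPS.items():
    job = ET.Element("Job", Name=f"mosaic_{name}")        # placeholder element names
    tasks = ET.SubElement(job, "Tasks")
    ET.SubElement(
        tasks, "Task",
        Name=f"mosaic_places_{fips}",
        CommandLine=f"python polygon_mosaic.py --state {fips}",
        StdOut=f"logs/{fips}.out",
        StdErr=f"logs/{fips}.err",
    )
    ET.ElementTree(job).write(f"jobs/mosaic_{fips}.xml")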
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
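A condensed sketch of the per-state mosaicking step follows (the XML layout, input list, and output paths are illustrative; the geoprocessing call is ArcGIS's standard Mosaic To New Raster tool):

import arcpy
from xml.etree import ElementTree as ET

# The XML input lists the per-place suitability grids to merge for one state.
tree = ET.parse("mosaic_list_18.xml")                       # hypothetical input file
rasters = [node.get("Path") for node in tree.iter("Raster")]

arcpy.MosaicToNewRaster_management(
    input_rasters=rasters,
    output_location="D:/ltm/output/state18",
    raster_dataset_name_with_extension="suitability_18.tif",
    pixel_type="32_BIT_UNSIGNED",
    cellsize=30,
    number_of_bands=1,
)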
4 Execution of LTM-HPC
4.1 Hardware and software description
We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, about a dozen steps, occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores; Windows Server HPC Edition 2008 was installed on the HPCC. Each node was powered by a pair of dual core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, with one used for cluster communication and the other for external cluster communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit, which is composed of nodes, or single physical or logical computers with one or more cores that include one or more processors. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 Gigabit bandwidth.
Fig. 7. Data structure, programs, and files associated with training by the neural network. Item 1 represents an XML file for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows 2008 Server R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers. Based on the results of the file number per folder and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state), and, for forecasts, years and scenarios.
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
4.2 Preliminary tests
The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files that SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3 Data preparation for national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. Original data were vector, and they were converted to raster using the simulation dimensions described above. Data from the states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city, or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county, and places (cities, villages, and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat program was executed per census place to convert the files from ASC to PAT.
4.4 Pattern recognition simulations for national model
We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500.net file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where a value closer to 1.0 means a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per pixel basis, and thus the processing unit did not affect the output value.
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
4.5 Calibration of the national simulation
We trained on six neural network versions of the model: one that contained five input variables and five that each contained four input variables, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full input variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in values for the activation function and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop one out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop one out calibration that (1) all five variables contribute in a positive way toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
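Given MSE logs saved every 100 cycles for the full and reduced models, the percent difference curves in Fig. 11 can be recomputed with a few lines of Python (file names and layout are assumptions):

import numpy as np

mse_full = np.loadtxt("mse_full.txt")                 # one value per 100 training cycles
mse_reduced = np.loadtxt("mse_without_streams.txt")   # same, with one driver held out

# Positive values mean the reduced model fits worse, i.e. the dropped driver
# contributes positively to goodness of fit at that training cycle.
pct_diff = 100.0 * (mse_reduced - mse_full) / mse_full
cycles = np.arange(1, pct_diff.size + 1) * 100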
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0/1/2/3/4-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0/1/2/3/4-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
4.6 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0/1/2/3/4-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0/1/2/3/4-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 100 × 100 window size (i.e., 3 km by 3 km), and PCM values were merged with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
Fig. 11. Drop one out percent difference in MSE from the full driver model.
4.7 Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012) that contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We considered each state as a job, including all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level to create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic prediction pieces at the state level to create a national forecast map. The HPC also enabled us to export error messages to error files so that, if any task fails in a job, the standard out and standard error files provide records of what each program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to topographic limitations in the large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
5 Discussion
We presented an overview of the conversion of a single workstation land change model to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger sized land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.
5.1 Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage this state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array; this is needed despite the large amount of memory contained in the HPC.
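The same streaming idea is easy to illustrate in Python (a sketch, not the authors' C code): an ESRI ASCII grid is processed one row at a time, so memory use does not grow with file size.

def rescale_asc(src_path, dst_path, factor=100000):
    # Stream an ESRI ASCII grid row by row, writing rescaled integer values.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for i, line in enumerate(src):
            if i < 6:                      # copy the six header lines unchanged
                dst.write(line)
                continue
            values = (str(int(float(v) * factor)) for v in line.split())
            dst.write(" ".join(values) + "\n")

rescale_asc("tile_001_res.asc", "tile_001_prob.asc")   # hypothetical file names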
The large number of files was managed using a standard file naming and coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows 2008 Server R2.
The high-level architecture was designed after the proper steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation, and (5) application. Job files were constructed for steps that interfaced each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
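A simple check of this kind can be scripted. The sketch below assumes one standard error file per task in a per-state log folder and treats any non-empty file as a failed task; this convention is illustrative rather than the scheduler's own reporting.

import glob, os

def failed_tasks(log_dir):
    # Return the names of tasks whose standard error files are non-empty.
    failures = []
    for err_file in glob.glob(os.path.join(log_dir, "*.err")):
        if os.path.getsize(err_file) > 0:
            failures.append(os.path.basename(err_file))
    return sorted(failures)

print(failed_tasks(r"\\storage\ltm\logs\18"))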
5.2. Lessons learned from converting the LTM to an HPC environment
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
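A sketch of what the core of such a mosaicking script might look like follows; the file names, workspace paths and parameter values are illustrative assumptions, and the call shown is the standard arcpy Mosaic To New Raster tool rather than a verbatim excerpt of the polygon_mosaic.py script.

import arcpy

state_rasters = ["prob_18.asc", "prob_17.asc", "prob_39.asc"]  # hypothetical inputs
arcpy.env.workspace = r"\\storage\ltm\output\probability"

# Mosaic the per-state probability rasters into a single national raster.
arcpy.MosaicToNewRaster_management(
    input_rasters=";".join(state_rasters),
    output_location=r"\\storage\ltm\output\national",
    raster_dataset_name_with_extension="prob_usa.tif",
    pixel_type="32_BIT_UNSIGNED",
    cellsize=30,
    number_of_bands=1)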
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores, although there is some variation in processing time caused by variation in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g. here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these systems helped to manage workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output at multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at large extents and coarse resolution, or alternatively at small extents and fine resolutions, may hinder the
ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g. Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002), how patterns and amounts of certain land covers (e.g. forests, urban) affect invasive species spread and distributions (e.g. Sharov et al., 1999; Fei et al., 2008), how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g. Dale, 1997;
Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011), how certain land uses such as urban and agriculture increase nutrient and pollutant loads to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b), and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g. shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area would be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON) supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g. Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e. temporal measures of model goodness of fit), (4) the relative cost of an error (i.e. whether an error of location is important to the application), and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States. Patterns of location error differed from those of quantity error; location errors were lower in the east (Fig. 12). The location of errors could be important too if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed for large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
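As a simplified stand-in for the block-level assessment described above, the sketch below compares observed and predicted new-urban rasters in 3 km blocks (100 x 100 cells at 30 m); obs and pred are assumed to be 0/1 NumPy arrays covering the same extent, and the two metrics shown (a per-block error of quantity and a per-block percent-correct of change locations) are illustrative rather than the full scaleable-window procedure.

import numpy as np

def block_errors(obs, pred, block=100):
    # Trim to a whole number of blocks, then aggregate cell counts per block.
    rows = obs.shape[0] // block * block
    cols = obs.shape[1] // block * block
    o = obs[:rows, :cols].reshape(rows // block, block, cols // block, block)
    p = pred[:rows, :cols].reshape(rows // block, block, cols // block, block)
    obs_count = o.sum(axis=(1, 3))
    pred_count = p.sum(axis=(1, 3))
    quantity_error = pred_count - obs_count        # error of quantity per block
    hits = (o * p).sum(axis=(1, 3))                # correctly located new urban cells
    pcm = np.full(obs_count.shape, np.nan)
    nonzero = obs_count > 0
    pcm[nonzero] = 100.0 * hits[nonzero] / obs_count[nonzero]
    return quantity_error, pcm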
6 Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using an HPC as a data-parallel programming framework. We focus on efficiently handling the challenges raised by the nature of large datasets and show how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:
(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using the unique identity for each unit, in the Python environment (see the sketch after this list).
(2) Better memory usage: the source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of
the ASCII files, by reading each line separately into an array in the C environment.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.
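For point (1), a minimal sketch of the clip-and-convert step is given below; the shapefile, field name, raster path and output folder are hypothetical, and the geoprocessing calls are standard arcpy/Spatial Analyst tools rather than a verbatim excerpt of the polygon_clip.py script described earlier.

import arcpy
arcpy.CheckOutExtension("Spatial")

units = r"\\storage\ltm\boundaries\places.shp"   # hypothetical unit boundaries
raster = r"\\storage\ltm\input\dist_roads"       # hypothetical national input raster

rows = arcpy.SearchCursor(units)
for row in rows:
    fips = row.getValue("FIPS")                  # unique identity code for the unit
    arcpy.MakeFeatureLayer_management(units, "unit_lyr", "FIPS = '%s'" % fips)
    clipped = arcpy.sa.ExtractByMask(raster, "unit_lyr")
    arcpy.RasterToASCII_conversion(clipped, r"\\storage\ltm\asc\%s.asc" % fips)
    arcpy.Delete_management("unit_lyr")
del rows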
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. Geoforum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., July 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. Neural Netw. (IJCNN), 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., June 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
item 1) as this is done once for each simulation this LTM
component is not automated A C program called createpatexe
(Fig 4 item 2) is used to convert spatial data to neural net 1047297les
called a pattern 1047297le (Fig 4 item 3) given an PAT extension data
are transposed into the ANN structure Data necessary to process
1047297les for the training run for neural net simulation are model inputs
two land use maps separated by approximately 10 years or more
and a map of locations that need tobe excluded from the neural net
simulation Vector shape 1047297les (eg roads) and raster 1047297les (eg
digital elevation models land usecover maps) are loaded into
ArcGIS and ESRIrsquos Spatial Analyst is used to calculate values per
pixel in the simulation domain that are used as inputs to the neural
net A raster 1047297le is selected (eg base land use map) to set ESRI
Spatial Analyst Environment properties such as cell size numberof
row and columns for all data processing to ensure that all inputs
have standard dimensions A separate 1047297le referred to as the
exclusionary zone map (Fig 5 item 1) is created using a GIS
Exclusionary maps contain locations where a land use transition
cannot occur in the future For a model con1047297gured to simulate ur-
ban for example areas that are in protected areas (eg public
parks) open water or are already urban are coded with a lsquo4rsquo in the
exclusionary zone map This exclusionary map is used in several
steps of the LTM for excluding data that is converted for use inpattern recognition model calibration and model forecasting The
coding of locations with a lsquo4rsquo becomes more obvious below under
the presentation of model calibration Inputs (Fig 5 item 2) are
created by applying spatial transition rules outlined in Pijanowski
et al (2000) A frequent input map is distance to roads for our
LTM-HPC application example below Spatial Analystrsquos Euclidean
Distance Tool is used to calculate the distance each pixel is from the
nearest road All GIS data for use in the LTM are written out as an
ASCII 1047298at 1047297le (Fig 5A)
Two land use maps are used to determine the locations of
observed change (Fig 5 3) and these are necessary for the
training runs The program createpatexe (Fig 5B) stores a value of
lsquo0rsquo if no change in a land use class occurred and a lsquo1rsquo if change was
observed (Fig 5 5) The testing run does not use land use mapsfor input and the output values are estimated by the neural net in
the phase of the model The program createpat uses the same
input and exclusionary maps to create a pattern 1047297le for testing
(Fig 5 item 3)
A key component of the LTM is converting data from a GIS
compatible format to a neural network format called a pattern 1047297le
(Fig 5C) Conversion of 1047297les from raster maps to data for use by the
neural network requires both transposing the database structure
and standardizing all values (Fig 5 6) The maximum value that
occurs in the input maps for training is also stored in the input 1047297le
and this is used to standardize all values from the input maps
because the neural network can only use values between 00 and
10 (Fig 5C) Createpatexe also uses the exclusionary map (Fig 4
1) in ASCII format to exclude all locations that are not convertibleto the land use class being simulated (eg wildlife parks should not
convert to urban) For training runs createpatexe also selects
subsamples of the databases (by location) the percentage of the
data to be selected is speci1047297ed in the input 1047297le Finally crea-
tepatexe also checks the headers of all maps to ensurethat they are
of the same dimensions
33 Pattern recognition
SNNS has several choices for training the program that per-
forms training and testing is called batchmanexe (Fig 4 item 4)
As this process uses a subset of data and cannot be parallelized
easily we conducted training on a single workstation Batchma-
nexe allows for several options which are employed in the LTM
These include a ldquoshuf 1047298erdquo option which randomly orders the data
presented to the neural network during each pass (ie cycle) (cf
Shellito and Pijanowski 2003 Peralta et al 2010) the values for
initial weights (cf Denoeux and Lengelleacute 1993) the name of the
pattern 1047297les for input and output the 1047297lename containing the
network values and a set of start and stop conditions (eg a stop
condition can be set if a MSE or a certain number of cycles is
reached) We control the speci1047297c batchmanexe execution pa-
rameters using a DOS batch 1047297 le called trainbat (Fig 4 item 5)
Training is followed over the training cycles with MSE ( Fig 4 item
6) and1047297les (called ltmnet Fig 4 item 7)with weights bias and
activation function values saved every N number of cyclesAn MSE
equal to 00 is a condition that output of ANN matches the data
perfectly (Bishop 2005) Pijanowski et al (2005 2011) has shown
that LTM stabilizes after less than 100000 cycles in most cases
Pseudo-code for the TRAINBAT is
loadNet(ldquoltmnetrdquo)
loadPattern(ldquotrainpatrdquo)
setInitFunc (ldquoRandomize_Weightsrdquo 10 10)
setShuf 1047298e (TRUE)
initNet()
trainNet()while MSE gt 00 and CYCLES lt500000 do
if CYCLES mod 100 frac14 frac14 0 then
print (CYCLESldquo rdquoMSE)
endif
if CYCLES frac14 frac14 100 then
saveNet (ldquo100netrdquo)
endif
We used the SNNS batchmanexe program to create a suitability
map (ie a map of ldquoprobabilityrdquo of each cell undergoing urban
change) used for forecasting and calibration to do this data have to
be converted from SNNS format to a GIS compatible format (Fig 4
item 11) The testPAT 1047297les (Fig 4 item 13) are converted to
probability maps by applying the saved ltmnet 1047297le (1047297le with theweights bias and activation values Fig 4 item 8) produced from
the training run using batchmanexe (Fig 4 item 8) Output from
this execution is calleda RES(or result)1047297le (Fig 4 item9)The RES
1047297le contains estimates of output created by the neuralnetworkThe
RES 1047297le is then transposed back to an ASCII map (Fig 4 item 10)
using a C program All values from the RES 1047297les are between 00
and 10 convert_asciiexe also stores values in the ASCII suitability
(Fig 4 item 11) maps as integer values between 0 and 100000 by
multiplying the RES 1047297le values by 100000 so that the raster1047297le in
ArcGIS is not 1047298oating point (1047298oating point 1047297les in ArcGIS require a
less ef 1047297cient storage format and thus large 1047298oating point 1047297les
through ArcGIS 100 are unstable)
We also train on data models where we ldquohold one input outrdquo at
a time (Fig 4 item 12 Pijanowski et al 2002a Tayyebi et al2010) for example in one set distance to roads is held out and
compared to having all inputs in the training Thus if we start with
a 1047297ve input variable neural network model we hold one out at a
time and create calibration time step maps for each and save error
1047297les over training cycles for each of the reduced input variable
models
34 Calibration
For model calibration (see Bennett et al 2013 for an extensive
review of the topic our approach follows their recommendations)
we consider three different sets of metrics to judge the goodness of
1047297t of the neural network model The 1047297rst is mean square error
(MSE) which is plotted over training cycles to ensure that the
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 255
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 719
Fig 5 Data processing steps for converting data from a GIS format to a pattern 1047297le format for use in SNNS
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 819
neural network settles at a global minimum value MSE is calcu-
lated as the difference between the estimate produced by the
neural network (range 00e10) and the observed value of land use
change (0 or 1) MSE values are saved very 100 cycles and training is
generally followed out to about 100000 cycles The second set of
goodness of 1047297t metrics is those created from a calibration map A
calibration map is also constructed within the GIS using three maps
coded specially for assessment of model goodness of 1047297t A map of
observed change between the two historical maps (Fig 4 item 16)
is created such that observed change frac14 0 and no change is frac14 1 A
mapthat predicts the same land use changes over the same amount
of time (Fig 4 15) is coded so that predicted change is frac14 2 and no
predicted change is frac14 0 These two maps are then summed along
with the exclusionary zone map that is coded 0 frac14 location can
change and 4 frac14 location that needs to be excluded The resultant
calibration map (Fig 4 17) generates values 0 through 4 with
correct predictions of 0 frac14 correctly predicted no change and
3 frac14 correctly predicted change Values of 2 and 3 represent
different errors (omission and commission or false positive and
false negative) The proportion of each type of error and correctly
predicted location are used to calculate (1) the proportion of
correctly predicted change locations to the number of observed
change cells also called the percent correct metric (proportion of
correctly predicted land use changes to the number of observed
land use changes) or PCM (Pijanowski et al 2002a) (2) sensitivity
(the proportion of false positives) and speci1047297city (the proportion of
false negatives) and (3) scaleable PCM values across different
window sizes
Fig 6 shows how scaleable PCM values are calculated using a
01234-coded calibration map across different window sizes The
1047297rst step is to calculate the total number of true positives (cells
coded as 3s)in the calibration map (Fig 6A) For a givenwindowof
say(eg5 cells by 5 cells) a pair of false positives (cells code as 2s)
and false negative (cells coded as 1s) are considered together as a
correct prediction at that scale and window the number of 3s is
incremented by one for every pair of false positive and false
negative cells The window is then movedone position to theright
(Fig 6B) and pairs of 1s and2s areagain added to the total number
of 3s for that calibration map such that any 1s or 2s already
counted are not considered This moving N N window is passed
across the entire simulation area and the 1047297nal number of 3s
recorded (Fig 6C) The window size is then incremented by 2 (ie
the next window size after a 5 5 would be a 7 7) and after all
of the windows are considered in the map the process is repeated
Fig 6 Steps in the calculation of PCM across a moving scaleable window Part 6A calculates the total number of true positives (coded as 3s) The window is then moved one position
to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s This moving window is passed across the entire area and the 1047297nal number of 3s recorded (Part
6C) The window size is then incremented by 2 and the process is repeated Part 6D gives PCM across scaleable window sizes
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 257
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 919
(note that the number of 3s is reset to the number of 3s is the
entire calibration map) and the number of 3s saved for that
window size Window sizes that we often plot are between 3 and
101 Fig 6D gives an example PCM across scaleable window sizes
Note in this plot that the PCM begins to exceed 50 around a
window size of 9 9 which for this simulation conducted at
100x 100m means that PCM reaches 50 at 900m 900m The
scaleable window plots are also made for each reduced input
model as well in order to determine the behavior of the training of
the neural network against the goodness of 1047297t of the calibration
maps by input
The 1047297nal step for calibration is the selection of the network 1047297le
(Fig 4 items 16e19) with inputs that best represent land use
change and an assessment of how well the model predicts across
different spatial scales The network 1047297le with the weights bias
activation values are saved for the model with the inputs consid-
ered the best for the model application If the model does not
perform adequately (Fig 4 item 19) the user may consider other
input drivers or dropping drivers which reduce model goodness of
1047297t However if the drivers selected provide a positive contribution
to the goodness of 1047297t and the overall model is deemed adequate
then this network 1047297le is saved and used in the next step model
validation
35 Validation
We follow the recommended procedures of Pontius et al
(2004) and Pontius and Spencer (2005) to validate our model
Brie1047298y we use an independent data set across time to conduct an
historical forecast to compare a simulated map (Fig 4 15) with an
observed historical land use map that was not used to build the
ANN model (Fig 4 20) For example below (Section 46) we
describe how we use a 2006 land use map that was not used to
build the model to compare to a simulated map Validation metrics
(Fig 4 21) include the same as that used for calibration namely
PCM of the entire map or spatial unit sensitivity speci1047297city PCM
across window sizes and error of quantity It should be noted thatbecause we 1047297x the quantity of the land use class that changes be-
tween time 1 and time 2 for calibration we do so for validation as
well (eg between time 2 and time 3 the number of cells that
changed in the observed maps are used to1047297x the quantity of cells to
change in the simulation that forecasts time 3)
36 Forecasting
We designed the LTM-HPC so that the quantity model (Fig 4
24) of the forecasting component can be executed for any spatial
unit category like government units watersheds or ecoregions or
any spatial unit scale such as states counties or places The
quantity model is developed of 1047298ine using Exceland algorithms that
relate a principle index driver (PID see Pijanowski et al 2002a)that scales the amount of land use change (eg urban or crops) per
person In theapplication described below we execute the model at
several spatial unit scales e cities states and the lower 48 states
Using a combination of unique unit IDs (eg federal information
processing systems (FIPS) codes are used for government unit IDs)
a 1047297le and directory-naming system XML 1047297les and python scripts
the HPC was used to manage jobs and tasks organized by the
unique unit IDs
We next use a program written in C to convert probability
values to binarychange values (0 are cells without change and 1 are
locations of change in prediction map) using input from the
quantity change model (Fig 4 24) The quantity change model
produces a table of the number of cells to grow for each time step
for each spatial unit froma CSV 1047297
le Rowsin the CSV 1047297
le contain the
unique unit IDS and the number of cells to transition for each time
step The program reads the probability map for the spatial unit
(ie a particular city) being simulated counts the number of cells
for each probability value and then sorts the values and counts by
rank The original order is maintained using an index for each re-
cord The probability values with high rank are then converted to
urban (code 1) until the numbers of new urban cells for each unit is
satis1047297ed while other cells (code 0) remain without change A
separate GIS map (Fig 4 25) may be created that would apply
additional exclusionary rules to create an alternative scenario
Output from the model (Fig 4 item 26) is used for planning or
natural resource management (Skole et al 2002 Olson et al 2008)
(Fig 4 item 27) as input to other environmental models (eg Ray
et al 2012 Wiley et al 2010 Mishra et al 2010 or Yang et al
2010) (Fig 4 item 28) or the production of multimedia products
that can be ported to the internet (Fig 4 item 29)
37 HPC job con 1047297 guration
We developed a coding schema for the purposes of running the
simulation across multiple locations We used a standard
numbering system from the Federal Information Processing Sys-
tems (FIPS) that is associated with states counties and places FIPSis a hierarchical numbering system that assigns states a two-digit
code and a county in those states a three-digit code A speci1047297c
county is thus given a 1047297ve-digit integer value (eg 18157 for Tip-
pecanoe County Indiana) and places are given a seven-digit code
two digits for the state and 1047297ve digitsfor the place (eg1882862 for
the city of West Lafayette Indiana)
Con1047297guring HPC jobs and constructing the associated XML 1047297les
can be approached in different ways The 1047297rst is to develop one job
and one XML 1047297le per model simulation component (eg mosaick-
ing individual census place spatial maps into a national map) For
our LTM-HPC application where we would need to mosaic over
20000 census places a job failure for any of the places would result
in the one large job stopping and then addressing the need to
resume the execution at the point of failure A second approachused here is to group tasks into numerous jobs where the number
of jobs and associated XML 1047297les is still manageable A failure of one
census place would require less re-execution and trouble shooting
of that job We often grouped the execution of census place tasks by
state using the FIPS designator for both to assign names for input
and output 1047297les
Five different jobs are part of the LTM-HPC (Fig 7) those for
clipping a large 1047297le into smaller subsets another for mosaicking
smaller 1047297les into one large 1047297le one for controlling the calibration
programs another job for creating forecast maps and a 1047297fth for
controlling data transposing between ASCII 1047298at 1047297les and SNNS
pattern 1047297les XML 1047297les are used by the HPC job manager to subdi-
vide the job into tasks for example our national simulation
described below at county and places levels is organized by stateand thus the job contains 48 tasks one for each state Fig 7 is a
sample Windows jobs manager interface for mosaicking over
20000 places Each topline Fig 7 (item 1) represents an XML for a
region (state) with the status (item 2) Core resources are shown
(Fig 7 item 3) A tab (Fig 7 item 4) displays the status of each
task (Fig 7 item 5) within a job We used a python script to create
each of the xml 1047297les although any programming or scripting lan-
guage can be used
We then used an ArcGIS python script to mosaic the ASCII
maps an XML 1047297le that lists 1047297le and path names was used as input
to the python script Mosaicking and clipping are conducted in
ArcGIS using python scripts polygon_clippy and poly-
gon_mosaicpy Both ArcGIS python scripts read the digital spatial
unit codes from a variable in the shape 1047297
le attribute table and
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268258
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1019
names 1047297les based on the unit code The resultant mosaicked
suitability map produced from training and data transposing
constitutes a map of the entire study domain Creating such a
suitability map of the entire simulation domain allows us to (1)
import the ASCII 1047297le into ArcGIS in order to inspect and visualize
the suitability map (2) allow the researcher to use different
subsetting and mosaicking spatial units (as we did below) and (3)
allow the researcher to forecast at different spatial units (we also
illustrate this below as well)
4 Execution of LTM-HPC
41 Hardware and software description
We executed the LTM-HPC on three computer systems (Fig 8)
One computer a high-end workstation was used to process inputs
for the modeling using GIS A windows cluster was used to
con1047297gure the LTM-HPC and all of the processing of about a dozen
steps occurred on this computer system A third computer system
stored all of the data for the simulations Speci1047297c con1047297guration of
each computer system follows
Data preparation was performed on a high-end Windows 7
Enterprise 64-bit computer workstation equipped with 24 GB of
RAM a 256 GB solid state hard drive a 2 TB local hard drive and
ArcGIS 100 with Spatial Analyst extension Speci1047297c procedures
used to create each of the data layers for input to the LTM can be
found elsewhere (Pijanowski et al 1997 Tayyebi et al 2012)
Brie1047298y data were processed for the entire contiguous United States
at 30m resolution and distance to key features like roads and
streams were processed using the Euclidean Distance tool in Arc-
GIS setting all output to double precision integer given the large
size of each dataset we limited the distance to 250 km Once thedata were processed on the workstation 1047297les were moved to the
storage server
The hardware platform on which the parallelization was carried
out was a cluster of HPC consisting of 1047297ve nodes containing a total
of 20 cores Windows Server HPC Edition 2008 was installed on the
HPCC Each node was powered by a pair of dual core AMD Opteron
285 processors and 8 GB of RAM Each machine had two 1 GBs
network adapters with one used for cluster communication and the
other for external cluster communication Each node had 74 GB of
hard drive space that was used for the operating system and soft-
ware but was not used for modeling The HPC cluster used for our
national LTM application consisted of one server (ie head node)
that controls other servers (ie compute nodes) which read and
write data from a data server A cluster is the top-level unit which
Fig 7 Data structure programs and 1047297les associated with training by the neural network Item 1 represents an XML fora region (state) with the status (item 2) Core resources are
shown in item 3 Item 4 displays the status of each task (item 5) within a job
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 259
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119
is composed of nodes or single physical or logical computers with
one or more cores that include one or more processors All
modeling data was read and written to a storage machine located in
another building and transferred across an intranet with a
maximum of 1 Gigabit bandwidth
The data storage server was composed of 24 two terabyte
7200 RPM drives in a RAID 6 con1047297guration This server also had
Windows 2008 Server R2 installed Spot checks of resource moni-
toring showed that the HPC was not limited by network or disk
access and typically ran in bursts of 100 CPU utilization ArcGIS
100 with the Spatial Analyst extension was installed on all servers
Based on the results of the 1047297le number per folder and the use of
unique unit IDs as part of the 1047297le and directory-naming scheme we
used a hierarchical directory structure as shown in Fig 9 The upper
branches of the directory separate 1047297les into input and output di-
rectories and subfolders store data by type (ASC or PAT 1047297les)
location unit scale (national state) and for forecasts years and
scenarios
Fig 9 Directory structure for the LTM-HPC simulation
Fig 8 Computer systems involved in the LTM-HPC national simulations
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268260
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219
42 Preliminary tests
The primary limitation in 1047297le size comes from SNNS This limit
was reached in the probability map creation phase in several
western US counties when the RES1047297lewhichcontainsthe values
for all of the drivers (eg distance to urban etc) crashed To
overcome this issue we divided the country into grids that pro-
duced 1047297les that SNNS was capable of handling for the steps up to
and including pattern 1047297le creation which is done on a pixel-by-
pixel basis and is not spatially dependent For organization and
performance reasons1047297leswere grouped into folders by stateAs the
SNNS only uses the probability values in the projection phase we
were able to project at the county level
Early tests with mosaicking the entire country at once were
unsuccessful and led to mosaicking by state The number of states
and years of projection for each state made populating the tool
1047297elds in ArcGIS 100 Desktop a time intensive process We used
python scripts to overcome this issue and the HPC to process
multiple years and multiple states at the same time Although it is
possible to run one mosaic operation for each core we found that
running 24 operations on a machine led to corrupted mosaics We
attribute this to the large 1047297le sizes and limited scratch space
(approximately 200 GB) and to overcome this problem we limitedthe number of operations per server by specifying each task to 6
cores for most states and 12 cores for very large states such as CA
and TX
43 Data preparation for national simulation
We used ArcGIS 100 and Spatial Analyst to prepare 1047297ve inputs
for use in training and testing of the neural network Details of the
data preparation can be found elsewhere (Tayyebi et al 2012)
although a brief description of processing and the 1047297les that were
created follow We used the US Census 2000 road network line
work to create two road shape 1047297les highways and main arterials
We used ArcGIS 100 Spatial Analyst to calculate the distance that
each pixel was away from the nearest road Other inputs includeddistance to previous urban (circa 1990) distance to rivers and
streams distance to primary roads (highways) distance to sec-
ondary roads (roads) and slope
Preparing data for neural net training required the following
steps Land use data from approximately 1990 and 2000 were
collected from 18 different municipalities and 3 states These data
were derived from aerial photography by local government and
were thus deemed to be of high quality Original data were vector
and they were converted to raster using the simulation dimensions
described above Data from states were used to select regions in
rural areas using a random site selection procedure (described in
Tayyebi et al 2012)
Maps of public lands were obtained from ESRI Data Pack 2011
(ESRI 2011) Public land shape 1047297les were merged with locations of urban and open water in 1990 (using data from the USGS national
land cover database) and used to create the exclusionary layer for
the simulation Areas that were not located within the training area
were set to ldquono datardquo in ArcGIS Data from the US census bureau for
places is distributed as point location data We used the point lo-
cations (the centroid of a town city or village) to construct Thiessen
polygons representing the area closest to a particular urban center
(Fig 10) Each place was labeled with the FIPS designated census
place value
We executed the national LTM-HPC at three different spatial
scales and using two different kinds of spatial units ( Tayyebi et al
2012) government and 1047297xed-size tiles The three scales for our
government unit simulations were national county and places
(cities villages and towns)
All input maps were created at a national scale at 30m cell
resolution For training data were subset using ArcGIS on the local
computer workstation and pattern 1047297les created for training and
1047297rst phase testing (ie calibration) We also used the LTM-clippy
Python script to create subsamples for second phase testing In-
puts and the exclusionary maps were clipped by census place and
then written out as ASC 1047297les The createpat 1047297le was executed per
census place to convert the 1047297les from ASC to PAT
44 Pattern recognition simulations for national model
We presented a training 1047297le with 284477 cases (ie records or
locations) to the neural network using a feedforward back propa-
gation algorithm We followed the MSE during training saving this
value every 100 cycles We found that the minimum MSE stabilized
globally at 49500 cycles The SNNS network 1047297le (NET 1047297le) was
produced every 100 cycles so that we could analyze the training
later but the network 1047297le for 49500 cycles was saved and used to
estimate output (iepotential fora land usechange to occur at each
location) for testing
Testing occurred at the scale of tiles The LTM-clippy script was
used to create testing pattern 1047297les for each of the 634 tiles The
ltm49500nNET 1047297le was applied to each tile PAT 1047297le to create an
RES 1047297le for each tile RES 1047297les contain estimates of the potential
for each location to change land use (values 00 to 10 where closer
Fig 10 Spatial units involved in the LTM-HPC national simulation
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 261
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1319
to 10 means higher chance of changing) The RES 1047297les are con-
verted to ASC 1047297les using a C program called convert2asciiexe The
ASC probability maps for all tiles were mosaicked to a national
raster 1047297le using an ArcGIS python script All original values which
range from 00 to 10 are multiplied by 100000 by convert2ascii so
that they may be stored as double precision integer
We used three-digit codes as unique numbers for naming tile
1047297les and tracking them as tasks within states on HPC (634 grids
in conterminous of USA) Each tile contained a maximum of
4000 rows and 4000 columns of 30m pixels We were able to do
this because the steps leading up to prediction work on a per
pixel basis and thus the processing unit did not affect the output
value
45 Calibration of the national simulation
We trained on six neural network versions of the model one
that contained 1047297ve input variables and 1047297ve that contained four
input variables each where we dropped out one input variable from
the full input model We saved the MSE at each 100 cycles through
100000 cycles and then calculated the percent difference of MSE
from the full input variable model (Fig 11) Note that all of thevariables have a positive contribution to model goodness of 1047297t
during training distance to highways provides the neural network
with the most information necessary for it to 1047297t input and output
data This plot also illustrates how the neural network behaves
between 0 cycles and approximately cycle 23000 the neural
network makes large adjustments in weights and values for acti-
vation function and biases At one point around 7000 cycles the
model does better (ie percentage difference in MSE is negative)
without distance to streams as an input to the training data
Eventually all drop one out models stabilize near 50000 which is
where the full 1047297ve-variable model also stabilizes At this number of
training cycles distance to highway contributes about 2 of the
goodness of 1047297t distance to urban about 15 slope about 12 and
distance to road and distance to streams each about 07 Weconclude from this drop one out calibration that (1) all 1047297ve
variables contribute in a positive way toward the goodness of 1047297t
and (2) that 49500 cycles provide enough learning of the full 1047297ve-
variable model to use for validation
The second step of calibration is to examine how well the model
produces spatial maps of change compared to the observed data
(eg Fig 5A) We use the locations of observed change from the
training map that are outside the training locations to create a
01234-coded calibration map The XML_Clip_BASE HPC jobs 1047297le
was modi1047297ed to receive the 01234-coded calibration map and
general statistics (eg percentage of each value) are created for the
entire simulation domain and for smaller subunits (eg spatial
units)
46 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map
from the LTM to create a 01234-coded validation map that was
assessed for goodness of 1047297t in several ways two of which are
presented here The 1047297rst goodness of 1047297t metric examined how
well the model predicted the correct number of urban cells per
simulation tile This analysis was not computationally rigorous so
this assessment was performed on the single workstation Weused ArcGIS 100 TabluateArea command in Spatial Analyst to
calculate the amount of area for each of the codes 0 1 2 and 3
The percentage of the amount of urban correctly predicted was
then mapped (Fig 12A) Note that the model predicted the cor-
rect amount of urban cells in most simulation tiles Only a few
along coastal areas contained errors in quantity of urban greater
than 5
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric was saved at the 10 x 10 window size (i.e., 3 km by 3 km) and PCM values were merged with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine whether the scale of prediction changes spatially.

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.
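The scaleable window routine itself can be sketched compactly. The version below is a simplification rather than the production code: it steps the window over non-overlapping blocks instead of sliding it one cell at a time, but it illustrates the idea of crediting each omission/commission pair inside a window as a correct prediction at that scale.

```python
import numpy as np

def scaleable_pcm(calib, n):
    """Percent correct metric for a 01234-coded map at window size n (cells)."""
    correct = np.count_nonzero(calib == 3)                  # true positives
    observed_change = correct + np.count_nonzero(calib == 1)
    rows, cols = calib.shape
    for i in range(0, rows, n):
        for j in range(0, cols, n):
            block = calib[i:i + n, j:j + n]
            # each omission (1) / commission (2) pair within the window is
            # counted as correct at this scale
            correct += min(np.count_nonzero(block == 1),
                           np.count_nonzero(block == 2))
    return 100.0 * correct / max(observed_change, 1)

# e.g. PCM for one tile at the 3 x 3 and 101 x 101 window sizes:
# calib = np.loadtxt("validation_tile_001.asc", skiprows=6).astype(int)
# print(scaleable_pcm(calib, 3), scaleable_pcm(calib, 101))
```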
4.7. Forecasting

Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. Our quantity model (Tayyebi et al., 2012) contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as separate tasks, to create a forecast map for each polygon.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.

We embedded the path to the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file, and ran this job file for each state on the HPC to convert the probability map into a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level; similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the state-level pieces into a national forecast map. The HPC also enabled us to export error messages to error files: standard out and standard error files keep a record of what each program did during execution, so that any tasks that fail within a job can be identified. We embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
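The per-polygon conversion performed by the prediction code amounts to allocating the quantity model's cell quota to the highest-ranked suitability values. A minimal sketch of that allocation step is shown below; the file names and the quota are assumptions, and ties and exclusion zones are ignored.

```python
# Convert a per-polygon probability map into a binary forecast map by
# turning the n highest-probability cells to urban (code 1).
import numpy as np

def allocate_growth(prob, n_new_cells):
    forecast = np.zeros(prob.shape, dtype=np.uint8)
    top = np.argsort(prob.ravel())[::-1][:n_new_cells]   # highest-ranked cells
    forecast.flat[top] = 1
    return forecast

# Hypothetical example for one census place and one decadal time step:
# prob = np.loadtxt("prob_1882862.asc", skiprows=6)
# np.savetxt("forecast_1882862_2020.asc", allocate_growth(prob, 1520), fmt="%d")
```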
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps are shown for eight regions in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to the topographic limitations of that large metropolitan area, whereas dispersed urbanization is characteristic of flat areas like Florida, Atlanta and the Northeast.
5. Discussion

We presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.
5.1. Challenges of executing large-scale models

We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array; this was necessary despite the large amount of memory contained in the HPC.
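The same streaming discipline is easy to illustrate; the sketch below (hypothetical file names and transform) processes a large ESRI ASCII grid one line at a time instead of loading it into memory.

```python
# Stream a large ASCII grid line by line, applying a per-value transform.
def stream_convert(in_path, out_path, transform):
    with open(in_path) as src, open(out_path, "w") as dst:
        for i, line in enumerate(src):
            if i < 6:                       # copy the 6-line ASCII grid header
                dst.write(line)
            else:
                dst.write(" ".join(str(transform(v)) for v in line.split()) + "\n")

# e.g. rescale probabilities stored as 0-100000 integers back to 0.0-1.0:
# stream_convert("prob_texas_48201.asc", "prob_scaled.asc",
#                lambda v: int(v) / 100000.0)
```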
The large number of files was managed using a standard file-naming code system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
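Because each task carries its own standard output and standard error paths, the job descriptions can be generated programmatically, one job per state with one task per polygon. The sketch below is illustrative only: the XML element and attribute names and the executable are assumptions and would need to be matched to the Windows HPC job schema actually used.

```python
# Generate a per-state job description with per-polygon tasks, each with
# its own stdout/stderr file; element names are assumptions, not the real schema.
import xml.etree.ElementTree as ET

def build_job_xml(state_fips, place_ids, out_path):
    job = ET.Element("Job", Name="LTM_Pred_%s" % state_fips)
    tasks = ET.SubElement(job, "Tasks")
    for pid in place_ids:
        ET.SubElement(tasks, "Task",
                      Name="place_%s" % pid,
                      CommandLine="predict.exe prob_%s.asc quota_%s.csv" % (pid, pid),
                      StdOutFilePath="logs\\%s.out" % pid,
                      StdErrFilePath="logs\\%s.err" % pid)
    ET.ElementTree(job).write(out_path)

# build_job_xml("18", ["1882862", "1836003"], "XML_Pred_18.xml")
```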
5.2. Lessons learned from converting the LTM to an HPC

The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
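A condensed sketch of such a mosaicking script is shown below. It is not the script used in the study; the paths, output name and pixel type are assumptions.

```python
# Mosaic the per-unit ASCII probability pieces for one state into a single
# raster with the ArcGIS mosaic raster tool.
import os
import arcpy

def mosaic_state(state_fips, piece_dir, out_dir):
    pieces = [os.path.join(piece_dir, f)
              for f in os.listdir(piece_dir) if f.endswith(".asc")]
    arcpy.MosaicToNewRaster_management(
        pieces,                       # input rasters (per-place pieces)
        out_dir,                      # output workspace
        "prob_%s.tif" % state_fips,   # output raster name
        "",                           # coordinate system (inherit)
        "32_BIT_UNSIGNED",            # pixel type
        "",                           # cell size (inherit)
        1)                            # number of bands

# mosaic_state("18", r"D:\ltm\output\asc\18", r"D:\ltm\output\mosaic")
```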
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the time required relative to the LTM is approximately halved for every doubling of cores, with the deviation from an exact halving caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution

Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse-resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase the nutrients and pollutants delivered to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules needs to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous enough that several transition rules need to be applied in the same simulation, with rules assigned to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States collectively cover large extents but individually are relatively small features (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially those of vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations

We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine-resolution accuracy, (2) a quantification of the variability of fine-resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, whereas the patterns of location error differed from those of quantity and were lower in the east (Fig. 12). The location of errors could also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed for large big data simulations such as the one presented here; more research in this area is needed to make a full assessment of how well a model like this performs.
6. Conclusions

This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provides the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using the unique identity for each unit, within the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed so that the calculations performed by the LTM-HPC are completely independent of the size of the ASCII files, by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H.B., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center, June 1997.
data This plot also illustrates how the neural network behaves
between 0 cycles and approximately cycle 23000 the neural
network makes large adjustments in weights and values for acti-
vation function and biases At one point around 7000 cycles the
model does better (ie percentage difference in MSE is negative)
without distance to streams as an input to the training data
Eventually all drop one out models stabilize near 50000 which is
where the full 1047297ve-variable model also stabilizes At this number of
training cycles distance to highway contributes about 2 of the
goodness of 1047297t distance to urban about 15 slope about 12 and
distance to road and distance to streams each about 07 Weconclude from this drop one out calibration that (1) all 1047297ve
variables contribute in a positive way toward the goodness of 1047297t
and (2) that 49500 cycles provide enough learning of the full 1047297ve-
variable model to use for validation
The second step of calibration is to examine how well the model
produces spatial maps of change compared to the observed data
(eg Fig 5A) We use the locations of observed change from the
training map that are outside the training locations to create a
01234-coded calibration map The XML_Clip_BASE HPC jobs 1047297le
was modi1047297ed to receive the 01234-coded calibration map and
general statistics (eg percentage of each value) are created for the
entire simulation domain and for smaller subunits (eg spatial
units)
46 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map
from the LTM to create a 01234-coded validation map that was
assessed for goodness of 1047297t in several ways two of which are
presented here The 1047297rst goodness of 1047297t metric examined how
well the model predicted the correct number of urban cells per
simulation tile This analysis was not computationally rigorous so
this assessment was performed on the single workstation Weused ArcGIS 100 TabluateArea command in Spatial Analyst to
calculate the amount of area for each of the codes 0 1 2 and 3
The percentage of the amount of urban correctly predicted was
then mapped (Fig 12A) Note that the model predicted the cor-
rect amount of urban cells in most simulation tiles Only a few
along coastal areas contained errors in quantity of urban greater
than 5
The second goodness of 1047297t assessment highlights the use of the
HPC to calculate a computationally rigorous calculation that char-
acterizes location error The XML_Clip_BASE jobs 1047297le was modi1047297ed
to receive the 01234-coded validation map at the spatial unit of
tiles The XML_Scaleable jobs 1047297le was used to execute the scaleable
window routine for each tile from a 3 3 window size through
101
101 window size The percent correct metric was saved at the10 10 window size (ie 3 km by 3 km) and PCM values merged
Fig 11 Drop one out percent difference MSE from full driver model
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268262
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1419
with the shape 1047297le for tiles Note (Fig12B) that the model goodness
of 1047297t is best east of the Mississippi River along the west coast and in
certain areas of the central United States where there are large
metropolitan cities (eg Denver) Improvement of the model thus
needs to concentrate on rural areas of the central and western
portions of the United States Similar maps are often constructed
for different window sizes to determine if scale of prediction
changes spatially
47 Forecasting
Forecasting requires merging the suitability map and the
quantity model We used several XML jobs 1047297les to construct the
forecast maps at the national scale We developed our quantity
model (Tayyebi et al 2012) that contained the number of urban
cells to grow for each polygon for 10-year time steps from 2010 to
2060 We considered each state as a job and including all the
Fig 12 Validation metrics of (A) quantity errors and (B) model goodness of 1047297t (PCM) for scaleable window size of 3 3 km
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 263
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1519
polygons within the state as different tasks to create forecast maps
of each polygon We embedded the path of prediction code and
number of urban cells to grow for each polygon within
XML_Pred_BASE job 1047297le We ran XML_Pred_BASE job 1047297le for each
state on HPC to convert the probability map to forecast map for
each polygon Then we ran the Mosaic_Python script on the HPC
using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at
the polygon level to create forecast maps at state levelSimilarly we
ran Mosaic_Python script on HPC using XML_Pred_Mosaic_Na-
tional to mosaic prediction pieces at state level to create a national
forecast map HPC also enabled us to export error messages in error
1047297les so that if any of tasks fail in a job using standard out and
standard error 1047297les to have records of program did during execu-
tion We also embedded the path of standard out and standard
error 1047297les in the tasks of the XML jobs 1047297le
Wegenerated decadal maps of land use from 2010 through 2050
from this simulation Maps of new urban (red) superimposed on
2006 land usecover from the USGS National Land Cover Maps for
eight regions are shown in Fig 13 Note that the model produces
different spatial patterns of urbanization depending on the loca-
tion urbanization in the Los AngeleseSan Diego region are more
clumped likely to due to topographic limitations of the area in the
large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast
5 Discussion
We have presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas such as watersheds (Pijanowski et al., 2000, 2002b). However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.
5.1. Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write a few lines of data at a time rather than read large files into a single large array. This was necessary despite the large amount of memory available on the HPC.
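A minimal Python sketch of the same streaming idea is given below (the C programs themselves are not reproduced); the header count, urban code and file path are illustrative assumptions for an ESRI ASCII grid.

# Sketch: process a large ESRI ASCII grid one row at a time instead of loading the
# whole raster into memory, mirroring the line-at-a-time strategy of the C programs.
def count_urban_cells(asc_path, urban_code=1, header_rows=6):
    """Stream an .asc file and count cells equal to urban_code."""
    count = 0
    with open(asc_path) as f:
        for _ in range(header_rows):   # skip ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value
            next(f)
        for line in f:                 # one raster row per line; never held in memory all at once
            count += sum(1 for v in line.split() if v == str(urban_code))
    return count

# Example (hypothetical path):
# print(count_urban_cells(r"D:\ltm_hpc\input\asc\IN\18157_urban1990.asc"))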
The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation and (5) application. Job files were constructed for the steps that interfaced each of these modeling stages. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed: this indicates that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
5.2. Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
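The fragment below sketches the kind of call involved, using the ArcGIS 10 arcpy mosaic tool; the workspace paths, output name and mosaic settings are hypothetical and the original mosaicking script is not reproduced here.

# Sketch: mosaic per-state probability rasters into one national raster with the
# ArcGIS mosaic tool. Paths and names are hypothetical; the original script also
# read its file list from an XML job description.
import arcpy

arcpy.env.workspace = r"D:\ltm_hpc\output\probability\states"
state_rasters = arcpy.ListRasters("*", "GRID")   # one probability raster per state

# Positional arguments: inputs, output folder, output name, coordinate system,
# pixel type, cell size, number of bands, mosaic method.
arcpy.MosaicToNewRaster_management(state_rasters,
                                   r"D:\ltm_hpc\output\probability\national",
                                   "ltm_prob_usa.tif",
                                   "",
                                   "32_BIT_UNSIGNED",
                                   30,
                                   1,
                                   "MAXIMUM")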
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance around this scaling is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of the various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
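As a rough illustration of this scaling, under the idealized assumption of perfect data parallelism the expected wall-clock time is simply the serial time divided by the number of cores (file-size variance will push real runs away from this ideal):

# Sketch: back-of-the-envelope runtime scaling, assuming ideal data parallelism.
def estimated_runtime(serial_hours, cores):
    """Ideal parallel runtime for a job that takes serial_hours on one core."""
    return serial_hours / float(cores)

# e.g., a hypothetical 200-hour serial run on the 20-core cluster:
# estimated_runtime(200, 20) -> 10.0 hours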
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at coarse resolution, or alternatively at small extents and fine resolutions, may hinder the
ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how the patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997;
Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrient and pollutant loads to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of spatially explicit environmental impact models. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous enough that several transition rules need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents in aggregate but are individually small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON) supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC; Bertram et al., 2003; WHCEC, 2010) metrics of ecosystem services. In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on that continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and (5) measures of data input quality. We were able to show that at 3 km scales, the error of location varied considerably across the simulation domain. Errors in quantity were greater in the eastern portion of the United States. Patterns of location error differed from those of quantity error; location errors were lower in the east (Fig. 12). The location of errors could also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed with large big data simulations such as the one presented here in mind; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6. Conclusions
This paper presents an application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:
(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for the unit, in the Python environment.
(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of
the ASCII files, by reading each line separately into an array.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for each region in XML format and can repeat all of the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997 (June). A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
neural network settles at a global minimum value. MSE is calculated as the difference between the estimate produced by the neural network (range 0.0–1.0) and the observed value of land use change (0 or 1). MSE values are saved every 100 cycles, and training is generally followed out to about 100,000 cycles. The second set of goodness of fit metrics comprises those created from a calibration map. A calibration map is constructed within the GIS using three maps coded specially for assessment of model goodness of fit. A map of observed change between the two historical maps (Fig. 4, item 16) is created such that observed change = 1 and no change = 0. A map that predicts the same land use changes over the same amount of time (Fig. 4, item 15) is coded so that predicted change = 2 and no predicted change = 0. These two maps are then summed along with the exclusionary zone map, which is coded 0 = location can change and 4 = location that needs to be excluded. The resultant calibration map (Fig. 4, item 17) contains values 0 through 4, where 0 = correctly predicted no change and 3 = correctly predicted change; values of 1 and 2 represent the two different errors (omission and commission, or false negative and false positive). The proportions of each type of error and of correctly predicted locations are used to calculate (1) the proportion of correctly predicted change locations to the number of observed change cells, also called the percent correct metric (PCM; the proportion of correctly predicted land use changes to the number of observed land use changes) (Pijanowski et al., 2002a), (2) sensitivity (the proportion of false positives) and specificity (the proportion of false negatives), and (3) scaleable PCM values across different window sizes.
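A small NumPy sketch of this coding and of the resulting summary metrics is shown below; the toy arrays stand in for the observed-change, predicted-change and exclusionary rasters.

# Sketch (NumPy): build the 01234-coded calibration map and summary metrics.
import numpy as np

observed  = np.array([[1, 0], [0, 1]])        # observed change = 1, no change = 0
predicted = np.array([[2, 2], [0, 0]])        # predicted change = 2, no change = 0
excluded  = np.array([[0, 0], [0, 4]])        # excluded locations = 4

calibration = observed + predicted + excluded  # 0 ok, 1 omission, 2 commission, 3 hit, >=4 excluded
valid = calibration < 4

hits        = np.sum((calibration == 3) & valid)
omissions   = np.sum((calibration == 1) & valid)
commissions = np.sum((calibration == 2) & valid)

pcm = 100.0 * hits / (hits + omissions)        # percent of observed changes predicted correctly
print(calibration)
print("PCM = %.1f%%, omissions = %d, commissions = %d" % (pcm, omissions, commissions))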
Fig. 6 shows how scaleable PCM values are calculated using a 01234-coded calibration map across different window sizes. The first step is to calculate the total number of true positives (cells coded as 3s) in the calibration map (Fig. 6A). For a given window (e.g., 5 cells by 5 cells), a pair consisting of a false positive (a cell coded as 2) and a false negative (a cell coded as 1) is considered together as a correct prediction at that scale, and the number of 3s is incremented by one for every such pair of false positive and false negative cells within the window. The window is then moved one position to the right (Fig. 6B) and pairs of 1s and 2s are again added to the total number of 3s for that calibration map, such that any 1s or 2s already counted are not considered again. This moving N × N window is passed across the entire simulation area and the final number of 3s is recorded (Fig. 6C). The window size is then incremented by 2 (i.e., the next window size after a 5 × 5 would be a 7 × 7) and, after all of the windows are considered in the map, the process is repeated
(note that the count of 3s is reset to the number of 3s in the entire calibration map) and the number of 3s is saved for that window size. Window sizes that we typically plot are between 3 and 101. Fig. 6D gives an example of PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 × 9, which for this simulation, conducted at 100 m × 100 m resolution, means that PCM reaches 50% at 900 m × 900 m. The scaleable window plots are also made for each reduced-input model in order to compare the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.
Fig. 6. Steps in the calculation of PCM across a moving scaleable window. Part 6A calculates the total number of true positives (coded as 3s). The window is then moved one position to the right (Part 6B) and pairs of 1s and 2s are again added to the total number of 3s. This moving window is passed across the entire area and the final number of 3s recorded (Part 6C). The window size is then incremented by 2 and the process is repeated. Part 6D gives PCM across scaleable window sizes.
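The sketch below gives one NumPy interpretation of this scaleable-window pairing; the published routine was run in C on the HPC, so this version is illustrative only and would be far too slow at national extents.

# Sketch (NumPy): scaleable-window PCM. Within each window, an unmatched false
# negative (1) is paired with an unmatched false positive (2) and counted as a hit.
import numpy as np

def scaleable_pcm(cal, window):
    """cal: 2-D 01234-coded calibration map; window: odd window size in cells."""
    hits = int(np.sum(cal == 3))
    observed_change = int(np.sum((cal == 1) | (cal == 3)))
    used = np.zeros(cal.shape, dtype=bool)     # 1s and 2s that have already been paired
    half = window // 2
    rows, cols = cal.shape
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(r - half, 0), min(r + half + 1, rows)
            c0, c1 = max(c - half, 0), min(c + half + 1, cols)
            block = cal[r0:r1, c0:c1]
            ones = np.argwhere((block == 1) & ~used[r0:r1, c0:c1])
            twos = np.argwhere((block == 2) & ~used[r0:r1, c0:c1])
            for one_rc, two_rc in zip(ones, twos):   # each 1-2 pair counts as a hit at this scale
                used[r0 + one_rc[0], c0 + one_rc[1]] = True
                used[r0 + two_rc[0], c0 + two_rc[1]] = True
                hits += 1
    return 100.0 * hits / observed_change if observed_change else 0.0

# Hypothetical use, scanning the window sizes described above:
# cal = np.loadtxt("calibration.asc", skiprows=6)
# for w in range(3, 103, 2):
#     print(w, scaleable_pcm(cal, w))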
The final step for calibration is the selection of the network file (Fig. 4, items 16–19) with the inputs that best represent land use change, and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias and activation values is saved for the model with the inputs considered best for the model application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or drop drivers that reduce model goodness of fit. However, if the drivers selected provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5. Validation
We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct a historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model for comparison with a simulated map. Validation metrics (Fig. 4, item 21) include the same metrics used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).
3.6. Forecasting
We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as government units, watersheds or ecoregions, or at any spatial unit scale, such as states, counties or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for government unit IDs), a file and directory-naming system, XML files and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
We next use a program written in C to convert probability values to binary change values (0 for cells without change and 1 for locations of change in the prediction map), using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file. Rows in the CSV file contain the unique unit IDs and the number of cells to transition for each time step. The program reads the probability map for the spatial unit (i.e., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank. The original order is maintained using an index for each record. The probability values with high rank are then converted to urban (code 1) until the number of new urban cells for each unit is satisfied, while the other cells (code 0) remain without change. A separate GIS map (Fig. 4, item 25) may be created to apply additional exclusionary rules and create an alternative scenario. Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
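A NumPy sketch of this rank-based allocation is given below; it is a simplified stand-in for the C program, with hypothetical inputs.

# Sketch (NumPy): cells with the highest suitability are converted to urban (1) until
# the quantity model's cell count for the spatial unit is met; everything else stays 0.
import numpy as np

def allocate_urban(probability, n_cells_to_grow, exclusion=None):
    """probability: 2-D suitability map (0.0-1.0); returns a 0/1 forecast map."""
    prob = probability.astype(float).copy()
    if exclusion is not None:
        prob[exclusion] = -1.0                   # excluded cells can never be selected
    order = np.argsort(prob.ravel())[::-1]       # indices sorted by descending suitability
    forecast = np.zeros(prob.size, dtype=np.uint8)
    forecast[order[:n_cells_to_grow]] = 1
    return forecast.reshape(probability.shape)

# Hypothetical use: the quantity model says 3 new urban cells for this unit.
# suit = np.random.rand(5, 5)
# print(allocate_urban(suit, 3))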
3.7. HPC job configuration
We developed a coding schema for the purpose of running the simulation across multiple locations. We used the standard numbering system from the Federal Information Processing Standards (FIPS) that is associated with states, counties and places. FIPS is a hierarchical numbering system that assigns each state a two-digit code and each county within a state a three-digit code. A specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana), and places are given a seven-digit code: two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
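The following sketch simply illustrates how these hierarchical identifiers are composed; the example values are the ones given above.

# Sketch: composing the hierarchical FIPS identifiers used to name files and HPC tasks.
def county_fips(state_code, county_code):
    return "%02d%03d" % (state_code, county_code)     # 2-digit state + 3-digit county

def place_fips(state_code, place_code):
    return "%02d%05d" % (state_code, place_code)      # 2-digit state + 5-digit place

print(county_fips(18, 157))    # -> 18157 (Tippecanoe County, Indiana)
print(place_fips(18, 82862))   # -> 1882862 (West Lafayette, Indiana)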
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place spatial maps into a national map). For our LTM-HPC application, where we would need to mosaic over 20,000 census places, a job failure for any one of the places would stop the one large job and then require resuming the execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs, where the number of jobs and associated XML files is still manageable. A failure of one census place would then require less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the FIPS designator for both to assign names for input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide each job into tasks; for example, our national simulation described below at the county and place levels is organized by state, so that the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML file for a region (state) with its status (item 2). Core resources are shown in Fig. 7 (item 3). A tab (Fig. 7, item 4) displays the status of each task (Fig. 7, item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language could be used.
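The sketch below illustrates the idea of generating one job description per state with one task per census place; the element and attribute names are simplified placeholders rather than the exact Windows HPC Server 2008 job schema, and all paths are hypothetical.

# Sketch: generate a per-state job description with one task per census place.
# Element/attribute names are simplified placeholders, not the real HPC job schema.
import xml.etree.ElementTree as ET

def build_job_xml(state_fips, place_fips_list, exe_path, out_path):
    job = ET.Element("Job", Name="LTM_pred_%s" % state_fips)
    tasks = ET.SubElement(job, "Tasks")
    for place in place_fips_list:
        ET.SubElement(
            tasks, "Task",
            CommandLine="%s %s" % (exe_path, place),
            StdOutFilePath=r"D:\ltm_hpc\logs\%s\%s.out" % (state_fips, place),
            StdErrFilePath=r"D:\ltm_hpc\logs\%s\%s.err" % (state_fips, place))
    ET.ElementTree(job).write(out_path)

# build_job_xml("18", ["1882862", "1836003"], r"D:\ltm_hpc\bin\ltm_predict.exe",
#               r"D:\ltm_hpc\jobs\XML_Pred_18.xml")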
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both ArcGIS Python scripts read the digital spatial unit codes from a variable in the shape file attribute table and
name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we do below), and (3) allow the researcher to forecast at different spatial units (also illustrated below).
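A simplified stand-in for polygon_clip.py is sketched below; the shapefile, field name and paths are hypothetical, and the selection-then-extract pattern is only one way such a script might be organized.

# Sketch: clip a national input raster by spatial-unit polygons and name the outputs
# by unit code. Field and path names are hypothetical illustrations only.
import arcpy
from arcpy.sa import ExtractByMask

arcpy.CheckOutExtension("Spatial")
units_fc  = r"D:\ltm_hpc\input\units\census_places.shp"
in_raster = r"D:\ltm_hpc\input\national\dist_highway"
arcpy.MakeFeatureLayer_management(units_fc, "unit_lyr")

rows = arcpy.SearchCursor(units_fc)                      # ArcGIS 10.0-era cursor
for row in rows:
    code = row.getValue("PLACE_FIPS")                    # unique unit code in the attribute table
    arcpy.SelectLayerByAttribute_management(
        "unit_lyr", "NEW_SELECTION", "\"PLACE_FIPS\" = '%s'" % code)
    clipped = ExtractByMask(in_raster, "unit_lyr")       # clip to the selected polygon
    arcpy.RasterToASCII_conversion(
        clipped, r"D:\ltm_hpc\input\asc\%s_dist_highway.asc" % code)
del rows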
4. Execution of the LTM-HPC
4.1. Hardware and software description
We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing of about a dozen steps occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit computer workstation equipped with 24 GB of RAM, a 256 GB solid state hard drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were processed using the Euclidean Distance tool in ArcGIS, setting all output to double-precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
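A minimal sketch of this distance-layer step, assuming hypothetical input and output paths, is shown below.

# Sketch: build a distance-to-feature layer with the Spatial Analyst Euclidean
# Distance tool, capped at 250 km as described above. Paths are hypothetical.
import arcpy
from arcpy.sa import EucDistance

arcpy.CheckOutExtension("Spatial")
arcpy.env.cellSize = 30                                   # 30 m national grid

dist_to_highways = EucDistance(r"D:\ltm_hpc\input\vector\highways.shp",
                               maximum_distance=250000)   # cap distances at 250 km
dist_to_highways.save(r"D:\ltm_hpc\input\national\dist_highway")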
The hardware platform on which the parallelization was carried out was an HPC cluster consisting of five nodes containing a total of 20 cores. Windows Server HPC Edition 2008 was installed on the cluster. Each node was powered by a pair of dual-core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, one used for intra-cluster communication and the other for external communication. Each node had 74 GB of hard drive space that was used for the operating system and software but was not used for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit, which
is composed of nodes, single physical or logical computers with one or more cores that include one or more processors. All modeling data were read from and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 gigabit bandwidth.
Fig. 7. Data structure, programs and files associated with training by the neural network. Item 1 represents an XML file for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows Server 2008 R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers. Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
4.2. Preliminary tests
The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation per core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3. Data preparation for national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and of the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads, and slope.
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county, and places (cities, villages and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat program was executed per census place to convert the files from ASC to PAT format.
4.4. Pattern recognition simulations for the national model
We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500n.NET file was applied to each tile PAT file to create a RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer
to 1.0 mean a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double-precision integers.
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
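The scaling performed by convert2ascii can be illustrated with a few lines of NumPy (the program itself is written in C):

# Sketch: store 0.0-1.0 suitability values as integers so the national mosaic can be
# kept as an integer raster, mirroring the multiply-by-100,000 step described above.
import numpy as np

def scale_probabilities(res_values, factor=100000):
    """res_values: SNNS outputs in 0.0-1.0; returns integer-coded values."""
    return np.rint(np.asarray(res_values, dtype=float) * factor).astype(np.int64)

print(scale_probabilities([0.0, 0.53271, 1.0]))   # -> [     0  53271 100000]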
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
4.5. Calibration of the national simulation
We trained six neural network versions of the model: one that contained all five input variables and five that contained four input variables each, where we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of the MSE from the full-input-variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in the weights and in the values for the activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute in a positive way toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
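The drop-one-out bookkeeping can be sketched as below; an ordinary least-squares fit stands in for the SNNS neural network purely to keep the example self-contained and runnable, so only the looping and the percent-difference calculation mirror the procedure described above.

# Sketch: drop-one-out protocol. A trivial least-squares fit is a stand-in for ANN
# training; the driver names and sampled values are hypothetical.
import numpy as np

def train_and_mse(X, y):
    """Stand-in for ANN training: ordinary least squares, returning the fit MSE."""
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X.dot(coef) - y) ** 2))

drivers = ["dist_highway", "dist_urban", "slope", "dist_road", "dist_stream"]
rng = np.random.default_rng(0)
X = rng.random((500, len(drivers)))          # hypothetical sampled driver values
y = rng.integers(0, 2, 500).astype(float)    # observed change (0/1)

full_mse = train_and_mse(X, y)
for i, name in enumerate(drivers):
    reduced = np.delete(X, i, axis=1)        # drop one driver
    diff = 100.0 * (train_and_mse(reduced, y) - full_mse) / full_mse
    print("%-13s %+.2f%% MSE vs full model" % (name, diff))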
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
4.6. Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2 and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles along coastal areas contained errors in quantity of urban greater than 5%.
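A sketch of this tabulation with the Spatial Analyst TabulateArea tool follows; the tile shapefile, field names and paths are hypothetical.

# Sketch: summarize the area of each validation code (0-3) per simulation tile using
# the Spatial Analyst TabulateArea tool. Paths and field names are hypothetical.
import arcpy
from arcpy.sa import TabulateArea

arcpy.CheckOutExtension("Spatial")
TabulateArea(r"D:\ltm_hpc\input\units\tiles.shp", "TILE_ID",        # zones: the 634 tiles
             r"D:\ltm_hpc\output\validation\val_coded_2006", "VALUE",
             r"D:\ltm_hpc\output\validation\tile_code_areas.dbf",
             30)                                                     # processing cell size (m)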
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error.
acterizes location error The XML_Clip_BASE jobs 1047297le was modi1047297ed
to receive the 01234-coded validation map at the spatial unit of
tiles The XML_Scaleable jobs 1047297le was used to execute the scaleable
window routine for each tile from a 3 3 window size through
101
101 window size The percent correct metric was saved at the10 10 window size (ie 3 km by 3 km) and PCM values merged
Fig 11 Drop one out percent difference MSE from full driver model
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268262
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1419
with the shape 1047297le for tiles Note (Fig12B) that the model goodness
of 1047297t is best east of the Mississippi River along the west coast and in
certain areas of the central United States where there are large
metropolitan cities (eg Denver) Improvement of the model thus
needs to concentrate on rural areas of the central and western
portions of the United States Similar maps are often constructed
for different window sizes to determine if scale of prediction
changes spatially
47 Forecasting
Forecasting requires merging the suitability map and the
quantity model We used several XML jobs 1047297les to construct the
forecast maps at the national scale We developed our quantity
model (Tayyebi et al 2012) that contained the number of urban
cells to grow for each polygon for 10-year time steps from 2010 to
2060 We considered each state as a job and including all the
Fig 12 Validation metrics of (A) quantity errors and (B) model goodness of 1047297t (PCM) for scaleable window size of 3 3 km
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 263
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1519
polygons within the state as different tasks to create forecast maps
of each polygon We embedded the path of prediction code and
number of urban cells to grow for each polygon within
XML_Pred_BASE job 1047297le We ran XML_Pred_BASE job 1047297le for each
state on HPC to convert the probability map to forecast map for
each polygon Then we ran the Mosaic_Python script on the HPC
using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at
the polygon level to create forecast maps at state levelSimilarly we
ran Mosaic_Python script on HPC using XML_Pred_Mosaic_Na-
tional to mosaic prediction pieces at state level to create a national
forecast map HPC also enabled us to export error messages in error
1047297les so that if any of tasks fail in a job using standard out and
standard error 1047297les to have records of program did during execu-
tion We also embedded the path of standard out and standard
error 1047297les in the tasks of the XML jobs 1047297le
Wegenerated decadal maps of land use from 2010 through 2050
from this simulation Maps of new urban (red) superimposed on
2006 land usecover from the USGS National Land Cover Maps for
eight regions are shown in Fig 13 Note that the model produces
different spatial patterns of urbanization depending on the loca-
tion urbanization in the Los AngeleseSan Diego region are more
clumped likely to due to topographic limitations of the area in the
large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast
5 Discussion
We presented an overview of the conversion of a single work-
station land change model that has been converted to operate using
a high performance computer cluster The Land Transformation
Model was originally developedto simulatesmall areas (Pijanowski
et al 2000 2002b) such as watersheds However there is a need
for larger sized land change models especially those that can be
coupled to large-scale process models such as climate change (cf
Olson et al 2008 Pijanowski et al 2011) and dynamic hydrologic
models (Yang et al 2010 Mishra et al 2010) We have argued that
to accomplish the goal of increasing the size of the simulationseveral challenges had to be overcome These included (1) pro-
cessing of large databases (2) the management of large numbers of
1047297les (3) the need for a high-level architecture that integrates
model components (4) error checking and (5) the management of
multiple job executions Here we brie1047298y discuss how we addressed
these challenges as well as lessons learned in porting the original
LTM to an HPC environment
51 Challenges of executing large-scale models
We found that the large datasets used for input and output were
dif 1047297cult to manage successfully within ArcGIS 100 The 1047297les had to
be managed as smaller subsets either as states or regions (ie
multiple states) and in the case of Texas we had to manage thisstate as separate counties Programs written in C had to read and
write lines of data at a time rather than read large 1047297les into a large
array This is needed despite a large amount of memory contained
in the HPC
The large number of 1047297les were managed using a standard 1047297le
naming coding system and hierarchical arrangement of folders on
our storage server The coding system also helped us to construct
the xml 1047297le content used by the job manager in Windows 2008
Server R2
The high-level architecture was designed after the proper steps
that have been outlined by prominent land change modeling sci-
entists (Pontius et al 2004 Pontius et al 2008) These include
steps for (1) data sampling from input 1047297les (2) training (3) cali-
bration (4) validation and (5) application Job 1047297
les were
constructed for steps the interfaced each of these modeling steps
In fact we found quickly discovered that the most logical directory
structure mirrored the high-level architecture of the model
We found that jobs or tasks can fail for one of the following reasons. (1) One or more tasks in the job failed, indicating that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executables fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failure is inaccessibility of required file locations, including the standard input, output, and error files and the working directory.
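Because each task writes its own standard error file, a short script can report which executables failed after a run; a minimal sketch follows, in which the log root and the ".err" naming convention are assumptions that follow the paths we embedded in the job XML:

    import os

    def failed_tasks(log_root):
        """Scan per-task standard-error files and report which tasks wrote
        anything to stderr, i.e. which executables likely failed."""
        failures = []
        for dirpath, _, files in os.walk(log_root):
            for name in files:
                if name.endswith(".err"):
                    path = os.path.join(dirpath, name)
                    if os.path.getsize(path) > 0:  # non-empty stderr -> task reported an error
                        failures.append(path)
        return failures

    # for f in failed_tasks(r"D:\logs"):
    #     print("task failed, see", f)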
5.2 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made manual mosaicking easy. However, the prediction output was stored by state and by year, which made mosaicking a time consuming and error prone process; in some cases we needed to mosaic a few areas manually because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not fail-safe. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were mosaicked manually to create the national probability map for the USA (Fig. 13).
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the original LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the time required relative to the LTM is approximately halved for every doubling of cores, with deviations from this scaling caused by variation in file size. The HPC also provides additional benefits to researchers interested in running large-scale models. These include a reduction in the need for human control of the various steps, which reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3 Needs for land change model forecasts at large extents and fine resolution
Models that simulate large areas at fine resolutions and produce output at multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at coarse resolution, or alternatively at small extents and fine resolutions, may hinder the
ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include: how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how the patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses, such as urban and agriculture, increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010); to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011); and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds, and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of spatially explicit environmental impact models. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous enough that several transition rules need to be applied in the same simulation, with rules assigned to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are individually small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and threatened (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013); the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010); and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4 Calibration and validation of big data simulations
We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on: (1) fine resolution accuracy; (2) a quantification of the variability of fine resolution accuracy across the entire simulation; (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit); (4) the relative cost of an error (i.e., whether an error of location is important to the application); and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, whereas the pattern of location errors differed from quantity, being lower in the east (Fig. 12). The location of errors could be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed with large big data simulations such as the one presented here in mind; more research in this area is needed to make a full assessment of how well a model like this performs.
6 Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors using an HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional to continental scales. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identity for each unit, in the Python environment.
(2) Better memory usage: the C source code of the model has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for each region in XML format and can repeat all of the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a), which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
References
Adeloye AJ Rustum R Kariyama ID 2012 Neural computing modeling of thereference crop evapotranspiration Environ Model Softw 29 61e73
Anselme B Bousquet F Lyet A Etienne M Fady B Le Page C 2010 Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France)Environ Model Softw 25 (11) 1385e1398
Bennett ND Croke BFW Guariso G Guillaume JHA Hamilton SH Jakeman AJ Marsili-Libelli S Newhama LTH Norton JP Perrin CPierce SA Robson B Seppelt R Voinov AA Fath BD Andreassian V 2013Environ Model Softw 40 1e20
Bertram P Stadler-Salt N Horvatin P Shear H 2003 Bi-national assessment of the Great Lakes SOLEC partnerships Environ Monit Assess 81 (1e3) 27e33
Bishop CM 1995 Neural Networks for Pattern Recognition Oxford UniversityPress Oxford
Bishop CM 2005 Neural Networks for Pattern Recognition Oxford UniversityPress ISBN 0-19-853864-2
Bonan GB 2008 Forests and climate change forcings feedbacks and the climatebene1047297ts of forests Science 320 (5882) 1444e1449
Boutt DF Hyndman DW Pijanowski BC Long DT 2001 Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS Ground Water 39 (1) 24e34
Burton A Kilsby C Fowler H Cowpertwait P O'Connell P 2008 RainSim a spatial-temporal stochastic rainfall modeling system Environ Model Softw 23 (12) 1356e1369
Buyya R (Ed) 1999 High Performance Cluster Computing Architectures andSystems vol 1 Prentice Hall Englewood Cliffs NJ
Carpani M Bergez JE Monod H 2012 Sensitivity analysis of a hierarchicalqualitative model for sustainability assessment of cropping systems EnvironModel Softw 27e28 15e22
Chapman T 1998 Stochastic modelling of daily rainfall the impact of adjoiningwet days on the distribution of rainfall amounts Environ Model Softw 13 (3e4) 317e324
Cheung AL Reeves Anthony P 1992 High performance computing on a cluster of workstations HPDC 1992 152e160
Clarke KC Gazulis N Dietzel C Goldstein NC 2007 A decade of SLEUTHing lessons learned from applications of a cellular automatonland use change model In Classics from IJGIS Twenty Years of theInternational Journal of Geographical Information Systems and Sciencepp 413e425
Cox PM Betts RA Jones CD Spall SA Totterdell IJ 2000 Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate modelNature 408 (6809) 184e187
Dale VH 1997 The relationship between land-use change and climate changeEcol Appl 7 (3) 753e769
Davis AY Pijanowski BC Robinson KD Kidwell PB 2010a Estimating parking lot footprints in the Upper Great Lakes region of the USA Landsc Urban Plan 96 (2) 68e77
Davis AY Pijanowski BC Robinson K Engel B 2010b The environmental and economic costs of sprawling parking lots in the United States Land Use Policy 27 (2) 255e261
Denoeux T Lengellé 1993 Initializing back propagation networks with prototypes Neural Netw 6 351e363
Dietzel C Clarke K 2006 The effect of disaggregating land use categories incellular automata during model calibration and forecasting Comput EnvironUrban Syst 30 (1) 78e101
Dietzel C Clarke KC 2007 Toward optimal calibration of the SLEUTH land usechange model Trans GIS 11 (1) 29e45
Dixon RK Winjum JK Andrasko KJ Lee JJ Schroeder PE 1994 Integratedland-use systems assessment of promising agroforest and alternative land-usepractices to enhance carbon conservation and sequestration Clim Change 27(1) 71e92
Dlamini W 2008 A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland Environ Model Softw 25 (2) 199e208
ESRI 2011 ArcGIS 10 Software
Fei S Kong N Stinger J Bowker D 2008 In Ravinder K Jose S Singh HBatish D (Eds) Invasion Pattern of Exotic Plants in Forest Ecosystems InvasivePlants and Forest Ecosystems CRC Press Boca Raton FL pp 59e70
Fitzpatrick M Long D Pijanowski B 2007 Biogeochemical fingerprints of land use in a regional watershed Appl Biogeochem 22 1825e1840
Foster I Kesselman C 1997 Globus a metacomputing infrastructure toolkit Int J Supercomput Appl 11 (2) 115e128
Foster DR Hall B Barry S Clayden S Parshall T 2002 Cultural environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard USA J Biogeogr 29 1381e1400
GLP 2005 Science Plan and Implementation Strategy IGBP Report No 53 IHDP Report No 19 IGBP Secretariat Stockholm 64 pp
Grimm NB Redman CL 2004 Approaches to the study of urban ecosystems the case of Central Arizona-Phoenix Urban Ecosyst 7 (3) 199e213
Guo LB Gifford RM 2002 Soil carbon stocks and land use change a metaanalysis Glob Change Biol 8 (4) 345e360
Herold M Goldstein NC Clarke KC 2003 The spatiotemporal form of urbangrowth measurement analysis and modeling Remote Sens Environ 86 (3)286e302
Herold M Couclelis H Clarke KC 2005 The role of spatial metrics in the analysisand modeling of urban land use change Comput Environ Urban Syst 29 (4)369e399
Hey AJ 2009 The Fourth Paradigm Data-intensive Scientific Discovery
Jacobs A 2009 The pathologies of big data Commun ACM 52 (8) 36e44
Kampe TU Johnson BR Kuester M Keller M 2010 NEON the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure J Appl Remote Sens 4 (1) 043510
Kaye JP Groffman PM Grimm NB Baker LA Pouyat RV 2006 A distincturban biogeochemistry Trends Ecol Evol 21 (4) 192e199
Kilsby C Jones P Burton A Ford A Fowler H Harpham C James P Smith AWilby R 2007 A daily weather generator for use in climate change studiesEnviron Model Softw 22 (12) 1705e1719
Lagabrielle E Botta A Daré W David D Aubert S Fabricius C 2010 Modeling with stakeholders to integrate biodiversity into land-use planning lessons learned in Réunion Island (Western Indian Ocean) Environ Model Softw 25 (11) 1413e1427
Lambin EF Geist HJ (Eds) 2006 Land Use and Land Cover Change Local Pro-cesses and Global Impacts Springer
LaValle S Lesser E Shockley R Hopkins MS Kruschwitz N 2011 Big data analytics and the path from insights to value MIT Sloan Manag Rev 52 (2)21e32
Lei Z Pijanowski BC Alexandridis KT Olson J 2005 Distributed modelingarchitecture of a multi-agent-based behavioral economic landscape (MABEL)model Simulation 81 (7) 503e515
Loepfe L Martínez-Vilalta J Piñol J 2011 An integrative model of human-influenced fire regimes and landscape dynamics Environ Model Softw 26 (8) 1028e1040
Lynch C 2008 Big data how do your data grow Nature 455 (7209) 28e29
Mas JF Puig H Palacio JL Sosa AA 2004 Modeling deforestation using GIS and artificial neural networks Environ Model Softw 19 (5) 461e471
MEA Millennium Ecosystem Assessment 2005 Ecosystems and Human Well-being Current State and Trends Island Press Washington DC
Merritt WS Letcher RA Jakeman AJ 2003 A review of erosion and sedimenttransport models Environ Model Softw 18 (8e9) 761e799
Mishra V Cherkauer K Niyogi D Ming L Pijanowski B Ray D Bowling L2010 Regional scale assessment of land useland cover and climatic changes onsurface hydrologic processes Int J Climatol 30 2025e2044
Moore N Torbick N Lofgren B Wang J Pijanowski B Andresen J Kim DOlson J 2010 Adapting MODIS-derived LAI and fractional cover into theRegional Atmospheric Modeling System (RAMS) in East Africa Int J Climatol30 (3) 1954e1969
Moore N Alagarswamy G Pijanowski B Thornton P Lofgren B Olson J Andresen J Yanda P Qi J 2011 East African food security as influenced by future climate change and land use change at local to regional scales Clim Change http://dx.doi.org/10.1007/s10584-011-0116-7
Olson J Alagarswamy G Andresen J Campbell D Davis A Ge J Huebner M
Lofgren B Lusch D Moore N Pijanowski B Qi J Thornton P Torbick NWang J 2008 Integrating diverse methods to understand climate-land in-teractions in east Africa GeoForum 39 (2) 898e911
Pekin BK Pijanowski BC 2012 Global land use intensity and the endangermentstatus of mammal species Divers Distrib 18 (9) 909e918
Peralta J Li X Gutierrez G Sanchis A 2010 Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution Neural Netw (IJCNN) 1e8
Pérez-Vega A Mas JF Ligmann A 2012 Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest Environ Model Softw 29 11e23
Pickett ST Burch Jr WR Dalton SE Foresman TW Grove JM Rowntree R 1997 A conceptual framework for the study of human ecosystems in urban areas Urban Ecosyst 1 (4) 185e199
Pielke RA 2005 Land use and climate change Science 310 (5754) 1625e1626
Pijanowski BC Long DT Gage SH Cooper WE 1997 A Land Transformation Model Conceptual Elements Spatial Object Class Hierarchies GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed In Submitted to the Land Use Modeling Workshop USGS EROS Data Center
(note that the number of 3s is reset to the number of 3s in the entire calibration map) and the number of 3s saved for that window size. Window sizes that we often plot are between 3 and 101. Fig. 6D gives an example PCM across scaleable window sizes. Note in this plot that the PCM begins to exceed 50% around a window size of 9 x 9, which for this simulation, conducted at 100 m x 100 m, means that the PCM reaches 50% at 900 m x 900 m. The scaleable window plots are also made for each reduced-input model in order to determine the behavior of the training of the neural network against the goodness of fit of the calibration maps by input.
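The scaleable-window idea can be sketched as follows; this is a minimal Python illustration assuming a coded map where 1 = observed change only, 2 = predicted change only, and 3 = both, and it approximates the within-window matching rather than reproducing the LTM's exact bookkeeping (e.g., how the count of 3s is reset per window size):

    import numpy as np

    def pcm_at_window(coded, w):
        """Percent correct metric for a coded change map at window size w.
        Within each w x w block, predicted-change cells are matched against
        observed-change cells, so agreement accumulates as windows grow."""
        rows, cols = coded.shape
        hits = misses = 0
        for r in range(0, rows, w):
            for c in range(0, cols, w):
                block = coded[r:r + w, c:c + w]
                observed = np.count_nonzero((block == 1) | (block == 3))
                predicted = np.count_nonzero((block == 2) | (block == 3))
                matched = min(observed, predicted)  # pairs reconcilable at this scale
                hits += matched
                misses += observed - matched
        total = hits + misses
        return 100.0 * hits / total if total else float("nan")

    # PCM plotted across window sizes, as in Fig. 6D:
    # coded = np.loadtxt("calibration_coded.asc", skiprows=6)  # hypothetical file
    # curve = {w: pcm_at_window(coded, w) for w in range(3, 102, 2)}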
The final step of calibration is the selection of the network file (Fig. 4, items 16-19) with the inputs that best represent land use change, and an assessment of how well the model predicts across different spatial scales. The network file with the weights, bias, and activation values is saved for the model with the inputs considered best for the application. If the model does not perform adequately (Fig. 4, item 19), the user may consider other input drivers or drop drivers that reduce model goodness of fit. However, if the selected drivers provide a positive contribution to the goodness of fit and the overall model is deemed adequate, then this network file is saved and used in the next step, model validation.
3.5 Validation
We follow the recommended procedures of Pontius et al. (2004) and Pontius and Spencer (2005) to validate our model. Briefly, we use an independent data set across time to conduct a historical forecast, comparing a simulated map (Fig. 4, item 15) with an observed historical land use map that was not used to build the ANN model (Fig. 4, item 20). For example, below (Section 4.6) we describe how we use a 2006 land use map that was not used to build the model for comparison with a simulated map. Validation metrics (Fig. 4, item 21) are the same as those used for calibration, namely PCM of the entire map or spatial unit, sensitivity, specificity, PCM across window sizes, and error of quantity. It should be noted that because we fix the quantity of the land use class that changes between time 1 and time 2 for calibration, we do so for validation as well (e.g., between time 2 and time 3, the number of cells that changed in the observed maps is used to fix the quantity of cells to change in the simulation that forecasts time 3).
3.6 Forecasting
We designed the LTM-HPC so that the quantity model (Fig. 4, item 24) of the forecasting component can be executed for any spatial unit category, such as governmental units, watersheds, or ecoregions, or any spatial unit scale, such as states, counties, or places. The quantity model is developed offline using Excel and algorithms that relate a principle index driver (PID; see Pijanowski et al., 2002a) that scales the amount of land use change (e.g., urban or crops) per person. In the application described below, we execute the model at several spatial unit scales: cities, states, and the lower 48 states. Using a combination of unique unit IDs (e.g., Federal Information Processing Standards (FIPS) codes are used for governmental unit IDs), a file and directory-naming system, XML files, and Python scripts, the HPC was used to manage jobs and tasks organized by the unique unit IDs.
We next use a program written in C to convert probability values to binary change values (0 for cells without change, 1 for locations of change in the prediction map) using input from the quantity change model (Fig. 4, item 24). The quantity change model produces a table of the number of cells to grow for each time step for each spatial unit from a CSV file; rows in the CSV file contain the unique unit IDs and the number of cells to transition at each time step. The program reads the probability map for the spatial unit (e.g., a particular city) being simulated, counts the number of cells for each probability value, and then sorts the values and counts by rank; the original order is maintained using an index for each record. The probability values with high rank are then converted to urban (code 1) until the number of new urban cells for the unit is satisfied, while other cells (code 0) remain unchanged. A separate GIS map (Fig. 4, item 25) may be created to apply additional exclusionary rules and create an alternative scenario. Output from the model (Fig. 4, item 26) is used for planning or natural resource management (Skole et al., 2002; Olson et al., 2008) (Fig. 4, item 27), as input to other environmental models (e.g., Ray et al., 2012; Wiley et al., 2010; Mishra et al., 2010; Yang et al., 2010) (Fig. 4, item 28), or for the production of multimedia products that can be ported to the internet (Fig. 4, item 29).
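The ranking allocation described above is implemented in C in our system; a minimal Python sketch of the same logic is given below for illustration (ties here are broken arbitrarily, whereas the C program preserves the original cell order with an index):

    import numpy as np

    def allocate_change(prob, n_new, excluded=None):
        """Turn a suitability/probability surface into a binary change map.
        prob     : 2-D array of transition probabilities for one spatial unit
        n_new    : number of cells this unit must transition (from the CSV
                   produced by the quantity model)
        excluded : optional boolean mask of cells that may never change
        Returns a 0/1 array with the n_new highest-probability cells set to 1."""
        flat = prob.astype(float).ravel()
        if excluded is not None:
            flat[excluded.ravel()] = -np.inf      # exclusionary zone never selected
        order = np.argsort(flat)[::-1]            # rank cells, highest probability first
        change = np.zeros(flat.shape, dtype=np.uint8)
        change[order[:n_new]] = 1                 # grow until the quantity target is met
        return change.reshape(prob.shape)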
3.7 HPC job configuration
We developed a coding schema for the purpose of running the simulation across multiple locations. We used the standard numbering system of the Federal Information Processing Standards (FIPS), which is associated with states, counties, and places. FIPS is a hierarchical numbering system that assigns each state a two-digit code and each county within a state a three-digit code; a specific county is thus given a five-digit integer value (e.g., 18157 for Tippecanoe County, Indiana). Places are given a seven-digit code: two digits for the state and five digits for the place (e.g., 1882862 for the city of West Lafayette, Indiana).
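A minimal sketch of helpers that compose and split these codes, using the examples from the text, could look like this:

    def county_fips(state_fips, county_code):
        """Compose the five-digit county FIPS code, e.g. (18, 157) -> '18157'."""
        return f"{state_fips:02d}{county_code:03d}"

    def place_fips(state_fips, place_code):
        """Compose the seven-digit place FIPS code, e.g. (18, 82862) -> '1882862'."""
        return f"{state_fips:02d}{place_code:05d}"

    def split_place_fips(code):
        """Split a seven-digit place code back into (state, place)."""
        code = f"{int(code):07d}"
        return int(code[:2]), int(code[2:])

    # county_fips(18, 157)  -> '18157'   (Tippecanoe County, Indiana)
    # place_fips(18, 82862) -> '1882862' (West Lafayette, Indiana)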
Configuring HPC jobs and constructing the associated XML files can be approached in different ways. The first is to develop one job and one XML file per model simulation component (e.g., mosaicking individual census place maps into a national map). For our LTM-HPC application, where we need to mosaic over 20,000 census places, a failure for any one place would stop the single large job and require resuming execution at the point of failure. A second approach, used here, is to group tasks into numerous jobs so that the number of jobs and associated XML files remains manageable; a failure for one census place then requires less re-execution and troubleshooting of that job. We often grouped the execution of census place tasks by state, using the state FIPS designator both to group tasks and to assign names for input and output files.
Five different jobs are part of the LTM-HPC (Fig. 7): one for clipping a large file into smaller subsets, another for mosaicking smaller files into one large file, one for controlling the calibration programs, another for creating forecast maps, and a fifth for controlling data transposing between ASCII flat files and SNNS pattern files. XML files are used by the HPC job manager to subdivide each job into tasks; for example, our national simulation, described below at the county and place levels, is organized by state, and thus the job contains 48 tasks, one for each state. Fig. 7 is a sample Windows job manager interface for mosaicking over 20,000 places. Each top line in Fig. 7 (item 1) represents an XML file for a region (state) with its status (item 2). Core resources are shown in item 3. A tab (item 4) displays the status of each task (item 5) within a job. We used a Python script to create each of the XML files, although any programming or scripting language could be used.
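A minimal sketch of how such per-state job XML files could be generated follows; the element and attribute names (Job, Task, CommandLine, StdOutFilePath, StdErrFilePath), paths, and the program name ltm_forecast.exe are illustrative and would need to match the job schema expected by the Windows HPC 2008 R2 job manager:

    from xml.sax.saxutils import quoteattr

    def write_state_job(state_fips, place_codes, out_path):
        """Write one job XML per state, with one task per census place."""
        lines = ['<Job Name="LTM_state_%s">' % state_fips, "  <Tasks>"]
        for place in place_codes:
            cmd = r"ltm_forecast.exe D:\input\%s\%s.asc D:\output\%s\%s.asc" % (
                state_fips, place, state_fips, place)
            lines.append(
                "    <Task Name=%s CommandLine=%s StdOutFilePath=%s StdErrFilePath=%s />"
                % (quoteattr("place_%s" % place), quoteattr(cmd),
                   quoteattr(r"D:\logs\%s\%s.out" % (state_fips, place)),
                   quoteattr(r"D:\logs\%s\%s.err" % (state_fips, place))))
        lines += ["  </Tasks>", "</Job>"]
        with open(out_path, "w") as f:
            f.write("\n".join(lines))

    # write_state_job("18", ["1882862", "1836003"], "job_IN.xml")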
We then used an ArcGIS Python script to mosaic the ASCII maps; an XML file that lists file and path names was used as input to the Python script. Mosaicking and clipping are conducted in ArcGIS using the Python scripts polygon_clip.py and polygon_mosaic.py. Both scripts read the digital spatial unit codes from a variable in the shapefile attribute table and name files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to: (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map; (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below); and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
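A minimal sketch of the mosaicking step is given below, assuming arcpy from ArcGIS 10.0; the workspace path and file names are hypothetical, and the exact arguments used in polygon_mosaic.py may differ:

    import arcpy

    arcpy.env.workspace = r"D:\ltm\output\asc\18"      # hypothetical folder of per-place ASCII grids

    rasters = []
    for name in arcpy.ListFiles("*.asc"):
        out_ras = name.replace(".asc", "_r")
        # probabilities were scaled to integers by convert2ascii, so INTEGER is used here
        arcpy.ASCIIToRaster_conversion(name, out_ras, "INTEGER")
        rasters.append(out_ras)

    # Mosaic all per-place suitability rasters into one state-level raster
    # (args: inputs, output folder, output name, coord system, pixel type,
    #  cell size, bands, mosaic method)
    arcpy.MosaicToNewRaster_management(
        rasters, r"D:\ltm\output\mosaic", "suit_18.tif",
        "", "32_BIT_SIGNED", 30, 1, "MAXIMUM")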
4 Execution of LTM-HPC
4.1 Hardware and software description
We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing, about a dozen steps, occurred on this system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit workstation equipped with 24 GB of RAM, a 256 GB solid state drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were computed using the Euclidean Distance tool in ArcGIS, with all output set to double precision integer; given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, the files were moved to the storage server.
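A minimal sketch of the distance-layer preparation is shown below, assuming arcpy with Spatial Analyst; the input feature class and output paths are hypothetical, while the 30 m cell size and the 250 km cap follow the text:

    import arcpy
    from arcpy.sa import EucDistance, Int

    arcpy.CheckOutExtension("Spatial")
    arcpy.env.extent = r"D:\ltm\input\conus_mask"     # national analysis extent (hypothetical)

    # Distance to nearest highway, capped at 250 km, 30 m cells
    dist = EucDistance(r"D:\ltm\input\highways.shp", 250000, 30)
    arcpy.CopyRaster_management(Int(dist), r"D:\ltm\input\dist_highway")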
The hardware platform on which the parallelization was carried out was an HPC cluster of five nodes containing a total of 20 cores; Windows Server HPC Edition 2008 was installed on the cluster. Each node was powered by a pair of dual-core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb/s network adapters, one used for intra-cluster communication and the other for external communication. Each node had 74 GB of hard drive space that was used for the operating system and software but not for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit and is composed of nodes, single physical or logical computers with one or more cores and one or more processors. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum bandwidth of 1 Gigabit.

Fig. 7. Data structure, programs, and files associated with training by the neural network. Item 1 represents an XML file for a region (state) with its status (item 2). Core resources are shown in item 3. Item 4 displays the status of each task (item 5) within a job.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration; this server also ran Windows Server 2008 R2. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers. Based on the results of the file number per folder analysis and the use of unique unit IDs as part of the file and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state), and, for forecasts, years and scenarios.
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
4.2 Preliminary tests
The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.

Early tests with mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation per core, we found that running 24 operations on one machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3 Data preparation for national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shapefiles: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads, and slope.

Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local government and were thus deemed to be of high quality. The original data were vector and were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point locations. We used the point locations (the centroid of a town, city, or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with its FIPS-designated census place value.
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): governmental units and fixed-size tiles. The three scales for our governmental unit simulations were national, county, and places (cities, villages, and towns).

All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clip.py Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat program was executed per census place to convert the files from ASC to PAT.
4.4 Pattern recognition simulations for the national model
We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back-propagation algorithm. We followed the MSE during training, saving its value every 100 cycles, and found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.

Testing occurred at the scale of tiles. The LTM-clip.py script was used to create testing pattern files for each of the 634 tiles. The ltm49500n NET file was applied to each tile's PAT file to create a RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 indicate a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integers.

Fig. 10. Spatial units involved in the LTM-HPC national simulation.
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
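A minimal sketch of how a national grid can be divided into 4000 x 4000 pixel tiles with three-digit codes is shown below; the national row and column counts are placeholders, since the real extent comes from the 30 m national input rasters:

    TILE = 4000

    def tile_index(n_rows, n_cols, tile=TILE):
        """Yield (code, row_offset, col_offset, rows_in_tile, cols_in_tile)."""
        code = 0
        for r in range(0, n_rows, tile):
            for c in range(0, n_cols, tile):
                code += 1
                yield "%03d" % code, r, c, min(tile, n_rows - r), min(tile, n_cols - c)

    # Example with a hypothetical national grid size:
    # for code, r0, c0, nr, nc in tile_index(100000, 160000):
    #     clip_and_write("%s.asc" % code, r0, c0, nr, nc)   # hypothetical helper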
4.5 Calibration of the national simulation
We trained six neural network versions of the model: one that contained all five input variables and five that each contained four input variables, where we dropped one input variable from the full model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full-input model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. The plot also illustrates how the neural network behaves: between 0 and approximately 23,000 cycles, the network makes large adjustments in the weights, activation function values, and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute positively toward the goodness of fit, and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
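One plausible formulation of the percent difference plotted in Fig. 11 is sketched below; a positive value means the reduced model fits worse, i.e. the dropped driver contributes positively to goodness of fit, which matches the sign convention used in the text:

    import numpy as np

    def percent_difference_from_full(mse_full, mse_reduced):
        """Percent difference in MSE of a drop-one-out model relative to the
        full five-variable model, per saved cycle (every 100 cycles)."""
        mse_full = np.asarray(mse_full, dtype=float)
        mse_reduced = np.asarray(mse_reduced, dtype=float)
        return 100.0 * (mse_reduced - mse_full) / mse_full

    # curves = {name: percent_difference_from_full(full_mse, mse)
    #           for name, mse in reduced_mse.items()}   # one curve per dropped driver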
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 01234-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 01234-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
4.6 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 01234-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so it was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the area of each of the codes 0, 1, 2, and 3. The percentage of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few, along coastal areas, contained errors in quantity of urban greater than 5%.
The second goodness of fit assessment highlights the use of the HPC for a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 01234-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric was saved at the 10 x 10 window size (i.e., 3 km by 3 km) and PCM values were merged with the shapefile for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.

Fig. 11. Drop-one-out percent difference in MSE from the full driver model.
4.7 Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as separate tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task fails in a job we have records, via the standard output and standard error files, of what each program did during execution. We also embedded the paths of the standard output and standard error files in the tasks of the XML jobs file.

Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.
Wegenerated decadal maps of land use from 2010 through 2050
from this simulation Maps of new urban (red) superimposed on
2006 land usecover from the USGS National Land Cover Maps for
eight regions are shown in Fig 13 Note that the model produces
different spatial patterns of urbanization depending on the loca-
tion urbanization in the Los AngeleseSan Diego region are more
clumped likely to due to topographic limitations of the area in the
large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast
5 Discussion
We presented an overview of the conversion of a single work-
station land change model that has been converted to operate using
a high performance computer cluster The Land Transformation
Model was originally developedto simulatesmall areas (Pijanowski
et al 2000 2002b) such as watersheds However there is a need
for larger sized land change models especially those that can be
coupled to large-scale process models such as climate change (cf
Olson et al 2008 Pijanowski et al 2011) and dynamic hydrologic
models (Yang et al 2010 Mishra et al 2010) We have argued that
to accomplish the goal of increasing the size of the simulationseveral challenges had to be overcome These included (1) pro-
cessing of large databases (2) the management of large numbers of
1047297les (3) the need for a high-level architecture that integrates
model components (4) error checking and (5) the management of
multiple job executions Here we brie1047298y discuss how we addressed
these challenges as well as lessons learned in porting the original
LTM to an HPC environment
51 Challenges of executing large-scale models
We found that the large datasets used for input and output were
dif 1047297cult to manage successfully within ArcGIS 100 The 1047297les had to
be managed as smaller subsets either as states or regions (ie
multiple states) and in the case of Texas we had to manage thisstate as separate counties Programs written in C had to read and
write lines of data at a time rather than read large 1047297les into a large
array This is needed despite a large amount of memory contained
in the HPC
The large number of 1047297les were managed using a standard 1047297le
naming coding system and hierarchical arrangement of folders on
our storage server The coding system also helped us to construct
the xml 1047297le content used by the job manager in Windows 2008
Server R2
The high-level architecture was designed after the proper steps
that have been outlined by prominent land change modeling sci-
entists (Pontius et al 2004 Pontius et al 2008) These include
steps for (1) data sampling from input 1047297les (2) training (3) cali-
bration (4) validation and (5) application Job 1047297
les were
constructed for steps the interfaced each of these modeling steps
In fact we found quickly discovered that the most logical directory
structure mirrored the high-level architecture of the model
We experienced that jobs or tasks can fail because of one of the
following errors (1) one or more tasks in the job have failed This
indicates that one or more tasks could not be run or did not com-
plete successfully We speci1047297ed standard output and standard error
1047297les in the job description to determine which executable 1047297les fail
during execution (2) A nodeassignedto the job ortaskcouldnot be
contacted Jobs or tasks that fail because of a node falling out of
contact are automatically retried a certain numberof times but will
eventually failif the problem continues (3) The run time for a job or
task run expired The job scheduler service cancels jobs or tasks
that reach the end of their run time (4) A 1047297le location required by
the job or task could not be accessed A frequent cause of task
failures is inaccessibility of required 1047297le locations including the
standard input output and error 1047297les and the working directory
locations
52 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simula-
tion meant that simple folder structure were only needed whichmade it easy to mosaic manually However the prediction output
was stored by state and by year which made mosaicking a time
consuming and an error prone process in some cases we needed
to manually mosaic a few areas as the job manager would crash
The HPC was employed to speed up the mosaicking process but this
was not a fail-safe process A short python script that ran the ArcGIS
mosaic raster tool was the heart of the process The 9000 network1047297les of LTM-HPC generated from training run were applied to each
pattern 1047297le derived from boxes that contained all of the cells in the
USA except those within the exclusionary zone Finally states were
manually mosaicked to create the national probability map for USA
(Fig 13)
A windows HPC cluster was used to decrease the time required
to process the data by running the model on multiple spatial unitssimultaneously The time required to run LTM can be thought of as
the time it would take to run the LTM-HPC serially When running
the LTM-HPC the amount of time required relative to LTM is
approximately halved for every doubling of cores This variance in
processing time is caused by variance in 1047297le size The HPC also
provides additional bene1047297ts to researchers who are interested in
running large-scale models These include the reduction in the
need for human control of various steps which thereby reduces the
changes of human error It also allows researchers to execute the
model in a variety of con1047297gurations (eg here we were able to run
the model using different spatial units testing issues related to
scale) allowing for researchers to run ensembles
We also found that developing and executing the model across
three computer systems (data storage data processing and codingand simulation) worked well Delegating tasks to each of these
helped to manage work 1047298ow and optimize the purpose of each
computer system
53 Needs for land change model forecasts at large extents and 1047297ne
resolution
Models that must simulate large areas at 1047297ne resolutions and
produce output that has multiple time steps require the handling
and management of big data Environmental simulations have
traditionally focused on small spatial extents at 1047297ne resolutions to
produce the required output However environmental problems
are often at large extents and coarse resolution simulations or
alternatively at small extents and 1047297
ne resolutions may hinder the
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268264
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1619
ability to assess impacts at the necessary scale Land change models
are often used to assess how human use of the land may impact
ecosystem health It is well known that land use cover change
impacts ecosystem processes at a variety of spatial scales ( Reid
et al 2010 GLP 2005 Lambin and Geist 2006) Some of the
most frequently cited ecosystem impacts include how land use
change at large extents affect the total amount of carbon
sequestered in aboveground plants and soils in a region (eg Dixon
et al 1994 Post and Kwon 2000 Cox et al 2000 Vleeshouwers
and Verhagen 2002 Guo and Gifford 2002) how patterns and
amounts of certain land covers (eg forests urban) affect invasive
species spread and distributions (eg Sharov et al 1999 Fei et al
2008) how land surface properties feedback to the atmosphere
through alterations of water and energy 1047298
uxes (eg Dale 1997
Fig 13 LTM 2050 urban change forecasts for different regions (For interpretation of the references to color in this 1047297gure legend the reader is referred to the web version of this
article)
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 265
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1719
Pielke 2005 Bonan 2008 Pijanowski et al 2011) how certain
land uses such as urban and agriculture increase nutrients and
pollutants to surface and ground water bodies (Pijanowski et al
2002b Tang et al 2005ab) and how land use patterns affect
biodiversity of terrestrial (Pekin and Pijanowski 2012) and aquatic
ecosystems such as freshwater 1047297sh organisms (Wiley et al 2010)
In all cases more urban decreases ecosystem health (cf Pickett
et al 1997 Reid et al 2010 Grimm and Redman 2004 Kaye
et al 2006)
Assessment of land use change impacts has often occurred by
coupling land change models to other environmental models For
example the LTM has been coupled to the Regional Atmospheric
Modeling Systems (RAMS) to assess how land use change might
impact precipitation patterns at subcontinental scales in East Africa
(Moore et al 2010) coupled to the Variable Impact Calculator (VIC)
model in the Great Lakes basin (Yang 2011) to the Long-Term Hy-
drologic Impact Assessment (L-THIA) model assess how land use
change might impact overland 1047298ow patterns in large regional wa-
tersheds and how nutrient1047298uxes and pollutants from urban change
would impact stream ecosystem health in large watersheds (Tang
et al 2005ab) The next step in our development will be to
couple the output of this model to a variety of environmental
impact models that are spatially explicit We intend to conduct thatwork in the HPC environment using the principles that we outline
above
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially those of vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON) supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on the continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4 Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and (5) measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States; patterns of location error differed from those of quantity and were lower in the east (Fig. 12). The location of errors could also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6 Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focus on efficiently handling the challenges raised by the nature of large datasets and show how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhance the training and testing runs of the LTM and enable application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:
(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for each unit, in the Python environment (a minimal sketch of this step follows the list).
(2) Better memory usage: the source code for the model in the C environment has been changed so that calculations performed by the LTM-HPC are completely independent of the size of the ASCII files, by reading each line of a file separately into an array rather than loading the whole grid.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time. The LTM-HPC now uses a unique code for different regions in XML format and can repeat all of the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands. All XML files are now uploaded into an HPC environment and all modeling steps are automatically processed.
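As a concrete illustration of item (1), the following minimal sketch shows how per-unit clipping and ASCII export could be scripted with arcpy. It is not the authors' LTM-clippy code; the paths, the UNIT_ID field, and the driver raster names are hypothetical.

import arcpy

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"D:\ltm\inputs"            # hypothetical workspace holding the national driver rasters
arcpy.env.overwriteOutput = True
units = r"D:\ltm\units\counties.shp"              # polygons carrying a unique numeric UNIT_ID
drivers = ["dist_urban.tif", "dist_roads.tif", "slope.tif"]   # illustrative driver names

for row in arcpy.SearchCursor(units):
    unit_id = row.getValue("UNIT_ID")
    layer = arcpy.MakeFeatureLayer_management(units, "unit_lyr", "UNIT_ID = {0}".format(unit_id))
    for name in drivers:
        clipped = r"D:\ltm\work\{0}_{1}.tif".format(name.split(".")[0], unit_id)
        # clip the national raster to the unit boundary
        arcpy.Clip_management(name, "#", clipped, layer, "-9999", "ClippingGeometry")
        # export the clipped grid as an ASC file named by the unit code
        arcpy.RasterToASCII_conversion(clipped, r"D:\ltm\asc\{0}_{1}.asc".format(name.split(".")[0], unit_id))
    arcpy.Delete_management(layer)

Because the loop is driven by the unit table, the same script serves states, counties, or any other division for which a unique identifier exists.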
names files based on the unit code. The resultant mosaicked suitability map produced from training and data transposing constitutes a map of the entire study domain. Creating such a suitability map of the entire simulation domain allows us to (1) import the ASCII file into ArcGIS in order to inspect and visualize the suitability map, (2) allow the researcher to use different subsetting and mosaicking spatial units (as we did below), and (3) allow the researcher to forecast at different spatial units (we also illustrate this below).
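For the first of these uses, a couple of lines of arcpy are enough to pull the mosaicked ASCII suitability file back into ArcGIS for inspection; the file names below are hypothetical.

import arcpy
asc_file = r"D:\ltm\output\suitability_national.asc"
out_raster = r"D:\ltm\output\suitability_national.tif"
# convert the ASC grid to a raster dataset so it can be symbolized and inspected
arcpy.ASCIIToRaster_conversion(asc_file, out_raster, "INTEGER")
arcpy.CalculateStatistics_management(out_raster)   # statistics for display stretching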
4 Execution of LTM-HPC
4.1 Hardware and software description
We executed the LTM-HPC on three computer systems (Fig. 8). One computer, a high-end workstation, was used to process inputs for the modeling using GIS. A Windows cluster was used to configure the LTM-HPC, and all of the processing of about a dozen steps occurred on this computer system. A third computer system stored all of the data for the simulations. The specific configuration of each computer system follows.
Data preparation was performed on a high-end Windows 7 Enterprise 64-bit workstation equipped with 24 GB of RAM, a 256 GB solid state drive, a 2 TB local hard drive, and ArcGIS 10.0 with the Spatial Analyst extension. Specific procedures used to create each of the data layers for input to the LTM can be found elsewhere (Pijanowski et al., 1997; Tayyebi et al., 2012). Briefly, data were processed for the entire contiguous United States at 30 m resolution, and distances to key features like roads and streams were computed using the Euclidean Distance tool in ArcGIS, with all output stored as double precision integer. Given the large size of each dataset, we limited the distance to 250 km. Once the data were processed on the workstation, files were moved to the storage server.
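A minimal sketch of this distance-layer step, assuming hypothetical file paths; it uses the Spatial Analyst EucDistance function with the 250 km cap described above.

import arcpy
from arcpy.sa import EucDistance, Int

arcpy.CheckOutExtension("Spatial")
arcpy.env.cellSize = 30   # 30 m national grid
# distance from every cell to the nearest highway, capped at 250 km to bound file size
dist = EucDistance(r"D:\ltm\inputs\highways.shp", 250000, 30)
# store as integer metres, matching the paper's double precision integer grids
Int(dist).save(r"D:\ltm\inputs\dist_highways.tif")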
The hardware platform on which the parallelization was carried out was an HPC cluster of five nodes containing a total of 20 cores, running Windows HPC Server 2008. Each node was powered by a pair of dual-core AMD Opteron 285 processors and 8 GB of RAM. Each machine had two 1 Gb network adapters, one used for cluster communication and the other for communication outside the cluster. Each node had 74 GB of hard drive space that was used for the operating system and software but not for modeling. The HPC cluster used for our national LTM application consisted of one server (i.e., the head node) that controls other servers (i.e., compute nodes), which read and write data from a data server. A cluster is the top-level unit and is composed of nodes, single physical or logical computers with one or more cores that include one or more processors. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum of 1 Gigabit bandwidth.
Fig. 7. Data structure, programs, and files associated with training by the neural network. Item 1 represents an XML file for a region (state) with its status (item 2); core resources are shown in item 3; item 4 displays the status of each task (item 5) within a job.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows Server 2008 R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.
Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file- and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory tree separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
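The directory logic can be captured in a small helper. The layout below is our reading of the Fig. 9 description, and the root path, folder names, and example values are illustrative only.

import os

def ltm_path(root, phase, filetype, scale, unit_id, year=None, scenario=None):
    # root/input|output/ASC|PAT/national|state/<unit_id>[/year/scenario]
    parts = [root, phase, filetype, scale, str(unit_id)]
    if year is not None:
        parts.append(str(year))
    if scenario is not None:
        parts.append(scenario)
    path = os.path.join(*parts)
    if not os.path.isdir(path):
        os.makedirs(path)          # create the branch on first use
    return path

# e.g. a forecast output folder for unit 18, year 2050, baseline scenario
print(ltm_path(r"D:\ltm", "output", "ASC", "state", 18, 2050, "baseline"))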
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
4.2 Preliminary tests
The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when the RES file, which contains the values for all of the drivers (e.g., distance to urban, etc.), crashed. To overcome this issue, we divided the country into grid tiles that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led us to mosaic by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
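The per-task core limits were expressed in the XML job descriptions submitted to the Windows HPC job scheduler. The fragment below sketches how such a description could be generated in Python; the element and attribute names (Job, Task, MinCores, StdErrFilePath) and the mosaic.py command line are illustrative and are not the paper's actual XML_Pred_Mosaic_BASE schema.

import xml.etree.ElementTree as ET

def mosaic_job(state, years, cores):
    # one job per state; one mosaic task per forecast year, each capped at 'cores' cores
    job = ET.Element("Job", Name="Mosaic_" + state,
                     MinCores=str(cores), MaxCores=str(cores * len(years)))
    tasks = ET.SubElement(job, "Tasks")
    for year in years:
        ET.SubElement(
            tasks, "Task",
            Name="mosaic_{0}_{1}".format(state, year),
            MinCores=str(cores), MaxCores=str(cores),
            CommandLine="python mosaic.py {0} {1}".format(state, year),
            StdOutFilePath=r"\\storage\logs\{0}_{1}.out".format(state, year),
            StdErrFilePath=r"\\storage\logs\{0}_{1}.err".format(state, year))
    return ET.ElementTree(job)

# 6 cores per task for most states, 12 for very large ones such as CA and TX
mosaic_job("CA", list(range(2010, 2061, 10)), cores=12).write("Mosaic_CA.xml")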
4.3 Data preparation for the national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shape files: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance of each pixel from the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads, and slope.
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector, and they were converted to raster using the simulation dimensions described above. Data from the states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shape files were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point locations. We used the point locations (the centroid of a town, city, or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.
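The Thiessen polygon step is a single geoprocessing call; the file and field names below are hypothetical, and the place identifier is assumed to be carried on the point file.

import arcpy
places = r"D:\ltm\inputs\census_places.shp"        # census place centroids with a FIPS place field
thiessen = r"D:\ltm\inputs\places_thiessen.shp"
# build one polygon per place representing the area closest to that urban center;
# "ALL" copies the attribute fields (including the FIPS place code) to the polygons
arcpy.CreateThiessenPolygons_analysis(places, thiessen, "ALL")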
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government units and fixed-size tiles. The three scales for our government unit simulations were national, county, and places (cities, villages, and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local workstation and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat utility was executed per census place to convert the files from ASC to PAT.
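The ASC-to-PAT step can be pictured as streaming the clipped driver grids cell by cell and writing one training case per pixel. The sketch below is a simplified stand-in for createpat, assuming hypothetical file names: it ignores nodata and exclusionary cells and does not reproduce the real SNNS .pat header.

import itertools

def asc_values(path):
    # stream one ASC grid value at a time, skipping the 6-line ESRI header,
    # so no full-size array is ever held in memory
    with open(path) as f:
        for line in itertools.islice(f, 6, None):
            for v in line.split():
                yield float(v)

def write_pattern_file(driver_paths, change_path, out_path):
    # one input/output case per pixel: driver values followed by observed change (0/1)
    streams = [asc_values(p) for p in driver_paths] + [asc_values(change_path)]
    with open(out_path, "w") as out:
        for case in zip(*streams):
            out.write(" ".join(str(v) for v in case) + "\n")

write_pattern_file(
    ["dist_urban_18.asc", "dist_roads_18.asc", "slope_18.asc"],  # hypothetical unit 18 inputs
    "change_90_00_18.asc", "unit_18.pat")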
4.4 Pattern recognition simulations for the national model
We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward back propagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The NET file from cycle 49,500 was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values from 0.0 to 1.0, where values closer to 1.0 mean a higher chance of changing). The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they may be stored as double precision integer.
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grid tiles across the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
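The scaling and tile-naming conventions can be sketched as follows. This is not the authors' convert2ascii.exe: it assumes the SNNS output has already been reduced to one suitability value per pixel, and the corner coordinates in the header are placeholders.

def res_to_asc(probabilities, tile_code, nrows, ncols, cellsize=30.0):
    # probabilities: iterable of 0.0-1.0 suitability values, one per pixel (row-major)
    out_path = "prob_{0:03d}.asc".format(tile_code)   # three-digit tile code in the file name
    vals = iter(probabilities)
    with open(out_path, "w") as out:
        # placeholder ESRI ASCII header; real corner coordinates come from the tile scheme
        out.write("ncols {0}\nnrows {1}\nxllcorner 0\nyllcorner 0\n"
                  "cellsize {2}\nNODATA_value -9999\n".format(ncols, nrows, cellsize))
        for _ in range(nrows):
            # scale each suitability by 100,000 so it can be stored as an integer grid
            row = [str(int(round(next(vals) * 100000))) for _ in range(ncols)]
            out.write(" ".join(row) + "\n")
    return out_path

# e.g. res_to_asc([0.12, 0.87, 0.05, 0.66], tile_code=7, nrows=2, ncols=2)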
4.5 Calibration of the national simulation
We trained six neural network versions of the model: one that contained all five input variables and five that contained four input variables each, where we dropped one input variable from the full-input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference of MSE from the full-input-variable model (Fig. 11). Note that all of the variables make a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values of activation functions and biases. At one point, around 7,000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highways contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to roads and distance to streams each about 0.7%. We conclude from this drop-one-out calibration that (1) all five variables contribute in a positive way toward the goodness of fit and (2) 49,500 cycles provide enough learning of the full five-variable model to use for validation.
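The percent-difference curve in Fig. 11 is simple to reproduce once the per-cycle MSE series are exported from the training runs; the numbers below are invented purely to show the calculation.

def percent_difference(mse_full, mse_drop):
    # positive values mean the reduced model fits worse, i.e. the dropped driver helps
    return [100.0 * (d - f) / f for d, f in zip(mse_drop, mse_full)]

# toy illustration with made-up MSE values saved at cycles 100, 200, 300
full = [0.210, 0.180, 0.162]
no_highways = [0.214, 0.186, 0.165]
print(percent_difference(full, no_highways))   # roughly 1.9, 3.3 and 1.9 percent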
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0-1-2-3-4-coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0-1-2-3-4-coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
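A sketch of building such a coded map with numpy. The meaning of the individual code values is not spelled out in the text, so the hit/miss/false-alarm/exclusion coding used here is an assumption for illustration only.

import numpy as np

def coded_map(observed_change, predicted_change, exclusion):
    # observed_change, predicted_change, exclusion: boolean arrays on the same grid.
    # Assumed coding: 0 = no change observed or predicted, 1 = hit, 2 = miss,
    # 3 = false alarm, 4 = excluded cells.
    coded = np.zeros(observed_change.shape, dtype=np.uint8)
    coded[observed_change & predicted_change] = 1
    coded[observed_change & ~predicted_change] = 2
    coded[~observed_change & predicted_change] = 3
    coded[exclusion] = 4
    return coded

obs = np.array([[True, False], [True, False]])
pred = np.array([[True, True], [False, False]])
excl = np.zeros((2, 2), dtype=bool)
print(coded_map(obs, pred, excl))   # [[1 3] [2 0]]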
4.6 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0-1-2-3-4-coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few tiles along coastal areas contained errors in the quantity of urban greater than 5%.
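The per-tile tabulation can be scripted as below; the paths and the TILE_ID field are hypothetical, and deriving the percent of urban correctly predicted from the resulting table relies on the code meanings assumed earlier.

import arcpy
from arcpy.sa import TabulateArea

arcpy.CheckOutExtension("Spatial")
tiles = r"D:\ltm\units\tiles.shp"                 # tile polygons with a TILE_ID field
coded = r"D:\ltm\output\validation_coded.tif"     # the 0-1-2-3-4 coded validation raster
out_tab = r"D:\ltm\output\quantity_by_tile.dbf"
# area of each code value within every tile; the percent of urban correctly predicted
# per tile can then be computed from the hit and miss columns of this table
TabulateArea(tiles, "TILE_ID", coded, "Value", out_tab, 30)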
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0-1-2-3-4-coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 x 3 window size through a 101 x 101 window size. The percent correct metric (PCM) was saved at the 10 x 10 window size (i.e., 3 km by 3 km) and the PCM values were merged with the shape file for tiles. Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
Fig. 11. Drop-one-out percent difference in MSE from the full-driver model.
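One plausible reading of the windowed percent correct metric, computed over non-overlapping blocks of the coded map with numpy; the authors' scaleable-window routine may differ in detail, and the hit/miss codes are the assumption introduced above.

import numpy as np

def percent_correct(coded, window):
    # percent of observed-change cells predicted correctly within each
    # window x window block of the coded map (1 = hit, 2 = miss)
    nrows, ncols = coded.shape
    nr, nc = nrows // window, ncols // window
    blocks = coded[:nr * window, :nc * window].reshape(nr, window, nc, window)
    hits = (blocks == 1).sum(axis=(1, 3)).astype(float)
    misses = (blocks == 2).sum(axis=(1, 3)).astype(float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return 100.0 * hits / (hits + misses)   # NaN where no change was observed

# e.g. pcm = percent_correct(coded_tile, 10) for the window size reported above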
4.7 Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed a quantity model (Tayyebi et al., 2012) that contained the number of urban cells to grow for each polygon, for 10-year time steps from 2010 to 2060. We considered each state as a job, including all the polygons within the state as different tasks, to create forecast maps for each polygon. We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files so that, if any task fails in a job, the standard out and standard error files provide a record of what the program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML jobs file.
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 x 3 km.
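The per-polygon conversion of a probability map into a forecast map can be pictured as transitioning the highest-suitability cells until the quantity model's cell count for that polygon is met. The numpy sketch below assumes that simple ranking rule; the production code may apply additional rules.

import numpy as np

def allocate_urban(suitability, excluded, n_new_cells):
    # transition the n highest-suitability, non-excluded cells in one polygon,
    # assuming the usual LTM rule of converting the most suitable cells first
    scores = np.where(excluded, -1.0, suitability).ravel()
    new_urban = np.zeros(scores.shape, dtype=bool)
    if n_new_cells > 0:
        top = np.argsort(scores)[::-1][:n_new_cells]   # indices of the top-ranked cells
        new_urban[top] = True
    return new_urban.reshape(suitability.shape)

suit = np.array([[0.91, 0.10], [0.65, 0.40]])
excl = np.array([[False, False], [True, False]])
print(allocate_urban(suit, excl, 2))   # picks the 0.91 and 0.40 cells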
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to topographic limitations in that large metropolitan area, while dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.
Fig. 13. LTM 2050 urban change forecasts for different regions.
5 Discussion
We presented an overview of the conversion of a single-workstation land change model to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger land change models, especially those that can be coupled to large-scale process models such as climate (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that, to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges as well as lessons learned in porting the original LTM to an HPC environment.
5.1 Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array; this was needed despite the large amount of memory contained in the HPC.
The large number of files was managed using a standard file naming and coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation, and (5) application. Job files were constructed for the steps that interface each of these modeling stages. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.
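Because every task writes its own standard-error file (error types (1) and (4) above), a short script can flag the tasks that need to be re-submitted; the log directory below is hypothetical.

import os

def failed_tasks(log_dir):
    # any task whose standard-error file is non-empty is flagged for re-submission
    failures = []
    for name in os.listdir(log_dir):
        if name.endswith(".err") and os.path.getsize(os.path.join(log_dir, name)) > 0:
            failures.append(name)
    return failures

# e.g. failed_tasks(r"\\storage\logs") lists the tasks to re-run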
5.2 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files generated by the LTM-HPC training runs were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
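The mosaicking step itself reduces to a call to the ArcGIS Mosaic To New Raster tool; the sketch below wraps it for one state and year, with the file naming pattern and pixel type chosen for illustration only.

import arcpy

def mosaic_state(asc_dir, state, year):
    # mosaic all per-polygon forecast grids for one state and year into a single raster
    arcpy.env.workspace = asc_dir
    pieces = arcpy.ListRasters("pred_{0}_*_{1}.asc".format(state, year))  # hypothetical naming
    arcpy.MosaicToNewRaster_management(pieces, asc_dir,
                                       "forecast_{0}_{1}.tif".format(state, year),
                                       number_of_bands=1, pixel_type="32_BIT_UNSIGNED",
                                       mosaic_method="MAXIMUM")

# e.g. mosaic_state(r"D:\ltm\output\ASC\state\IN", "IN", 2050)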
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially; when running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores, although not exactly, because variance in processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale) and to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage the workflow and optimize the purpose of each computer system.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010 (July). Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. Neural Netw. - IJCNN, 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997 (June). A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1119
is composed of nodes, i.e., single physical or logical computers with one or more processors, each containing one or more cores. All modeling data were read and written to a storage machine located in another building and transferred across an intranet with a maximum bandwidth of 1 Gigabit per second.
The data storage server was composed of 24 two-terabyte 7200 RPM drives in a RAID 6 configuration. This server also had Windows Server 2008 R2 installed. Spot checks of resource monitoring showed that the HPC was not limited by network or disk access and typically ran in bursts of 100% CPU utilization. ArcGIS 10.0 with the Spatial Analyst extension was installed on all servers.
Based on the results of the file-number-per-folder tests and the use of unique unit IDs as part of the file- and directory-naming scheme, we used a hierarchical directory structure as shown in Fig. 9. The upper branches of the directory separate files into input and output directories, and subfolders store data by type (ASC or PAT files), location, unit scale (national, state) and, for forecasts, years and scenarios.
Fig. 9. Directory structure for the LTM-HPC simulation.
Fig. 8. Computer systems involved in the LTM-HPC national simulations.
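To make the directory layout above concrete, the following is a minimal Python sketch of how such a hierarchy could be generated programmatically. The root path, folder names, and unit IDs are hypothetical placeholders, not the exact names used in our file system, and the sketch assumes a modern Python 3 interpreter.

```python
import os

# Hypothetical root of the storage server share
ROOT = r"\\storage-server\ltm_hpc"

SCALES = ["national", "state"]          # location unit scales
FILE_TYPES = ["asc", "pat"]             # data stored by type
FORECAST_YEARS = range(2010, 2061, 10)  # decadal forecast steps
SCENARIOS = ["baseline"]                # forecast scenarios (placeholder)

def build_tree(unit_ids):
    """Create input/output branches, then type, scale, unit, and
    (for outputs) year/scenario subfolders for every spatial unit."""
    for unit in unit_ids:
        for ftype in FILE_TYPES:
            for scale in SCALES:
                os.makedirs(os.path.join(ROOT, "input", ftype, scale, unit),
                            exist_ok=True)
        for year in FORECAST_YEARS:
            for scen in SCENARIOS:
                os.makedirs(os.path.join(ROOT, "output", "forecast",
                                          str(year), scen, unit),
                            exist_ok=True)

if __name__ == "__main__":
    build_tree(["018", "048", "106"])   # example three-digit unit IDs
```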
4.2. Preliminary tests
The primary limitation in file size comes from SNNS. This limit was reached in the probability map creation phase in several western US counties, when generation of the RES file, which contains the values for all of the drivers (e.g., distance to urban), crashed. To overcome this issue, we divided the country into grids that produced files SNNS was capable of handling for the steps up to and including pattern file creation, which is done on a pixel-by-pixel basis and is not spatially dependent. For organization and performance reasons, files were grouped into folders by state. As SNNS only uses the probability values in the projection phase, we were able to project at the county level.
Early tests with mosaicking the entire country at once were unsuccessful and led to mosaicking by state. The number of states and years of projection for each state made populating the tool fields in ArcGIS 10.0 Desktop a time-intensive process. We used Python scripts to overcome this issue, and the HPC to process multiple years and multiple states at the same time. Although it is possible to run one mosaic operation for each core, we found that running 24 operations on a machine led to corrupted mosaics. We attribute this to the large file sizes and limited scratch space (approximately 200 GB); to overcome this problem, we limited the number of operations per server by assigning each task 6 cores for most states and 12 cores for very large states such as CA and TX.
4.3. Data preparation for national simulation
We used ArcGIS 10.0 and Spatial Analyst to prepare five inputs for use in training and testing of the neural network. Details of the data preparation can be found elsewhere (Tayyebi et al., 2012), although a brief description of the processing and the files that were created follows. We used the US Census 2000 road network line work to create two road shapefiles: highways and main arterials. We used ArcGIS 10.0 Spatial Analyst to calculate the distance from each pixel to the nearest road. Other inputs included distance to previous urban (circa 1990), distance to rivers and streams, distance to primary roads (highways), distance to secondary roads (roads), and slope.
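As an illustration of this step, the snippet below sketches how the distance-based drivers and slope could be derived with the arcpy Spatial Analyst module. The workspace path, layer names, and output file names are assumptions for illustration, not the actual project files.

```python
import arcpy
from arcpy.sa import EucDistance, Slope

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\ltm_hpc\drivers"   # hypothetical workspace
arcpy.env.cellSize = 30                       # 30 m simulation resolution

# Hypothetical source layers for the distance drivers
sources = {
    "dist_highways":   "highways.shp",
    "dist_roads":      "main_arterials.shp",
    "dist_streams":    "rivers_streams.shp",
    "dist_urban_1990": "urban_1990.tif",
}

# Euclidean distance from every pixel to the nearest source feature
for name, layer in sources.items():
    EucDistance(layer).save(name + ".tif")

# Percent slope from a national elevation model (hypothetical file name)
Slope("dem_30m.tif", "PERCENT_RISE").save("slope.tif")
```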
Preparing data for neural net training required the following steps. Land use data from approximately 1990 and 2000 were collected from 18 different municipalities and 3 states. These data were derived from aerial photography by local governments and were thus deemed to be of high quality. The original data were vector and were converted to raster using the simulation dimensions described above. Data from states were used to select regions in rural areas using a random site selection procedure (described in Tayyebi et al., 2012).
Maps of public lands were obtained from the ESRI Data Pack 2011 (ESRI, 2011). Public land shapefiles were merged with locations of urban and open water in 1990 (using data from the USGS national land cover database) and used to create the exclusionary layer for the simulation. Areas that were not located within the training area were set to "no data" in ArcGIS. Data from the US Census Bureau for places are distributed as point location data. We used the point locations (the centroid of a town, city, or village) to construct Thiessen polygons representing the area closest to a particular urban center (Fig. 10). Each place was labeled with the FIPS-designated census place value.
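A minimal sketch of the Thiessen polygon step with arcpy is shown below; the input point file, output name, and workspace are hypothetical, and the tool requires the appropriate ArcGIS license level.

```python
import arcpy

arcpy.env.workspace = r"C:\ltm_hpc\places"   # hypothetical workspace

# Census place centroids (points) -> Thiessen polygons, one per urban center.
# "ALL" copies the point attributes (including the FIPS place code) to the polygons.
arcpy.CreateThiessenPolygons_analysis(
    in_features="census_place_points.shp",
    out_feature_class="census_place_thiessen.shp",
    fields_to_copy="ALL")
```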
We executed the national LTM-HPC at three different spatial scales and using two different kinds of spatial units (Tayyebi et al., 2012): government and fixed-size tiles. The three scales for our government unit simulations were national, county, and places (cities, villages, and towns).
All input maps were created at a national scale at 30 m cell resolution. For training, data were subset using ArcGIS on the local computer workstation, and pattern files were created for training and first-phase testing (i.e., calibration). We also used the LTM-clippy Python script to create subsamples for second-phase testing. Inputs and the exclusionary maps were clipped by census place and then written out as ASC files. The createpat file was executed per census place to convert the files from ASC to PAT.
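The clip-and-export step can be sketched as follows. The driver raster names, the mask layer, and in particular the command-line invocation of the createpat converter are illustrative assumptions rather than the actual interface of that tool.

```python
import os
import subprocess
import arcpy
from arcpy.sa import ExtractByMask

arcpy.CheckOutExtension("Spatial")

# Hypothetical national driver rasters plus the exclusionary layer
drivers = ["dist_highways.tif", "dist_roads.tif", "dist_streams.tif",
           "dist_urban_1990.tif", "slope.tif", "exclusion.tif"]

def clip_place(place_id, mask_fc, out_dir):
    """Clip each input raster to one census place, write it as an ASC file,
    then call the (external) createpat converter to build the PAT file."""
    os.makedirs(out_dir, exist_ok=True)
    for raster in drivers:
        clipped = ExtractByMask(raster, mask_fc)
        out_asc = os.path.join(out_dir, "%s_%s.asc" % (place_id, raster[:-4]))
        arcpy.RasterToASCII_conversion(clipped, out_asc)
    # Hypothetical call signature for the ASC-to-PAT converter
    subprocess.call(["createpat.exe", out_dir, place_id])
```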
4.4. Pattern recognition simulations for the national model
We presented a training file with 284,477 cases (i.e., records or locations) to the neural network using a feedforward backpropagation algorithm. We followed the MSE during training, saving this value every 100 cycles. We found that the minimum MSE stabilized globally at 49,500 cycles. The SNNS network file (NET file) was produced every 100 cycles so that we could analyze the training later, but the network file for 49,500 cycles was saved and used to estimate output (i.e., the potential for a land use change to occur at each location) for testing.
Testing occurred at the scale of tiles. The LTM-clippy script was used to create testing pattern files for each of the 634 tiles. The ltm49500.NET file was applied to each tile PAT file to create an RES file for each tile. RES files contain estimates of the potential for each location to change land use (values 0.0 to 1.0, where values closer to 1.0 indicate a higher chance of changing).
Fig. 10. Spatial units involved in the LTM-HPC national simulation.
The RES files are converted to ASC files using a C program called convert2ascii.exe. The ASC probability maps for all tiles were mosaicked to a national raster file using an ArcGIS Python script. All original values, which range from 0.0 to 1.0, are multiplied by 100,000 by convert2ascii so that they can be stored as integer values.
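The following sketch shows the kind of arcpy Python script used for this mosaicking step; the paths, tile naming convention, and output raster name are assumptions.

```python
import glob
import arcpy

arcpy.CheckOutExtension("Spatial")

tile_dir = r"\\storage-server\ltm_hpc\output\prob_tiles"   # hypothetical path
tiles = glob.glob(tile_dir + r"\tile_*.asc")               # e.g., tile_001.asc ... tile_634.asc

# Convert the ASC probability tiles (values already scaled by 100,000 to integers)
# back to rasters, then mosaic them into a single national grid.
rasters = []
for asc in tiles:
    out = asc.replace(".asc", ".tif")
    arcpy.ASCIIToRaster_conversion(asc, out, "INTEGER")
    rasters.append(out)

arcpy.MosaicToNewRaster_management(
    input_rasters=";".join(rasters),
    output_location=r"\\storage-server\ltm_hpc\output",
    raster_dataset_name_with_extension="national_probability.tif",
    pixel_type="32_BIT_UNSIGNED",
    cellsize=30,
    number_of_bands=1,
    mosaic_method="MAXIMUM")
```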
We used three-digit codes as unique numbers for naming tile files and tracking them as tasks within states on the HPC (634 grids in the conterminous USA). Each tile contained a maximum of 4000 rows and 4000 columns of 30 m pixels. We were able to do this because the steps leading up to prediction work on a per-pixel basis, and thus the processing unit did not affect the output value.
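A sketch of how 4000 × 4000 pixel tile extents and their three-digit codes could be generated is given below; the national bounding box coordinates are placeholders.

```python
# Generate 4000 x 4000 pixel tile extents (30 m pixels) covering a national
# bounding box, labeling each tile with a three-digit code.
CELL = 30.0                 # pixel size in meters
TILE = 4000                 # rows/columns per tile
# Hypothetical national bounding box in an equal-area projection (meters)
XMIN, YMIN, XMAX, YMAX = -2400000.0, 200000.0, 2300000.0, 3200000.0

def tile_extents():
    tiles = {}
    code = 1
    y = YMIN
    while y < YMAX:
        x = XMIN
        while x < XMAX:
            tiles["%03d" % code] = (x, y,
                                    min(x + TILE * CELL, XMAX),
                                    min(y + TILE * CELL, YMAX))
            code += 1
            x += TILE * CELL
        y += TILE * CELL
    return tiles

if __name__ == "__main__":
    extents = tile_extents()
    print(len(extents), "tiles; tile 001 extent:", extents["001"])
```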
4.5. Calibration of the national simulation
We trained six neural network versions of the model: one that contained five input variables and five that contained four input variables each, where we dropped one input variable from the full input model. We saved the MSE every 100 cycles through 100,000 cycles and then calculated the percent difference in MSE from the full input variable model (Fig. 11). Note that all of the variables have a positive contribution to model goodness of fit during training; distance to highways provides the neural network with the most information necessary for it to fit input and output data. This plot also illustrates how the neural network behaves: between 0 cycles and approximately cycle 23,000, the neural network makes large adjustments in weights and in the values for activation functions and biases. At one point, around 7000 cycles, the model does better (i.e., the percentage difference in MSE is negative) without distance to streams as an input to the training data. Eventually, all drop-one-out models stabilize near 50,000 cycles, which is where the full five-variable model also stabilizes. At this number of training cycles, distance to highway contributes about 2% of the goodness of fit, distance to urban about 1.5%, slope about 1.2%, and distance to road and distance to streams each about 0.7%. We conclude from this drop-one-out calibration (1) that all five variables contribute in a positive way toward the goodness of fit and (2) that 49,500 cycles provide enough learning of the full five-variable model to use for validation.
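The percent-difference calculation behind Fig. 11 can be written compactly, assuming the metric is defined as (MSE of the drop-one-out model minus MSE of the full model) divided by the MSE of the full model, times 100; the example values are illustrative only.

```python
import numpy as np

def pct_diff_from_full(mse_full, mse_dropped):
    """Percent difference in MSE of a drop-one-out model relative to the
    full five-variable model, evaluated at the same training cycles.
    Under this definition, positive values indicate the dropped variable
    was contributing to model fit."""
    mse_full = np.asarray(mse_full, dtype=float)
    mse_dropped = np.asarray(mse_dropped, dtype=float)
    return 100.0 * (mse_dropped - mse_full) / mse_full

# Example: MSE logged every 100 cycles for the full model and for a model
# trained without the distance-to-highways driver (illustrative values).
full = [0.210, 0.150, 0.120, 0.118]
no_highways = [0.215, 0.158, 0.123, 0.120]
print(pct_diff_from_full(full, no_highways))
```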
The second step of calibration is to examine how well the model produces spatial maps of change compared to the observed data (e.g., Fig. 5A). We use the locations of observed change from the training map that are outside the training locations to create a 0-1-2-3-4 coded calibration map. The XML_Clip_BASE HPC jobs file was modified to receive the 0-1-2-3-4 coded calibration map, and general statistics (e.g., the percentage of each value) are created for the entire simulation domain and for smaller subunits (e.g., spatial units).
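A sketch of how such a coded map could be derived from observed and predicted change rasters is shown below. The specific assignment of codes 0 through 4 is an assumption for illustration, since the codes are not enumerated here.

```python
import numpy as np

def coded_map(observed_change, predicted_change, excluded):
    """Build an integer-coded comparison map from boolean arrays.
    Assumed coding: 0 = neither observed nor predicted change,
    1 = observed only (miss), 2 = predicted only (false alarm),
    3 = observed and predicted (hit), 4 = excluded cells."""
    obs = np.asarray(observed_change, dtype=bool)
    pred = np.asarray(predicted_change, dtype=bool)
    code = np.zeros(obs.shape, dtype=np.uint8)
    code[obs & ~pred] = 1
    code[~obs & pred] = 2
    code[obs & pred] = 3
    code[np.asarray(excluded, dtype=bool)] = 4
    return code
```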
4.6. Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map from the LTM to create a 0-1-2-3-4 coded validation map that was assessed for goodness of fit in several ways, two of which are presented here. The first goodness of fit metric examined how well the model predicted the correct number of urban cells per simulation tile. This analysis was not computationally rigorous, so the assessment was performed on the single workstation. We used the ArcGIS 10.0 TabulateArea command in Spatial Analyst to calculate the amount of area for each of the codes 0, 1, 2, and 3. The percentage of the amount of urban correctly predicted was then mapped (Fig. 12A). Note that the model predicted the correct amount of urban cells in most simulation tiles; only a few along coastal areas contained errors in quantity of urban greater than 5%.
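The per-tile area tabulation could look like the following; the layer names, field names, and output table path are assumptions.

```python
import arcpy
from arcpy.sa import TabulateArea

arcpy.CheckOutExtension("Spatial")

# Cross-tabulate the area of each validation code (0-3) within each
# simulation tile; the tile layer and coded raster names are hypothetical.
TabulateArea(
    in_zone_data="simulation_tiles.shp",
    zone_field="TILE_ID",
    in_class_data="validation_coded_2006.tif",
    class_field="VALUE",
    out_table=r"C:\ltm_hpc\validation\quantity_by_tile.dbf",
    processing_cell_size=30)
```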
The second goodness of fit assessment highlights the use of the HPC to perform a computationally rigorous calculation that characterizes location error. The XML_Clip_BASE jobs file was modified to receive the 0-1-2-3-4 coded validation map at the spatial unit of tiles. The XML_Scaleable jobs file was used to execute the scaleable window routine for each tile, from a 3 × 3 window size through a 101 × 101 window size. The percent correct metric was saved at the 101 × 101 window size (i.e., approximately 3 km by 3 km) and the PCM values were merged with the shapefile for tiles.
Fig. 11. Drop-one-out percent difference in MSE from the full driver model.
Note (Fig. 12B) that the model goodness of fit is best east of the Mississippi River, along the west coast, and in certain areas of the central United States where there are large metropolitan cities (e.g., Denver). Improvement of the model thus needs to concentrate on rural areas of the central and western portions of the United States. Similar maps are often constructed for different window sizes to determine if the scale of prediction changes spatially.
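The scaleable-window percent correct metric can be sketched as a moving-window comparison of observed and predicted change counts. This is a simplified stand-in for the routine invoked by XML_Scaleable, not necessarily the exact metric used by the LTM, and it uses only numpy.

```python
import numpy as np

def window_pcm(observed, predicted, window):
    """Percent correct metric over non-overlapping square windows: each
    window is scored by the smaller of its observed and predicted
    change-cell counts, summed over windows and divided by the total
    number of observed change cells."""
    obs = np.asarray(observed, dtype=bool)
    pred = np.asarray(predicted, dtype=bool)
    rows, cols = obs.shape
    matched = 0
    for r in range(0, rows, window):
        for c in range(0, cols, window):
            o = obs[r:r + window, c:c + window].sum()
            p = pred[r:r + window, c:c + window].sum()
            matched += min(o, p)
    total = obs.sum()
    return 100.0 * matched / total if total else np.nan

# Example: evaluate PCM for window sizes 3x3 through 101x101 on one tile
# for w in range(3, 102, 2):
#     print(w, window_pcm(obs_tile, pred_tile, w))
```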
4.7. Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML jobs files to construct the forecast maps at the national scale. We developed our quantity model (Tayyebi et al., 2012), which contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all of the polygons within the state as separate tasks, to create forecast maps for each polygon.
Fig. 12. Validation metrics of (A) quantity errors and (B) model goodness of fit (PCM) for a scaleable window size of 3 × 3 km.
We embedded the path of the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map to a forecast map for each polygon. We then ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level and create forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level and create a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard output and standard error files provided a record of what the program did during execution. We also embedded the paths of the standard output and standard error files in the tasks of the XML jobs file.
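To illustrate how the per-state job descriptions were assembled, the following sketch generates a job file with one task per polygon. The element and attribute names follow a generic structure, not the exact Windows HPC Server 2008 job schema, and the executable, log paths, and polygon IDs are placeholders.

```python
import xml.etree.ElementTree as ET

def build_state_job(state, polygon_ids, xml_path):
    """Write a simple XML job description: one job per state, one task per
    polygon, each task recording its command line and stdout/stderr paths."""
    job = ET.Element("Job", Name="LTM_Pred_%s" % state)
    for pid in polygon_ids:
        ET.SubElement(
            job, "Task",
            Name="pred_%s_%s" % (state, pid),
            CommandLine=r"C:\ltm_hpc\bin\predict.exe %s %s" % (state, pid),
            StdOutFilePath=r"\\storage-server\ltm_hpc\logs\%s_%s.out" % (state, pid),
            StdErrFilePath=r"\\storage-server\ltm_hpc\logs\%s_%s.err" % (state, pid))
    ET.ElementTree(job).write(xml_path, xml_declaration=True, encoding="utf-8")

# Example: build a job for Indiana with three hypothetical polygon IDs
build_state_job("IN", ["18001", "18003", "18005"], "XML_Pred_IN.xml")
```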
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles–San Diego region is more clumped, likely due to the topographic constraints of that large metropolitan area, whereas dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.
5. Discussion
We presented an overview of the conversion of a single-workstation land change model to operate using a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger-scale land change models, especially those that can be coupled to large-scale process models such as climate change (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.
5.1. Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C had to read and write lines of data at a time rather than read large files into a large array. This was needed despite the large amount of memory contained in the HPC.
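The same streaming principle, shown here in Python rather than the C used in the model, processes an ASCII grid one row at a time so memory use does not grow with file size; the file names and threshold are placeholders.

```python
def stream_threshold(in_asc, out_asc, cutoff):
    """Read an ESRI ASCII grid row by row and write a thresholded copy,
    keeping only one row in memory at a time."""
    with open(in_asc) as src, open(out_asc, "w") as dst:
        for i, line in enumerate(src):
            if i < 6:                      # pass the 6 header lines through
                dst.write(line)
                continue
            values = line.split()
            out = ["1" if float(v) >= cutoff else "0" for v in values]
            dst.write(" ".join(out) + "\n")

# Example: flag cells whose scaled probability exceeds 85,000 (i.e., 0.85)
# stream_threshold("national_probability.asc", "high_prob.asc", 85000)
```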
The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation, and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed; this indicates that one or more tasks could not be run or did not complete successfully. We specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because of a node falling out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.
5.2. Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; variance in processing time is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chances of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output with multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and coarse-resolution simulations, or alternatively simulations at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale.
Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011);
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
how certain land uses such as urban and agriculture increase nutrients and pollutants in surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), coupled to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and coupled to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON) supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors in quantity were greater in the eastern portion of the United States; patterns of location error differed from those of quantity and were lower in the east (Fig. 12). The location of errors could be important too if it affects the policies or outcomes of environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed to consider large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
6. Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identity for each unit, in the Python environment.

(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into an HPC environment and all modeling steps are processed automatically.
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61–73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385–1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Environ. Model. Softw. 40, 1–20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1–3), 27–33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444–1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24–34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356–1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27–28, 15–22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3–4), 317–324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152–160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413–425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184–187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753–769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68–77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255–261.
Denoeux, T., Lengellé, 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351–363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78–101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29–45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71–92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199–208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59–70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825–1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115–128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381–1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona–Phoenix. Urban Ecosyst. 7 (3), 199–213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345–360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286–302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369–399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36–44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192–199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705–1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413–1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21–32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503–515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028–1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28–29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461–471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8–9), 761–799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025–2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954–1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898–911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909–918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1–8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11–23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185–199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625–1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997, June. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. In: Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1219
42 Preliminary tests
The primary limitation in 1047297le size comes from SNNS This limit
was reached in the probability map creation phase in several
western US counties when the RES1047297lewhichcontainsthe values
for all of the drivers (eg distance to urban etc) crashed To
overcome this issue we divided the country into grids that pro-
duced 1047297les that SNNS was capable of handling for the steps up to
and including pattern 1047297le creation which is done on a pixel-by-
pixel basis and is not spatially dependent For organization and
performance reasons1047297leswere grouped into folders by stateAs the
SNNS only uses the probability values in the projection phase we
were able to project at the county level
Early tests with mosaicking the entire country at once were
unsuccessful and led to mosaicking by state The number of states
and years of projection for each state made populating the tool
1047297elds in ArcGIS 100 Desktop a time intensive process We used
python scripts to overcome this issue and the HPC to process
multiple years and multiple states at the same time Although it is
possible to run one mosaic operation for each core we found that
running 24 operations on a machine led to corrupted mosaics We
attribute this to the large 1047297le sizes and limited scratch space
(approximately 200 GB) and to overcome this problem we limitedthe number of operations per server by specifying each task to 6
cores for most states and 12 cores for very large states such as CA
and TX
43 Data preparation for national simulation
We used ArcGIS 100 and Spatial Analyst to prepare 1047297ve inputs
for use in training and testing of the neural network Details of the
data preparation can be found elsewhere (Tayyebi et al 2012)
although a brief description of processing and the 1047297les that were
created follow We used the US Census 2000 road network line
work to create two road shape 1047297les highways and main arterials
We used ArcGIS 100 Spatial Analyst to calculate the distance that
each pixel was away from the nearest road Other inputs includeddistance to previous urban (circa 1990) distance to rivers and
streams distance to primary roads (highways) distance to sec-
ondary roads (roads) and slope
Preparing data for neural net training required the following
steps Land use data from approximately 1990 and 2000 were
collected from 18 different municipalities and 3 states These data
were derived from aerial photography by local government and
were thus deemed to be of high quality Original data were vector
and they were converted to raster using the simulation dimensions
described above Data from states were used to select regions in
rural areas using a random site selection procedure (described in
Tayyebi et al 2012)
Maps of public lands were obtained from ESRI Data Pack 2011
(ESRI 2011) Public land shape 1047297les were merged with locations of urban and open water in 1990 (using data from the USGS national
land cover database) and used to create the exclusionary layer for
the simulation Areas that were not located within the training area
were set to ldquono datardquo in ArcGIS Data from the US census bureau for
places is distributed as point location data We used the point lo-
cations (the centroid of a town city or village) to construct Thiessen
polygons representing the area closest to a particular urban center
(Fig 10) Each place was labeled with the FIPS designated census
place value
We executed the national LTM-HPC at three different spatial
scales and using two different kinds of spatial units ( Tayyebi et al
2012) government and 1047297xed-size tiles The three scales for our
government unit simulations were national county and places
(cities villages and towns)
All input maps were created at a national scale at 30m cell
resolution For training data were subset using ArcGIS on the local
computer workstation and pattern 1047297les created for training and
1047297rst phase testing (ie calibration) We also used the LTM-clippy
Python script to create subsamples for second phase testing In-
puts and the exclusionary maps were clipped by census place and
then written out as ASC 1047297les The createpat 1047297le was executed per
census place to convert the 1047297les from ASC to PAT
44 Pattern recognition simulations for national model
We presented a training 1047297le with 284477 cases (ie records or
locations) to the neural network using a feedforward back propa-
gation algorithm We followed the MSE during training saving this
value every 100 cycles We found that the minimum MSE stabilized
globally at 49500 cycles The SNNS network 1047297le (NET 1047297le) was
produced every 100 cycles so that we could analyze the training
later but the network 1047297le for 49500 cycles was saved and used to
estimate output (iepotential fora land usechange to occur at each
location) for testing
Testing occurred at the scale of tiles The LTM-clippy script was
used to create testing pattern 1047297les for each of the 634 tiles The
ltm49500nNET 1047297le was applied to each tile PAT 1047297le to create an
RES 1047297le for each tile RES 1047297les contain estimates of the potential
for each location to change land use (values 00 to 10 where closer
Fig 10 Spatial units involved in the LTM-HPC national simulation
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 261
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1319
to 10 means higher chance of changing) The RES 1047297les are con-
verted to ASC 1047297les using a C program called convert2asciiexe The
ASC probability maps for all tiles were mosaicked to a national
raster 1047297le using an ArcGIS python script All original values which
range from 00 to 10 are multiplied by 100000 by convert2ascii so
that they may be stored as double precision integer
We used three-digit codes as unique numbers for naming tile
1047297les and tracking them as tasks within states on HPC (634 grids
in conterminous of USA) Each tile contained a maximum of
4000 rows and 4000 columns of 30m pixels We were able to do
this because the steps leading up to prediction work on a per
pixel basis and thus the processing unit did not affect the output
value
45 Calibration of the national simulation
We trained on six neural network versions of the model one
that contained 1047297ve input variables and 1047297ve that contained four
input variables each where we dropped out one input variable from
the full input model We saved the MSE at each 100 cycles through
100000 cycles and then calculated the percent difference of MSE
from the full input variable model (Fig 11) Note that all of thevariables have a positive contribution to model goodness of 1047297t
during training distance to highways provides the neural network
with the most information necessary for it to 1047297t input and output
data This plot also illustrates how the neural network behaves
between 0 cycles and approximately cycle 23000 the neural
network makes large adjustments in weights and values for acti-
vation function and biases At one point around 7000 cycles the
model does better (ie percentage difference in MSE is negative)
without distance to streams as an input to the training data
Eventually all drop one out models stabilize near 50000 which is
where the full 1047297ve-variable model also stabilizes At this number of
training cycles distance to highway contributes about 2 of the
goodness of 1047297t distance to urban about 15 slope about 12 and
distance to road and distance to streams each about 07 Weconclude from this drop one out calibration that (1) all 1047297ve
variables contribute in a positive way toward the goodness of 1047297t
and (2) that 49500 cycles provide enough learning of the full 1047297ve-
variable model to use for validation
The second step of calibration is to examine how well the model
produces spatial maps of change compared to the observed data
(eg Fig 5A) We use the locations of observed change from the
training map that are outside the training locations to create a
01234-coded calibration map The XML_Clip_BASE HPC jobs 1047297le
was modi1047297ed to receive the 01234-coded calibration map and
general statistics (eg percentage of each value) are created for the
entire simulation domain and for smaller subunits (eg spatial
units)
46 Validation of the national model
We used the 2006 NLCD urban map and a 2006 forecast map
from the LTM to create a 01234-coded validation map that was
assessed for goodness of 1047297t in several ways two of which are
presented here The 1047297rst goodness of 1047297t metric examined how
well the model predicted the correct number of urban cells per
simulation tile This analysis was not computationally rigorous so
this assessment was performed on the single workstation Weused ArcGIS 100 TabluateArea command in Spatial Analyst to
calculate the amount of area for each of the codes 0 1 2 and 3
The percentage of the amount of urban correctly predicted was
then mapped (Fig 12A) Note that the model predicted the cor-
rect amount of urban cells in most simulation tiles Only a few
along coastal areas contained errors in quantity of urban greater
than 5
The second goodness of 1047297t assessment highlights the use of the
HPC to calculate a computationally rigorous calculation that char-
acterizes location error The XML_Clip_BASE jobs 1047297le was modi1047297ed
to receive the 01234-coded validation map at the spatial unit of
tiles The XML_Scaleable jobs 1047297le was used to execute the scaleable
window routine for each tile from a 3 3 window size through
101
101 window size The percent correct metric was saved at the10 10 window size (ie 3 km by 3 km) and PCM values merged
Fig 11 Drop one out percent difference MSE from full driver model
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268262
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1419
with the shape 1047297le for tiles Note (Fig12B) that the model goodness
of 1047297t is best east of the Mississippi River along the west coast and in
certain areas of the central United States where there are large
metropolitan cities (eg Denver) Improvement of the model thus
needs to concentrate on rural areas of the central and western
portions of the United States Similar maps are often constructed
for different window sizes to determine if scale of prediction
changes spatially
47 Forecasting
Forecasting requires merging the suitability map and the
quantity model We used several XML jobs 1047297les to construct the
forecast maps at the national scale We developed our quantity
model (Tayyebi et al 2012) that contained the number of urban
cells to grow for each polygon for 10-year time steps from 2010 to
2060 We considered each state as a job and including all the
Fig 12 Validation metrics of (A) quantity errors and (B) model goodness of 1047297t (PCM) for scaleable window size of 3 3 km
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 263
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1519
polygons within the state as different tasks to create forecast maps
of each polygon We embedded the path of prediction code and
number of urban cells to grow for each polygon within
XML_Pred_BASE job 1047297le We ran XML_Pred_BASE job 1047297le for each
state on HPC to convert the probability map to forecast map for
each polygon Then we ran the Mosaic_Python script on the HPC
using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at
the polygon level to create forecast maps at state levelSimilarly we
ran Mosaic_Python script on HPC using XML_Pred_Mosaic_Na-
tional to mosaic prediction pieces at state level to create a national
forecast map HPC also enabled us to export error messages in error
1047297les so that if any of tasks fail in a job using standard out and
standard error 1047297les to have records of program did during execu-
tion We also embedded the path of standard out and standard
error 1047297les in the tasks of the XML jobs 1047297le
Wegenerated decadal maps of land use from 2010 through 2050
from this simulation Maps of new urban (red) superimposed on
2006 land usecover from the USGS National Land Cover Maps for
eight regions are shown in Fig 13 Note that the model produces
different spatial patterns of urbanization depending on the loca-
tion urbanization in the Los AngeleseSan Diego region are more
clumped likely to due to topographic limitations of the area in the
large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast
5 Discussion
We presented an overview of the conversion of a single work-
station land change model that has been converted to operate using
a high performance computer cluster The Land Transformation
Model was originally developedto simulatesmall areas (Pijanowski
et al 2000 2002b) such as watersheds However there is a need
for larger sized land change models especially those that can be
coupled to large-scale process models such as climate change (cf
Olson et al 2008 Pijanowski et al 2011) and dynamic hydrologic
models (Yang et al 2010 Mishra et al 2010) We have argued that
to accomplish the goal of increasing the size of the simulationseveral challenges had to be overcome These included (1) pro-
cessing of large databases (2) the management of large numbers of
1047297les (3) the need for a high-level architecture that integrates
model components (4) error checking and (5) the management of
multiple job executions Here we brie1047298y discuss how we addressed
these challenges as well as lessons learned in porting the original
LTM to an HPC environment
51 Challenges of executing large-scale models
We found that the large datasets used for input and output were
dif 1047297cult to manage successfully within ArcGIS 100 The 1047297les had to
be managed as smaller subsets either as states or regions (ie
multiple states) and in the case of Texas we had to manage thisstate as separate counties Programs written in C had to read and
write lines of data at a time rather than read large 1047297les into a large
array This is needed despite a large amount of memory contained
in the HPC
The large number of 1047297les were managed using a standard 1047297le
naming coding system and hierarchical arrangement of folders on
our storage server The coding system also helped us to construct
the xml 1047297le content used by the job manager in Windows 2008
Server R2
The high-level architecture was designed after the proper steps
that have been outlined by prominent land change modeling sci-
entists (Pontius et al 2004 Pontius et al 2008) These include
steps for (1) data sampling from input 1047297les (2) training (3) cali-
bration (4) validation and (5) application Job 1047297
les were
constructed for steps the interfaced each of these modeling steps
In fact we found quickly discovered that the most logical directory
structure mirrored the high-level architecture of the model
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job failed, indicating that they could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files failed during execution. (2) A node assigned to the job or task could not be contacted; jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired; the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed; a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output and error files and the working directory locations.
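Because each task writes its own standard error file, a failed job can be triaged by scanning those files rather than re-running the whole job. The short sketch below illustrates this kind of check; the log directory layout is a hypothetical placeholder.

import glob
import os

def failed_tasks(log_dir):
    # Return the names of tasks whose standard error files are non-empty,
    # i.e. candidates for inspection and resubmission.
    failures = []
    for err_file in glob.glob(os.path.join(log_dir, "*.err")):
        if os.path.getsize(err_file) > 0:            # something was written to stderr
            failures.append(os.path.splitext(os.path.basename(err_file))[0])
    return failures

print(failed_tasks(r"\\storage\ltm\logs\IN"))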
5.2 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic them manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to mosaic a few areas manually because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
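A sketch of such a mosaic script is shown below, built around the standard arcpy mosaic geoprocessing tool; the workspace, file names, pixel type and cell size are illustrative placeholders rather than our production settings.

import arcpy

# Merge per-state probability rasters into one national raster.
arcpy.env.workspace = r"\\storage\ltm\probability"     # folder of state rasters (placeholder)
state_rasters = arcpy.ListRasters("*_prob.tif")

arcpy.MosaicToNewRaster_management(
    input_rasters=state_rasters,
    output_location=r"\\storage\ltm\national",
    raster_dataset_name_with_extension="usa_prob.tif",
    pixel_type="32_BIT_UNSIGNED",
    cellsize=30,                                        # 30 m resolution
    number_of_bands=1,
    mosaic_method="MAXIMUM",                            # keep the larger value on overlaps
)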
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores; deviations from this ideal scaling are caused by variation in file size across spatial units. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of the various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
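As a rough, purely illustrative sketch of this scaling behavior (the numbers below are not measured benchmarks):

def estimated_runtime(serial_hours, cores, imbalance=1.2):
    # Ideal runtime halves with each doubling of cores; an imbalance factor
    # stands in for unevenly sized tile files. All values are illustrative.
    return serial_hours / cores * imbalance

for cores in (1, 2, 4, 8, 16):
    print(cores, round(estimated_runtime(160.0, cores), 1), "hours")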
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3 Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output at multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents; simulations run at large extents but coarse resolution, or alternatively at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g. Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002), how patterns and amounts of certain land covers (e.g. forests, urban) affect invasive species spread and distributions (e.g. Sharov et al., 1999; Fei et al., 2008), how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g. Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011), how certain land uses such as urban and agriculture increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b), and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of environmental impact models that are spatially explicit. We intend to conduct that work in the HPC environment using the principles that we outline above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g. shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and threatened (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common, supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lake Ecosystem metrics of ecosystem services (SOLEC; Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4 Calibration and validation of big data simulations
We presented a preliminary assessment of model goodness of fit for the LTM-HPC simulations (e.g. Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e. temporal measures of model goodness of fit), (4) the relative cost of an error (i.e. whether an error of location is important to the application), and measures of data input quality. We were able to show that, at 3 km scales, the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, whereas the pattern of location errors differed from that of quantity errors, being lower in the east (Fig. 12). The location of errors could be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tend to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed with big data simulations such as the one presented here in mind; thus more research in this area is needed to make a full assessment of how well a model like this performs.
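One simple building block for such an assessment is the coarse-window comparison summarized in Fig. 12: observed and predicted change rasters are aggregated into blocks (for example, 100 x 100 cells, roughly 3 km at 30 m resolution) and the quantity error is computed per block. The sketch below is a generic numpy illustration of that idea, not the exact routine used for the LTM-HPC.

import numpy as np

def window_quantity_error(observed, predicted, win=100):
    # Difference in the count of change cells per win-by-win block;
    # positive values indicate over-prediction, negative under-prediction.
    rows = observed.shape[0] // win * win        # trim to a whole number of blocks
    cols = observed.shape[1] // win * win
    obs = observed[:rows, :cols].reshape(rows // win, win, cols // win, win)
    pred = predicted[:rows, :cols].reshape(rows // win, win, cols // win, win)
    return pred.sum(axis=(1, 3)) - obs.sum(axis=(1, 3))

rng = np.random.default_rng(0)
obs = (rng.random((400, 400)) < 0.05).astype(int)     # toy observed change map
pred = (rng.random((400, 400)) < 0.05).astype(int)    # toy predicted change map
print(window_quantity_error(obs, pred, win=100))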
6 Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county or any other division, using a unique identity for each unit, within the Python environment.
(2) Better memory usage: the source code for the model in the C environment has been changed, making the calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array (see the sketch after this list).
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
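A sketch of the line-by-line style of processing referred to in item (2) is given below, written here in Python for illustration rather than in the model's own source code; the standard ESRI ASCII grid layout (a six-line header followed by one raster row per line) is assumed, and the file names and transform are placeholders.

def process_ascii_grid(in_path, out_path, transform):
    # Stream an ESRI ASCII grid one row at a time so memory use does not
    # depend on file size.
    with open(in_path) as src, open(out_path, "w") as dst:
        for _ in range(6):                      # copy the 6-line ESRI ASCII header
            dst.write(src.readline())
        for line in src:                        # one raster row per line
            values = (transform(float(v)) for v in line.split())
            dst.write(" ".join(str(v) for v in values) + "\n")

# Example use: rescale 0-1 probabilities to integers (illustrative transform).
process_ascii_grid("tile_042_prob.asc", "tile_042_scaled.asc",
                   lambda p: int(round(p * 100000)))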
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alargaswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. Neural Networks (IJCNN), 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.
Wegenerated decadal maps of land use from 2010 through 2050
from this simulation Maps of new urban (red) superimposed on
2006 land usecover from the USGS National Land Cover Maps for
eight regions are shown in Fig 13 Note that the model produces
different spatial patterns of urbanization depending on the loca-
tion urbanization in the Los AngeleseSan Diego region are more
clumped likely to due to topographic limitations of the area in the
large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast
5 Discussion
We presented an overview of the conversion of a single work-
station land change model that has been converted to operate using
a high performance computer cluster The Land Transformation
Model was originally developedto simulatesmall areas (Pijanowski
et al 2000 2002b) such as watersheds However there is a need
for larger sized land change models especially those that can be
coupled to large-scale process models such as climate change (cf
Olson et al 2008 Pijanowski et al 2011) and dynamic hydrologic
models (Yang et al 2010 Mishra et al 2010) We have argued that
to accomplish the goal of increasing the size of the simulationseveral challenges had to be overcome These included (1) pro-
cessing of large databases (2) the management of large numbers of
1047297les (3) the need for a high-level architecture that integrates
model components (4) error checking and (5) the management of
multiple job executions Here we brie1047298y discuss how we addressed
these challenges as well as lessons learned in porting the original
LTM to an HPC environment
51 Challenges of executing large-scale models
We found that the large datasets used for input and output were
dif 1047297cult to manage successfully within ArcGIS 100 The 1047297les had to
be managed as smaller subsets either as states or regions (ie
multiple states) and in the case of Texas we had to manage thisstate as separate counties Programs written in C had to read and
write lines of data at a time rather than read large 1047297les into a large
array This is needed despite a large amount of memory contained
in the HPC
The large number of 1047297les were managed using a standard 1047297le
naming coding system and hierarchical arrangement of folders on
our storage server The coding system also helped us to construct
the xml 1047297le content used by the job manager in Windows 2008
Server R2
The high-level architecture was designed after the proper steps
that have been outlined by prominent land change modeling sci-
entists (Pontius et al 2004 Pontius et al 2008) These include
steps for (1) data sampling from input 1047297les (2) training (3) cali-
bration (4) validation and (5) application Job 1047297
les were
constructed for steps the interfaced each of these modeling steps
In fact we found quickly discovered that the most logical directory
structure mirrored the high-level architecture of the model
We experienced that jobs or tasks can fail because of one of the
following errors (1) one or more tasks in the job have failed This
indicates that one or more tasks could not be run or did not com-
plete successfully We speci1047297ed standard output and standard error
1047297les in the job description to determine which executable 1047297les fail
during execution (2) A nodeassignedto the job ortaskcouldnot be
contacted Jobs or tasks that fail because of a node falling out of
contact are automatically retried a certain numberof times but will
eventually failif the problem continues (3) The run time for a job or
task run expired The job scheduler service cancels jobs or tasks
that reach the end of their run time (4) A 1047297le location required by
the job or task could not be accessed A frequent cause of task
failures is inaccessibility of required 1047297le locations including the
standard input output and error 1047297les and the working directory
locations
52 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simula-
tion meant that simple folder structure were only needed whichmade it easy to mosaic manually However the prediction output
was stored by state and by year which made mosaicking a time
consuming and an error prone process in some cases we needed
to manually mosaic a few areas as the job manager would crash
The HPC was employed to speed up the mosaicking process but this
was not a fail-safe process A short python script that ran the ArcGIS
mosaic raster tool was the heart of the process The 9000 network1047297les of LTM-HPC generated from training run were applied to each
pattern 1047297le derived from boxes that contained all of the cells in the
USA except those within the exclusionary zone Finally states were
manually mosaicked to create the national probability map for USA
(Fig 13)
A windows HPC cluster was used to decrease the time required
to process the data by running the model on multiple spatial unitssimultaneously The time required to run LTM can be thought of as
the time it would take to run the LTM-HPC serially When running
the LTM-HPC the amount of time required relative to LTM is
approximately halved for every doubling of cores This variance in
processing time is caused by variance in 1047297le size The HPC also
provides additional bene1047297ts to researchers who are interested in
running large-scale models These include the reduction in the
need for human control of various steps which thereby reduces the
changes of human error It also allows researchers to execute the
model in a variety of con1047297gurations (eg here we were able to run
the model using different spatial units testing issues related to
scale) allowing for researchers to run ensembles
We also found that developing and executing the model across
three computer systems (data storage data processing and codingand simulation) worked well Delegating tasks to each of these
helped to manage work 1047298ow and optimize the purpose of each
computer system
53 Needs for land change model forecasts at large extents and 1047297ne
resolution
Models that must simulate large areas at 1047297ne resolutions and
produce output that has multiple time steps require the handling
and management of big data Environmental simulations have
traditionally focused on small spatial extents at 1047297ne resolutions to
produce the required output However environmental problems
are often at large extents and coarse resolution simulations or
alternatively at small extents and 1047297
ne resolutions may hinder the
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268264
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1619
ability to assess impacts at the necessary scale Land change models
are often used to assess how human use of the land may impact
ecosystem health It is well known that land use cover change
impacts ecosystem processes at a variety of spatial scales ( Reid
et al 2010 GLP 2005 Lambin and Geist 2006) Some of the
most frequently cited ecosystem impacts include how land use
change at large extents affect the total amount of carbon
sequestered in aboveground plants and soils in a region (eg Dixon
et al 1994 Post and Kwon 2000 Cox et al 2000 Vleeshouwers
and Verhagen 2002 Guo and Gifford 2002) how patterns and
amounts of certain land covers (eg forests urban) affect invasive
species spread and distributions (eg Sharov et al 1999 Fei et al
2008) how land surface properties feedback to the atmosphere
through alterations of water and energy 1047298
uxes (eg Dale 1997
Fig 13 LTM 2050 urban change forecasts for different regions (For interpretation of the references to color in this 1047297gure legend the reader is referred to the web version of this
article)
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 265
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1719
Pielke 2005 Bonan 2008 Pijanowski et al 2011) how certain
land uses such as urban and agriculture increase nutrients and
pollutants to surface and ground water bodies (Pijanowski et al
2002b Tang et al 2005ab) and how land use patterns affect
biodiversity of terrestrial (Pekin and Pijanowski 2012) and aquatic
ecosystems such as freshwater 1047297sh organisms (Wiley et al 2010)
In all cases more urban decreases ecosystem health (cf Pickett
et al 1997 Reid et al 2010 Grimm and Redman 2004 Kaye
et al 2006)
Assessment of land use change impacts has often occurred by
coupling land change models to other environmental models For
example the LTM has been coupled to the Regional Atmospheric
Modeling Systems (RAMS) to assess how land use change might
impact precipitation patterns at subcontinental scales in East Africa
(Moore et al 2010) coupled to the Variable Impact Calculator (VIC)
model in the Great Lakes basin (Yang 2011) to the Long-Term Hy-
drologic Impact Assessment (L-THIA) model assess how land use
change might impact overland 1047298ow patterns in large regional wa-
tersheds and how nutrient1047298uxes and pollutants from urban change
would impact stream ecosystem health in large watersheds (Tang
et al 2005ab) The next step in our development will be to
couple the output of this model to a variety of environmental
impact models that are spatially explicit We intend to conduct thatwork in the HPC environment using the principles that we outline
above
The LTM-HPC model presented here can also be modi1047297ed to
address other land change transitions For example the LTM-
HPC can be con1047297gured to simulate multiple transitions at a
time this might include the loss of urban along with urban gain
(which is simulated here) or include the numerous land tran-
sitions common to many areas of the world namely the loss of
natural lands like forests to agriculture the shift of agriculture
to urban the loss of forests to urban and the transition of
recently disturbance areas (eg shrubland) to more mature
natural lands like forests To accomplish multiple transitions a
variety of rules need to be explored further to determine how
they would be applied to the model It is also quite possible thatsuch a large may be heterogeneous several transition rules may
need to be applied in the same simulation with rules applied to
areas based on another higher-level rule The LTM-HPC could
also be con1047297gured to simulate subclasses of land use following
Dietzel and Clarke (2006) For example within the urban class
parking lots in the United States cover large extents but are
relatively small areas (Davis et al 2010ab) such an application
could require the LTM-HPC because 1047297ne resolutions would be
needed Likewise simulating crop cover types annually at a
national scale (cf Plourde et al 2013) could product a consid-
erable amount of temporal information At a global scale we
have found that subclasses of land usecover in1047298uence species
diversity patterns especially those vertebrates that are rare and
of a threatened category (Pekin and Pijanowski 2012) and soglobal scale simulations are likely to need models like LTM-HPC
The LTM-HPC could also support national or regional scale
environmental programmatic assessments that are becoming more
common supported by national government agencies These
include the 2013 United States National Climate Assessment Pro-
gram (USGCRP 2013) National Ecological Observation Network
(NEON) supported in the United States by the National Science
Foundation (Schimel et al 2007 Kampe et al 2010) and the Great
Lakes RestorationInitiative which seeks to develop State of the Lake
Ecosystem metrics of ecosystem services (SOLEC Bertram et al
2003 WHCEC 2010) In Europe an EU15 set of land use forecasts
have been used extensively to study the impacts of land use and
climate change on this continentrsquos ecosystem services (cf
Rounsevell et al 2006)
54 Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness
of 1047297t for the LTM-HPC simulations (eg Fig 12) A rigorous assess-
ment would require more effort placed on (1) 1047297ne resolution ac-
curacy (2) a quanti1047297cation of the variability of 1047297ne resolution
accuracy across the entire simulation (3) errors associated with
forecasting (ie temporal measures of model goodness of 1047297t) (4)
the relative cost of an error (ie whether an error of location is
important to the application) and measures of data input quality
We were able to show that at 3 km scales the error of location
varied considerably across the simulation domain Errors were
greater in the eastern portion of the United States for quantity
Patterns of error were different from quantity they were lower in
the eastern for quantity (Fig 12) Location of errors could be
important too if they affect the policies or outcomes of environ-
mental assessment If policies are being explored to determine the
impact of land use change in stream riparian zones model location
accuracy needs to be good along streams If environmental impacts
are being assessed then covariates such as soil which tends to be
spatially heterogeneous needs to be taken into consideration
Current model goodness of 1047297t metrics have not been designed to
consider large big data simulationssuch as the one presented herethus more research in this area is needed to make a full assessment
of how well a model like this performs
6 Conclusions
This paper presents the application of the LTM-HPC at multi-
scale using quantity drivers (a 1047297ne-scale urban land use change
model applied across the conterminous of USA) and introduces a
new version of LTM with substantially augmented functionality We
described a parallel implementation of the data and modeling
process on a cluster of multi-core processors using HPC as a data-
parallel programming framework We focus on ef 1047297ciently
handling the challenges raised by the nature of large datasets and
show how they can be addressed effectively within the computa-tion framework by optimizing the computation to adapt to the
nature of the data We signi1047297cantly enhance the training and
testing run of the LTM and enable application of the model for
region scale such as continent Future research will also be able to
use the new information generated by the LTM-HPC to address
questions related to how urban patterns relate to the process of
urban land use change Because we were able to preserve the high-
resolution of the land usedata (30m resolution) LTM-HPC provided
the capability of visualizing alternative future scenarios at a
detailed scale which helped to engage urban planner in the sce-
nario development process We believe this project represents an
important advancement in computational modeling of urban
growth patterns In terms of simulation modeling we have pre-
sented several new advancements in the LTM modelrsquos performance
and capabilities More importantly however this project repre-
sents a successful broad-scale modeling framework that has direct
applications to land use management
Finally we found that the LTM-HPC has some signi1047297cant ad-
vantages over the single workstation version of the LTM These
include
(1) Automated data preparation data can now be clipped and
converted to ASCII format automatically at the state county
or any other division using unique identity for the unit in
Python environment
(2) Better memory usage The source code for the model in C
environment has been changed making calculations per-
formedby LTM-HPCcompletelyindependent from the size of
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268266
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1819
the ASCII 1047297les by reading each code line separately into an
array using the C environment
(3) Ability to conduct simultaneous analyses LTM was not
designed to be used for different regions at the same time
LTM-HPC now uses a unique code for different regions in
XML format and can repeat all the processes simultaneously
for different regions
(4) Increased processing speed The previous version of LTM had
many disconnected steps (Pijanowski et al 2002a) which
were carried out sequentially using different DOS-level
commands All XML 1047297les are now uploaded into an HPC
environment and all modeling steps are automatically
processed
References
Adeloye AJ Rustum R Kariyama ID 2012 Neural computing modeling of thereference crop evapotranspiration Environ Model Softw 29 61e73
Anselme B Bousquet F Lyet A Etienne M Fady B Le Page C 2010 Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France)Environ Model Softw 25 (11) 1385e1398
Bennett ND Croke BFW Guariso G Guillaume JHA Hamilton SH Jakeman AJ Marsili-Libelli S Newhama LTH Norton JP Perrin CPierce SA Robson B Seppelt R Voinov AA Fath BD Andreassian V 2013Environ Model Softw 40 1e20
Bertram P Stadler-Salt N Horvatin P Shear H 2003 Bi-national assessment of the Great Lakes SOLEC partnerships Environ Monit Assess 81 (1e3) 27e33
Bishop CM 1995 Neural Networks for Pattern Recognition Oxford UniversityPress Oxford
Bishop CM 2005 Neural Networks for Pattern Recognition Oxford UniversityPress ISBN 0-19-853864-2
Bonan GB 2008 Forests and climate change forcings feedbacks and the climatebene1047297ts of forests Science 320 (5882) 1444e1449
Boutt DF Hyndman DW Pijanowski BC Long DT 2001 Identifying potentialland use-derived solute sources to stream base1047298ow using ground water modelsand GIS Ground Water 39 (1) 24e34
Burton A Kilsby C Fowler H Cowpertwait P OrsquoConnell P 2008 RainSim aspatial-temporal stochastic rainfall modeling system Environ Model Softw 23(12) 1356e1369
Buyya R (Ed) 1999 High Performance Cluster Computing Architectures andSystems vol 1 Prentice Hall Englewood Cliffs NJ
Carpani M Bergez JE Monod H 2012 Sensitivity analysis of a hierarchicalqualitative model for sustainability assessment of cropping systems EnvironModel Softw 27e28 15e22
Chapman T 1998 Stochastic modelling of daily rainfall the impact of adjoiningwet days on the distribution of rainfall amounts Environ Model Softw 13 (3e4) 317e324
Cheung AL Reeves Anthony P 1992 High performance computing on a cluster of workstations HPDC 1992 152e160
Clarke KC Gazulis N Dietzel C Goldstein NC 2007 A decade of SLEUTHing lessons learned from applications of a cellular automatonland use change model In Classics from IJGIS Twenty Years of theInternational Journal of Geographical Information Systems and Sciencepp 413e425
Cox PM Betts RA Jones CD Spall SA Totterdell IJ 2000 Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate modelNature 408 (6809) 184e187
Dale VH 1997 The relationship between land-use change and climate changeEcol Appl 7 (3) 753e769
DavisAYPijanowskiBCRobinson KD KidwellPB 2010aEstimating parkinglot
footprintsin theUpperGreatLakes regionof theUSA Landsc UrbanPlan 96 (2)68e77
Davis AY Pijanowski BC Robinson K Engel B 2010b The environmental andeconomic costs of sprawling parking lots in the United States Land Use Policy27 (2) 255e261
Denoeux T Lengelleacute1993 Initializing back propagation networks with prototypesNeural Netw 6 351e363
Dietzel C Clarke K 2006 The effect of disaggregating land use categories incellular automata during model calibration and forecasting Comput EnvironUrban Syst 30 (1) 78e101
Dietzel C Clarke KC 2007 Toward optimal calibration of the SLEUTH land usechange model Trans GIS 11 (1) 29e45
Dixon RK Winjum JK Andrasko KJ Lee JJ Schroeder PE 1994 Integratedland-use systems assessment of promising agroforest and alternative land-usepractices to enhance carbon conservation and sequestration Clim Change 27(1) 71e92
Dlamini W 2008 A Bayesian belief network analysis of factors in1047298uencing wild1047297reoccurrence in Swaziland Environ Model Softw 25 (2) 199e208
ESRI 2011 ArcGIS 10 Software
Fei S Kong N Stinger J Bowker D 2008 In Ravinder K Jose S Singh HBatish D (Eds) Invasion Pattern of Exotic Plants in Forest Ecosystems InvasivePlants and Forest Ecosystems CRC Press Boca Raton FL pp 59e70
Fitzpatrick M Long D Pijanowski B 2007 Biogeochemical 1047297ngerprints of landuse in a regional watershed Appl Biogeochem 22 1825e1840
Foster I Kesselman C 1997 Globus a metacomputing infrastructure toolkit Int JSupercomput Appl 11 (2) 115e128
Foster DR Hall B Barry S Clayden S Parshall T 2002 Cultural environmentaland historical controls of vegetation patterns and the modern conservationsetting on the island of Martha rsquos Vineyard USA J Biogeogr 29 1381e1400
GLP 2005 Science Plan and Implementation Strategy IGBP Report No 53IHDPReport No 19 IGBP Secretariat Stockholm 64 pp
Grimm NB Redman CL 2004 Approaches to the study of urban ecosystems thecase of Central ArizonadPhoenix Urban Ecosyst 7 (3) 199e213
Guo LB Gifford RM 2002 Soil carbon stocks and land use change a metaanalysis Glob Change Biol 8 (4) 345e360
Herold M Goldstein NC Clarke KC 2003 The spatiotemporal form of urbangrowth measurement analysis and modeling Remote Sens Environ 86 (3)286e302
Herold M Couclelis H Clarke KC 2005 The role of spatial metrics in the analysisand modeling of urban land use change Comput Environ Urban Syst 29 (4)369e399
Hey AJ 2009 The Fourth Paradigm Data-intensive Scienti1047297c Discovery Jacobs A 2009 The pathologies of big data Commun ACM 52 (8) 36e44Kampe TU Johnson BR Kuester M Keller M 2010 NEON the 1047297rst continental-
scale ecological observatory with airborne remote sensing of vegetation canopybiochemistry and structure J Appl Remote Sens 4 (1) 043510e043510
Kaye JP Groffman PM Grimm NB Baker LA Pouyat RV 2006 A distincturban biogeochemistry Trends Ecol Evol 21 (4) 192e199
Kilsby C Jones P Burton A Ford A Fowler H Harpham C James P Smith AWilby R 2007 A daily weather generator for use in climate change studiesEnviron Model Softw 22 (12) 1705e1719
Lagabrielle E Botta A Dareacute W David D Aubert S Fabricius C 2010 Modelingwith stakeholders to integrate biodiversity into land-use planning lessonslearned in Reacuteunion Island (Western Indian Ocean) Environ Model Softw 25(11) 1413e1427
Lambin EF Geist HJ (Eds) 2006 Land Use and Land Cover Change Local Pro-cesses and Global Impacts Springer
LaValle S Lesser E Shockley R Hopkins MS Kruschwitz N 2011 Big data analytics and the path from insights to value MIT Sloan Manag Rev 52 (2)21e32
Lei Z Pijanowski BC Alexandridis KT Olson J 2005 Distributed modelingarchitecture of a multi-agent-based behavioral economic landscape (MABEL)model Simulation 81 (7) 503e515
Loepfe L Martiacutenez-Vilalta J Pintildeol J 2011 An integrative model of humanin1047298uenced 1047297re regimes and landscape dynamics Environ Model Softw 26 (8)1028e1040
Lynch C 2008 Big data how do your data grow Nature 455 (7209) 28e
29Mas JF Puig H Palacio JL Sosa AA 2004 Modeling deforestation using GIS andarti1047297cial neural networks Environ Model Softw 19 (5) 461e471
MEA Millennium Ecosystem Assessment 2005 Ecosystems and Human Well-being Current State and Trends Island Press Washington DC
Merritt WS Letcher RA Jakeman AJ 2003 A review of erosion and sedimenttransport models Environ Model Softw 18 (8e9) 761e799
Mishra V Cherkauer K Niyogi D Ming L Pijanowski B Ray D Bowling L2010 Regional scale assessment of land useland cover and climatic changes onsurface hydrologic processes Int J Climatol 30 2025e2044
Moore N Torbick N Lofgren B Wang J Pijanowski B Andresen J Kim DOlson J 2010 Adapting MODIS-derived LAI and fractional cover into theRegional Atmospheric Modeling System (RAMS) in East Africa Int J Climatol30 (3) 1954e1969
Moore N Alargaswamy G Pijanowski B Thornton P Lofgren B Olson JAndresen J Yanda P Qi J 2011 East African food security as in1047298uenced byfuture climate change and land use change at local to regional scales ClimChange httpdxdoiorg101007s10584-011-0116-7
Olson J Alagarswamy G Andresen J Campbell D Davis A Ge J Huebner M
Lofgren B Lusch D Moore N Pijanowski B Qi J Thornton P Torbick NWang J 2008 Integrating diverse methods to understand climate-land in-teractions in east Africa GeoForum 39 (2) 898e911
Pekin BK Pijanowski BC 2012 Global land use intensity and the endangermentstatus of mammal species Divers Distrib 18 (9) 909e918
Peralta J Li X Gutierrez G Sanchis A 2010 July Time series forecasting byevolving arti1047297cial neural networks using genetic algorithms and differentialevolution Neural Netw e IJCNN 1e8
Peacuterez-Vega A Mas JF Ligmann A 2012 Comparing two approaches to land usecover change modeling and their implications for the assessment of biodiver-sity loss in a deciduous tropical forest Environ Model Softw 29 11e23
Pickett ST Burch Jr WR Dalton SE Foresman TW Grove JM Rowntree R1997 A conceptual framework for the study of human ecosystems in urbanareas Urban Ecosyst 1 (4) 185e199
Pielke RA 2005 Land use and climate change Science 310 (5754) 1625e1626Pijanowski BC Long DT Gage SH Cooper WE 1997 June A Land Trans-
formation Model Conceptual Elements Spatial Object Class Hierarchies GISCommand Syntax and an Application for Michiganrsquos Saginaw Bay Watershed InSubmitted to the Land Use Modeling Workshop USGS EROS Data Center
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1419
with the shape 1047297le for tiles Note (Fig12B) that the model goodness
of 1047297t is best east of the Mississippi River along the west coast and in
certain areas of the central United States where there are large
metropolitan cities (eg Denver) Improvement of the model thus
needs to concentrate on rural areas of the central and western
portions of the United States Similar maps are often constructed
for different window sizes to determine if scale of prediction
changes spatially
47 Forecasting
Forecasting requires merging the suitability map and the quantity model. We used several XML job files to construct the forecast maps at the national scale. Our quantity model (Tayyebi et al., 2012) contained the number of urban cells to grow for each polygon for 10-year time steps from 2010 to 2060. We treated each state as a job, with all the polygons within the state as separate tasks, to create forecast maps for each polygon. We embedded the path to the prediction code and the number of urban cells to grow for each polygon within the XML_Pred_BASE job file. We ran the XML_Pred_BASE job file for each state on the HPC to convert the probability map into a forecast map for each polygon. Then we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at the polygon level into forecast maps at the state level. Similarly, we ran the Mosaic_Python script on the HPC using XML_Pred_Mosaic_National to mosaic the prediction pieces at the state level into a national forecast map. The HPC also enabled us to export error messages to error files, so that if any task in a job failed, the standard out and standard error files provided a record of what the program did during execution. We also embedded the paths of the standard out and standard error files in the tasks of the XML job files.
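The job description files themselves are not reproduced here. As a rough illustration of the pattern just described, the following Python sketch builds a simplified per-state job file in which each polygon becomes one task, with the path to the prediction executable, the number of urban cells to grow, and the standard out/error file paths embedded in the task. The element and attribute names loosely follow the layout of Windows HPC job description files but are illustrative rather than exact, and LTMPredict.exe, the folder layout, and the cells-to-grow values are hypothetical.

import xml.etree.ElementTree as ET

def build_state_job(state, polygons, root_dir):
    # One job per state; one task per polygon, as in the XML_Pred_BASE files.
    job = ET.Element("Job", Name="Pred_" + state)
    tasks = ET.SubElement(job, "Tasks")
    for poly_id, cells_to_grow in polygons:
        work = "%s/%s/poly_%05d" % (root_dir, state, poly_id)
        ET.SubElement(
            tasks, "Task",
            Name="poly_%05d" % poly_id,
            # Path to the prediction code and the number of urban cells to
            # grow for this polygon are embedded in the command line.
            CommandLine="LTMPredict.exe prob_%05d.asc %d" % (poly_id, cells_to_grow),
            WorkDirectory=work,
            # Standard out/error paths are recorded so failed tasks can be traced.
            StdOutFilePath=work + "/stdout.txt",
            StdErrFilePath=work + "/stderr.txt")
    return ET.ElementTree(job)

# Example: a job for one state with two (hypothetical) polygons.
build_state_job("IN", [(1, 5400), (2, 12750)], "D:/ltm_hpc/pred").write("XML_Pred_IN.xml")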
We generated decadal maps of land use from 2010 through 2050 from this simulation. Maps of new urban (red) superimposed on 2006 land use/cover from the USGS National Land Cover Maps for eight regions are shown in Fig. 13. Note that the model produces different spatial patterns of urbanization depending on the location: urbanization in the Los Angeles-San Diego region is more clumped, likely due to topographic limitations within the large metropolitan area, whereas dispersed urbanization is characteristic of flat areas like Florida, Atlanta, and the Northeast.

Fig. 13. LTM 2050 urban change forecasts for different regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
5. Discussion
We have presented an overview of the conversion of a single-workstation land change model to operate on a high performance computing cluster. The Land Transformation Model was originally developed to simulate small areas (Pijanowski et al., 2000, 2002b) such as watersheds. However, there is a need for larger land change models, especially ones that can be coupled to large-scale process models such as climate models (cf. Olson et al., 2008; Pijanowski et al., 2011) and dynamic hydrologic models (Yang et al., 2010; Mishra et al., 2010). We have argued that to accomplish the goal of increasing the size of the simulation, several challenges had to be overcome. These included (1) processing of large databases, (2) the management of large numbers of files, (3) the need for a high-level architecture that integrates model components, (4) error checking, and (5) the management of multiple job executions. Here we briefly discuss how we addressed these challenges, as well as lessons learned in porting the original LTM to an HPC environment.
5.1. Challenges of executing large-scale models
We found that the large datasets used for input and output were difficult to manage successfully within ArcGIS 10.0. The files had to be managed as smaller subsets, either as states or regions (i.e., multiple states), and in the case of Texas we had to manage the state as separate counties. Programs written in C# had to read and write lines of data one at a time rather than read large files into a single large array. This was necessary despite the large amount of memory available on the HPC.
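The production code streams the grids in C#; the short Python sketch below illustrates the same idea for an ESRI ASCII grid, handling one row of cells at a time so that memory use does not depend on the size of the file. The file names and the per-cell transformation are placeholders.

def stream_ascii_grid(in_path, out_path, transform):
    # Process an ESRI ASCII grid row by row instead of loading the whole
    # raster into memory at once.
    with open(in_path) as src, open(out_path, "w") as dst:
        for _ in range(6):               # copy the six header lines (ncols, nrows, ...)
            dst.write(src.readline())
        for line in src:                 # one row of cells at a time
            row = [transform(int(v)) for v in line.split()]
            dst.write(" ".join(str(v) for v in row) + "\n")

# Example: recode value 1 (new urban) to 21 and leave all other cells unchanged.
# stream_ascii_grid("in_state.asc", "out_state.asc", lambda v: 21 if v == 1 else v)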
The large number of files was managed using a standard file-naming coding system and a hierarchical arrangement of folders on our storage server. The coding system also helped us to construct the XML file content used by the job manager in Windows Server 2008 R2.
The high-level architecture was designed after the steps that have been outlined by prominent land change modeling scientists (Pontius et al., 2004; Pontius et al., 2008). These include steps for (1) data sampling from input files, (2) training, (3) calibration, (4) validation, and (5) application. Job files were constructed to interface each of these modeling steps. In fact, we quickly discovered that the most logical directory structure mirrored the high-level architecture of the model.
We found that jobs or tasks can fail because of one of the following errors. (1) One or more tasks in the job have failed: this indicates that one or more tasks could not be run or did not complete successfully; we specified standard output and standard error files in the job description to determine which executable files fail during execution. (2) A node assigned to the job or task could not be contacted: jobs or tasks that fail because a node falls out of contact are automatically retried a certain number of times, but will eventually fail if the problem continues. (3) The run time for a job or task expired: the job scheduler service cancels jobs or tasks that reach the end of their run time. (4) A file location required by the job or task could not be accessed: a frequent cause of task failures is inaccessibility of required file locations, including the standard input, output, and error files and the working directory locations.
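Because every task writes to its own standard error file, failures of types (1) and (4) can be located after a job finishes by scanning those files. The sketch below shows one hypothetical way to do this in Python, assuming the per-task stderr files are stored under one folder per state as in our naming scheme.

import os

def find_failed_tasks(stderr_root):
    # Any non-empty standard error file indicates a task that reported an error.
    failed = []
    for dirpath, _, filenames in os.walk(stderr_root):
        for name in filenames:
            if name == "stderr.txt":
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) > 0:
                    failed.append(path)
    return failed

# Example: list the tasks that reported errors for one state.
# for path in find_failed_tasks("D:/ltm_hpc/pred/IN"):
#     print("task failed, see", path)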
5.2. Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simulation meant that only a simple folder structure was needed, which made it easy to mosaic them manually. However, the prediction output was stored by state and by year, which made mosaicking a time-consuming and error-prone process; in some cases we needed to manually mosaic a few areas because the job manager would crash. The HPC was employed to speed up the mosaicking process, but this was not a fail-safe process. A short Python script that ran the ArcGIS mosaic raster tool was the heart of the process. The 9000 network files of the LTM-HPC generated from the training run were applied to each pattern file derived from boxes that contained all of the cells in the USA except those within the exclusionary zone. Finally, states were manually mosaicked to create the national probability map for the USA (Fig. 13).
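The mosaicking script is not listed in the paper; the following is a minimal sketch of the kind of script that could drive the ArcGIS mosaic raster tool, assuming the ArcGIS 10 Python site package (arcpy) and a folder of per-polygon probability rasters for one state. The paths and raster names are hypothetical.

import arcpy

# Folder holding the per-polygon probability rasters for one state.
arcpy.env.workspace = "D:/ltm_hpc/prob/IN"
rasters = arcpy.ListRasters("prob_*.tif")

# Mosaic the pieces into a single state-level probability raster.
arcpy.MosaicToNewRaster_management(
    input_rasters=rasters,
    output_location="D:/ltm_hpc/prob/mosaics",
    raster_dataset_name_with_extension="IN_prob.tif",
    pixel_type="32_BIT_FLOAT",
    number_of_bands=1,
    mosaic_method="MAXIMUM")   # keep the larger value where pieces overlap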
A Windows HPC cluster was used to decrease the time required to process the data by running the model on multiple spatial units simultaneously. The time required to run the LTM can be thought of as the time it would take to run the LTM-HPC serially. When running the LTM-HPC, the amount of time required relative to the LTM is approximately halved for every doubling of cores. Variation in processing time across tasks is caused by variance in file size. The HPC also provides additional benefits to researchers who are interested in running large-scale models. These include a reduction in the need for human control of various steps, which thereby reduces the chance of human error. It also allows researchers to execute the model in a variety of configurations (e.g., here we were able to run the model using different spatial units, testing issues related to scale), allowing researchers to run ensembles.
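As a back-of-the-envelope illustration of this scaling, under the idealized assumption of perfect parallelism the wall-clock time is simply the serial time divided by the number of cores; real runs deviate from this because file sizes, and therefore task lengths, vary.

def ideal_runtime(serial_hours, cores):
    # Idealized scaling: time is approximately halved for every doubling of cores.
    return serial_hours / float(cores)

# Example: a 96-hour serial run would take roughly 12 hours on 8 cores,
# ignoring uneven file sizes and scheduling overhead.
# print(ideal_runtime(96, 8))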
We also found that developing and executing the model across three computer systems (data storage, data processing, and coding and simulation) worked well. Delegating tasks to each of these helped to manage workflow and optimize the purpose of each computer system.
5.3. Needs for land change model forecasts at large extents and fine resolution
Models that must simulate large areas at fine resolutions and produce output at multiple time steps require the handling and management of big data. Environmental simulations have traditionally focused on small spatial extents at fine resolutions to produce the required output. However, environmental problems often occur at large extents, and simulations at coarse resolutions, or alternatively at small extents and fine resolutions, may hinder the ability to assess impacts at the necessary scale. Land change models are often used to assess how human use of the land may impact ecosystem health. It is well known that land use/cover change impacts ecosystem processes at a variety of spatial scales (Reid et al., 2010; GLP, 2005; Lambin and Geist, 2006). Some of the most frequently cited ecosystem impacts include how land use change at large extents affects the total amount of carbon sequestered in aboveground plants and soils in a region (e.g., Dixon et al., 1994; Post and Kwon, 2000; Cox et al., 2000; Vleeshouwers and Verhagen, 2002; Guo and Gifford, 2002); how patterns and amounts of certain land covers (e.g., forests, urban) affect invasive species spread and distributions (e.g., Sharov et al., 1999; Fei et al., 2008); how land surface properties feed back to the atmosphere through alterations of water and energy fluxes (e.g., Dale, 1997; Pielke, 2005; Bonan, 2008; Pijanowski et al., 2011); how certain land uses such as urban and agriculture increase nutrients and pollutants to surface and ground water bodies (Pijanowski et al., 2002b; Tang et al., 2005a,b); and how land use patterns affect the biodiversity of terrestrial (Pekin and Pijanowski, 2012) and aquatic ecosystems, such as freshwater fish (Wiley et al., 2010). In all cases, more urban land decreases ecosystem health (cf. Pickett et al., 1997; Reid et al., 2010; Grimm and Redman, 2004; Kaye et al., 2006).
Assessment of land use change impacts has often occurred by coupling land change models to other environmental models. For example, the LTM has been coupled to the Regional Atmospheric Modeling System (RAMS) to assess how land use change might impact precipitation patterns at subcontinental scales in East Africa (Moore et al., 2010), to the Variable Infiltration Capacity (VIC) model in the Great Lakes basin (Yang, 2011), and to the Long-Term Hydrologic Impact Assessment (L-THIA) model to assess how land use change might impact overland flow patterns in large regional watersheds and how nutrient fluxes and pollutants from urban change would impact stream ecosystem health in large watersheds (Tang et al., 2005a,b). The next step in our development will be to couple the output of this model to a variety of spatially explicit environmental impact models. We intend to conduct that work in the HPC environment using the principles that we outlined above.
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or include the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules need to be explored further to determine how they would be applied to the model. It is also quite possible that such a large area may be heterogeneous; several transition rules may need to be applied in the same simulation, with rules applied to areas based on another higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are relatively small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare and of a threatened category (Pekin and Pijanowski, 2012), and so global scale simulations are likely to need models like the LTM-HPC.
The LTM-HPC could also support national or regional scale environmental programmatic assessments, which are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and (5) measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Quantity errors were greater in the eastern portion of the United States, whereas the pattern of location errors was different: they were lower in the east (Fig. 12). The location of errors could also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tends to be spatially heterogeneous, need to be taken into consideration. Current model goodness-of-fit metrics have not been designed with large big data simulations such as the one presented here in mind; thus more research in this area is needed to make a full assessment of how well a model like this performs.
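The windowed metrics shown in Fig. 12 can be computed by aggregating the observed and simulated urban maps into scalable windows (3 × 3 km corresponds to 100 × 100 cells at 30 m). The numpy sketch below computes a per-window quantity error (difference in the number of urban cells) and a simple percent-correct agreement; it assumes two equal-sized binary arrays and is only an approximation of the metrics used in the paper.

import numpy as np

def window_metrics(observed, simulated, win=100):
    # observed, simulated: binary (0/1) urban arrays of identical shape.
    rows, cols = observed.shape[0] // win, observed.shape[1] // win
    quantity_error = np.zeros((rows, cols))
    percent_correct = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            obs = observed[i*win:(i+1)*win, j*win:(j+1)*win]
            sim = simulated[i*win:(i+1)*win, j*win:(j+1)*win]
            # Quantity error: mismatch in the amount of urban cells per window.
            quantity_error[i, j] = int(sim.sum()) - int(obs.sum())
            # Cell-by-cell agreement within the window.
            percent_correct[i, j] = 100.0 * np.mean(obs == sim)
    return quantity_error, percent_correct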
6. Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional scales, such as a continent. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.

Finally, we found that the LTM-HPC has some significant advantages over the single-workstation version of the LTM. These include:
(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for the unit, in the Python environment (see the sketch after this list).
(2) Better memory usage: the source code for the model in the C# environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of
the ASCII files, by reading each line separately into an array in the C# environment.
(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for different regions in XML format and can repeat all the processes simultaneously for different regions.
(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are processed automatically.
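As an illustration of advantage (1), the data preparation step can be expressed with the ArcGIS Python site package by selecting one unit (a state, county, or other division) by its unique identifier, clipping the national raster to it, and exporting the result to ASCII. This is a sketch only; the layer, field, and path names are hypothetical.

import arcpy

def clip_unit_to_ascii(national_raster, units_fc, id_field, unit_id, out_dir):
    # Select the polygon for one unit (e.g., a state FIPS code).
    layer = "unit_lyr"
    arcpy.MakeFeatureLayer_management(units_fc, layer,
                                      "%s = '%s'" % (id_field, unit_id))
    # Clip the national raster to that polygon's geometry.
    clipped = out_dir + "/clip_" + unit_id + ".tif"
    arcpy.Clip_management(national_raster, "#", clipped, layer,
                          "255", "ClippingGeometry")
    # Convert the clipped raster to the ASCII format read by the LTM code.
    arcpy.RasterToASCII_conversion(clipped, out_dir + "/unit_" + unit_id + ".asc")

# clip_unit_to_ascii("D:/ltm_hpc/nlcd2006.tif", "D:/ltm_hpc/states.shp",
#                    "STATE_FIPS", "18", "D:/ltm_hpc/ascii")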
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Geochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA, Millennium Ecosystem Assessment, 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010. Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. Neural Netw. (IJCNN), 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997. A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center, June 1997.
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1519
polygons within the state as different tasks to create forecast maps
of each polygon We embedded the path of prediction code and
number of urban cells to grow for each polygon within
XML_Pred_BASE job 1047297le We ran XML_Pred_BASE job 1047297le for each
state on HPC to convert the probability map to forecast map for
each polygon Then we ran the Mosaic_Python script on the HPC
using XML_Pred_Mosaic_BASE to mosaic the prediction pieces at
the polygon level to create forecast maps at state levelSimilarly we
ran Mosaic_Python script on HPC using XML_Pred_Mosaic_Na-
tional to mosaic prediction pieces at state level to create a national
forecast map HPC also enabled us to export error messages in error
1047297les so that if any of tasks fail in a job using standard out and
standard error 1047297les to have records of program did during execu-
tion We also embedded the path of standard out and standard
error 1047297les in the tasks of the XML jobs 1047297le
Wegenerated decadal maps of land use from 2010 through 2050
from this simulation Maps of new urban (red) superimposed on
2006 land usecover from the USGS National Land Cover Maps for
eight regions are shown in Fig 13 Note that the model produces
different spatial patterns of urbanization depending on the loca-
tion urbanization in the Los AngeleseSan Diego region are more
clumped likely to due to topographic limitations of the area in the
large metropolitan area Dispersed urbanization is characteristic of 1047298at areas like Florida Atlanta and the Northeast
5 Discussion
We presented an overview of the conversion of a single work-
station land change model that has been converted to operate using
a high performance computer cluster The Land Transformation
Model was originally developedto simulatesmall areas (Pijanowski
et al 2000 2002b) such as watersheds However there is a need
for larger sized land change models especially those that can be
coupled to large-scale process models such as climate change (cf
Olson et al 2008 Pijanowski et al 2011) and dynamic hydrologic
models (Yang et al 2010 Mishra et al 2010) We have argued that
to accomplish the goal of increasing the size of the simulationseveral challenges had to be overcome These included (1) pro-
cessing of large databases (2) the management of large numbers of
1047297les (3) the need for a high-level architecture that integrates
model components (4) error checking and (5) the management of
multiple job executions Here we brie1047298y discuss how we addressed
these challenges as well as lessons learned in porting the original
LTM to an HPC environment
51 Challenges of executing large-scale models
We found that the large datasets used for input and output were
dif 1047297cult to manage successfully within ArcGIS 100 The 1047297les had to
be managed as smaller subsets either as states or regions (ie
multiple states) and in the case of Texas we had to manage thisstate as separate counties Programs written in C had to read and
write lines of data at a time rather than read large 1047297les into a large
array This is needed despite a large amount of memory contained
in the HPC
The large number of 1047297les were managed using a standard 1047297le
naming coding system and hierarchical arrangement of folders on
our storage server The coding system also helped us to construct
the xml 1047297le content used by the job manager in Windows 2008
Server R2
The high-level architecture was designed after the proper steps
that have been outlined by prominent land change modeling sci-
entists (Pontius et al 2004 Pontius et al 2008) These include
steps for (1) data sampling from input 1047297les (2) training (3) cali-
bration (4) validation and (5) application Job 1047297
les were
constructed for steps the interfaced each of these modeling steps
In fact we found quickly discovered that the most logical directory
structure mirrored the high-level architecture of the model
We experienced that jobs or tasks can fail because of one of the
following errors (1) one or more tasks in the job have failed This
indicates that one or more tasks could not be run or did not com-
plete successfully We speci1047297ed standard output and standard error
1047297les in the job description to determine which executable 1047297les fail
during execution (2) A nodeassignedto the job ortaskcouldnot be
contacted Jobs or tasks that fail because of a node falling out of
contact are automatically retried a certain numberof times but will
eventually failif the problem continues (3) The run time for a job or
task run expired The job scheduler service cancels jobs or tasks
that reach the end of their run time (4) A 1047297le location required by
the job or task could not be accessed A frequent cause of task
failures is inaccessibility of required 1047297le locations including the
standard input output and error 1047297les and the working directory
locations
52 Lessons learned from converting the LTM to an HPC
The limited number of probability maps created in our simula-
tion meant that simple folder structure were only needed whichmade it easy to mosaic manually However the prediction output
was stored by state and by year which made mosaicking a time
consuming and an error prone process in some cases we needed
to manually mosaic a few areas as the job manager would crash
The HPC was employed to speed up the mosaicking process but this
was not a fail-safe process A short python script that ran the ArcGIS
mosaic raster tool was the heart of the process The 9000 network1047297les of LTM-HPC generated from training run were applied to each
pattern 1047297le derived from boxes that contained all of the cells in the
USA except those within the exclusionary zone Finally states were
manually mosaicked to create the national probability map for USA
(Fig 13)
A windows HPC cluster was used to decrease the time required
to process the data by running the model on multiple spatial unitssimultaneously The time required to run LTM can be thought of as
the time it would take to run the LTM-HPC serially When running
the LTM-HPC the amount of time required relative to LTM is
approximately halved for every doubling of cores This variance in
processing time is caused by variance in 1047297le size The HPC also
provides additional bene1047297ts to researchers who are interested in
running large-scale models These include the reduction in the
need for human control of various steps which thereby reduces the
changes of human error It also allows researchers to execute the
model in a variety of con1047297gurations (eg here we were able to run
the model using different spatial units testing issues related to
scale) allowing for researchers to run ensembles
We also found that developing and executing the model across
three computer systems (data storage data processing and codingand simulation) worked well Delegating tasks to each of these
helped to manage work 1047298ow and optimize the purpose of each
computer system
53 Needs for land change model forecasts at large extents and 1047297ne
resolution
Models that must simulate large areas at 1047297ne resolutions and
produce output that has multiple time steps require the handling
and management of big data Environmental simulations have
traditionally focused on small spatial extents at 1047297ne resolutions to
produce the required output However environmental problems
are often at large extents and coarse resolution simulations or
alternatively at small extents and 1047297
ne resolutions may hinder the
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268264
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1619
ability to assess impacts at the necessary scale Land change models
are often used to assess how human use of the land may impact
ecosystem health It is well known that land use cover change
impacts ecosystem processes at a variety of spatial scales ( Reid
et al 2010 GLP 2005 Lambin and Geist 2006) Some of the
most frequently cited ecosystem impacts include how land use
change at large extents affect the total amount of carbon
sequestered in aboveground plants and soils in a region (eg Dixon
et al 1994 Post and Kwon 2000 Cox et al 2000 Vleeshouwers
and Verhagen 2002 Guo and Gifford 2002) how patterns and
amounts of certain land covers (eg forests urban) affect invasive
species spread and distributions (eg Sharov et al 1999 Fei et al
2008) how land surface properties feedback to the atmosphere
through alterations of water and energy 1047298
uxes (eg Dale 1997
Fig 13 LTM 2050 urban change forecasts for different regions (For interpretation of the references to color in this 1047297gure legend the reader is referred to the web version of this
article)
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 265
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1719
Pielke 2005 Bonan 2008 Pijanowski et al 2011) how certain
land uses such as urban and agriculture increase nutrients and
pollutants to surface and ground water bodies (Pijanowski et al
2002b Tang et al 2005ab) and how land use patterns affect
biodiversity of terrestrial (Pekin and Pijanowski 2012) and aquatic
ecosystems such as freshwater 1047297sh organisms (Wiley et al 2010)
In all cases more urban decreases ecosystem health (cf Pickett
et al 1997 Reid et al 2010 Grimm and Redman 2004 Kaye
et al 2006)
Assessment of land use change impacts has often occurred by
coupling land change models to other environmental models For
example the LTM has been coupled to the Regional Atmospheric
Modeling Systems (RAMS) to assess how land use change might
impact precipitation patterns at subcontinental scales in East Africa
(Moore et al 2010) coupled to the Variable Impact Calculator (VIC)
model in the Great Lakes basin (Yang 2011) to the Long-Term Hy-
drologic Impact Assessment (L-THIA) model assess how land use
change might impact overland 1047298ow patterns in large regional wa-
tersheds and how nutrient1047298uxes and pollutants from urban change
would impact stream ecosystem health in large watersheds (Tang
et al 2005ab) The next step in our development will be to
couple the output of this model to a variety of environmental
impact models that are spatially explicit We intend to conduct thatwork in the HPC environment using the principles that we outline
above
The LTM-HPC model presented here can also be modi1047297ed to
address other land change transitions For example the LTM-
HPC can be con1047297gured to simulate multiple transitions at a
time this might include the loss of urban along with urban gain
(which is simulated here) or include the numerous land tran-
sitions common to many areas of the world namely the loss of
natural lands like forests to agriculture the shift of agriculture
to urban the loss of forests to urban and the transition of
recently disturbance areas (eg shrubland) to more mature
natural lands like forests To accomplish multiple transitions a
variety of rules need to be explored further to determine how
they would be applied to the model It is also quite possible thatsuch a large may be heterogeneous several transition rules may
need to be applied in the same simulation with rules applied to
areas based on another higher-level rule The LTM-HPC could
also be con1047297gured to simulate subclasses of land use following
Dietzel and Clarke (2006) For example within the urban class
parking lots in the United States cover large extents but are
relatively small areas (Davis et al 2010ab) such an application
could require the LTM-HPC because 1047297ne resolutions would be
needed Likewise simulating crop cover types annually at a
national scale (cf Plourde et al 2013) could product a consid-
erable amount of temporal information At a global scale we
have found that subclasses of land usecover in1047298uence species
diversity patterns especially those vertebrates that are rare and
of a threatened category (Pekin and Pijanowski 2012) and soglobal scale simulations are likely to need models like LTM-HPC
The LTM-HPC could also support national or regional scale
environmental programmatic assessments that are becoming more
common supported by national government agencies These
include the 2013 United States National Climate Assessment Pro-
gram (USGCRP 2013) National Ecological Observation Network
(NEON) supported in the United States by the National Science
Foundation (Schimel et al 2007 Kampe et al 2010) and the Great
Lakes RestorationInitiative which seeks to develop State of the Lake
Ecosystem metrics of ecosystem services (SOLEC Bertram et al
2003 WHCEC 2010) In Europe an EU15 set of land use forecasts
have been used extensively to study the impacts of land use and
climate change on this continentrsquos ecosystem services (cf
Rounsevell et al 2006)
54 Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness
of 1047297t for the LTM-HPC simulations (eg Fig 12) A rigorous assess-
ment would require more effort placed on (1) 1047297ne resolution ac-
curacy (2) a quanti1047297cation of the variability of 1047297ne resolution
accuracy across the entire simulation (3) errors associated with
forecasting (ie temporal measures of model goodness of 1047297t) (4)
the relative cost of an error (ie whether an error of location is
important to the application) and measures of data input quality
We were able to show that at 3 km scales the error of location
varied considerably across the simulation domain Errors were
greater in the eastern portion of the United States for quantity
Patterns of error were different from quantity they were lower in
the eastern for quantity (Fig 12) Location of errors could be
important too if they affect the policies or outcomes of environ-
mental assessment If policies are being explored to determine the
impact of land use change in stream riparian zones model location
accuracy needs to be good along streams If environmental impacts
are being assessed then covariates such as soil which tends to be
spatially heterogeneous needs to be taken into consideration
Current model goodness of 1047297t metrics have not been designed to
consider large big data simulationssuch as the one presented herethus more research in this area is needed to make a full assessment
of how well a model like this performs
6 Conclusions
This paper presents the application of the LTM-HPC at multi-
scale using quantity drivers (a 1047297ne-scale urban land use change
model applied across the conterminous of USA) and introduces a
new version of LTM with substantially augmented functionality We
described a parallel implementation of the data and modeling
process on a cluster of multi-core processors using HPC as a data-
parallel programming framework We focus on ef 1047297ciently
handling the challenges raised by the nature of large datasets and
show how they can be addressed effectively within the computa-tion framework by optimizing the computation to adapt to the
nature of the data We signi1047297cantly enhance the training and
testing run of the LTM and enable application of the model for
region scale such as continent Future research will also be able to
use the new information generated by the LTM-HPC to address
questions related to how urban patterns relate to the process of
urban land use change Because we were able to preserve the high-
resolution of the land usedata (30m resolution) LTM-HPC provided
the capability of visualizing alternative future scenarios at a
detailed scale which helped to engage urban planner in the sce-
nario development process We believe this project represents an
important advancement in computational modeling of urban
growth patterns In terms of simulation modeling we have pre-
sented several new advancements in the LTM modelrsquos performance
and capabilities More importantly however this project repre-
sents a successful broad-scale modeling framework that has direct
applications to land use management
Finally we found that the LTM-HPC has some signi1047297cant ad-
vantages over the single workstation version of the LTM These
include
(1) Automated data preparation data can now be clipped and
converted to ASCII format automatically at the state county
or any other division using unique identity for the unit in
Python environment
(2) Better memory usage The source code for the model in C
environment has been changed making calculations per-
formedby LTM-HPCcompletelyindependent from the size of
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268266
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1819
the ASCII 1047297les by reading each code line separately into an
array using the C environment
(3) Ability to conduct simultaneous analyses LTM was not
designed to be used for different regions at the same time
LTM-HPC now uses a unique code for different regions in
XML format and can repeat all the processes simultaneously
for different regions
(4) Increased processing speed The previous version of LTM had
many disconnected steps (Pijanowski et al 2002a) which
were carried out sequentially using different DOS-level
commands All XML 1047297les are now uploaded into an HPC
environment and all modeling steps are automatically
processed
References
Adeloye AJ Rustum R Kariyama ID 2012 Neural computing modeling of thereference crop evapotranspiration Environ Model Softw 29 61e73
Anselme B Bousquet F Lyet A Etienne M Fady B Le Page C 2010 Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France)Environ Model Softw 25 (11) 1385e1398
Bennett ND Croke BFW Guariso G Guillaume JHA Hamilton SH Jakeman AJ Marsili-Libelli S Newhama LTH Norton JP Perrin CPierce SA Robson B Seppelt R Voinov AA Fath BD Andreassian V 2013Environ Model Softw 40 1e20
Bertram P Stadler-Salt N Horvatin P Shear H 2003 Bi-national assessment of the Great Lakes SOLEC partnerships Environ Monit Assess 81 (1e3) 27e33
Bishop CM 1995 Neural Networks for Pattern Recognition Oxford UniversityPress Oxford
Bishop CM 2005 Neural Networks for Pattern Recognition Oxford UniversityPress ISBN 0-19-853864-2
Bonan GB 2008 Forests and climate change forcings feedbacks and the climatebene1047297ts of forests Science 320 (5882) 1444e1449
Boutt DF Hyndman DW Pijanowski BC Long DT 2001 Identifying potentialland use-derived solute sources to stream base1047298ow using ground water modelsand GIS Ground Water 39 (1) 24e34
Burton A Kilsby C Fowler H Cowpertwait P OrsquoConnell P 2008 RainSim aspatial-temporal stochastic rainfall modeling system Environ Model Softw 23(12) 1356e1369
Buyya R (Ed) 1999 High Performance Cluster Computing Architectures andSystems vol 1 Prentice Hall Englewood Cliffs NJ
Carpani M Bergez JE Monod H 2012 Sensitivity analysis of a hierarchicalqualitative model for sustainability assessment of cropping systems EnvironModel Softw 27e28 15e22
Chapman T 1998 Stochastic modelling of daily rainfall the impact of adjoiningwet days on the distribution of rainfall amounts Environ Model Softw 13 (3e4) 317e324
Cheung AL Reeves Anthony P 1992 High performance computing on a cluster of workstations HPDC 1992 152e160
Clarke KC Gazulis N Dietzel C Goldstein NC 2007 A decade of SLEUTHing lessons learned from applications of a cellular automatonland use change model In Classics from IJGIS Twenty Years of theInternational Journal of Geographical Information Systems and Sciencepp 413e425
Cox PM Betts RA Jones CD Spall SA Totterdell IJ 2000 Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate modelNature 408 (6809) 184e187
Dale VH 1997 The relationship between land-use change and climate changeEcol Appl 7 (3) 753e769
DavisAYPijanowskiBCRobinson KD KidwellPB 2010aEstimating parkinglot
footprintsin theUpperGreatLakes regionof theUSA Landsc UrbanPlan 96 (2)68e77
Davis AY Pijanowski BC Robinson K Engel B 2010b The environmental andeconomic costs of sprawling parking lots in the United States Land Use Policy27 (2) 255e261
Denoeux T Lengelleacute1993 Initializing back propagation networks with prototypesNeural Netw 6 351e363
Dietzel C Clarke K 2006 The effect of disaggregating land use categories incellular automata during model calibration and forecasting Comput EnvironUrban Syst 30 (1) 78e101
Dietzel C Clarke KC 2007 Toward optimal calibration of the SLEUTH land usechange model Trans GIS 11 (1) 29e45
Dixon RK Winjum JK Andrasko KJ Lee JJ Schroeder PE 1994 Integratedland-use systems assessment of promising agroforest and alternative land-usepractices to enhance carbon conservation and sequestration Clim Change 27(1) 71e92
Dlamini W 2008 A Bayesian belief network analysis of factors in1047298uencing wild1047297reoccurrence in Swaziland Environ Model Softw 25 (2) 199e208
ESRI 2011 ArcGIS 10 Software
Fei S Kong N Stinger J Bowker D 2008 In Ravinder K Jose S Singh HBatish D (Eds) Invasion Pattern of Exotic Plants in Forest Ecosystems InvasivePlants and Forest Ecosystems CRC Press Boca Raton FL pp 59e70
Fitzpatrick M Long D Pijanowski B 2007 Biogeochemical 1047297ngerprints of landuse in a regional watershed Appl Biogeochem 22 1825e1840
Foster I Kesselman C 1997 Globus a metacomputing infrastructure toolkit Int JSupercomput Appl 11 (2) 115e128
Foster DR Hall B Barry S Clayden S Parshall T 2002 Cultural environmentaland historical controls of vegetation patterns and the modern conservationsetting on the island of Martha rsquos Vineyard USA J Biogeogr 29 1381e1400
GLP 2005 Science Plan and Implementation Strategy IGBP Report No 53IHDPReport No 19 IGBP Secretariat Stockholm 64 pp
Grimm NB Redman CL 2004 Approaches to the study of urban ecosystems thecase of Central ArizonadPhoenix Urban Ecosyst 7 (3) 199e213
Guo LB Gifford RM 2002 Soil carbon stocks and land use change a metaanalysis Glob Change Biol 8 (4) 345e360
Herold M Goldstein NC Clarke KC 2003 The spatiotemporal form of urbangrowth measurement analysis and modeling Remote Sens Environ 86 (3)286e302
Herold M Couclelis H Clarke KC 2005 The role of spatial metrics in the analysisand modeling of urban land use change Comput Environ Urban Syst 29 (4)369e399
Hey AJ 2009 The Fourth Paradigm Data-intensive Scienti1047297c Discovery Jacobs A 2009 The pathologies of big data Commun ACM 52 (8) 36e44Kampe TU Johnson BR Kuester M Keller M 2010 NEON the 1047297rst continental-
scale ecological observatory with airborne remote sensing of vegetation canopybiochemistry and structure J Appl Remote Sens 4 (1) 043510e043510
Kaye JP Groffman PM Grimm NB Baker LA Pouyat RV 2006 A distincturban biogeochemistry Trends Ecol Evol 21 (4) 192e199
Kilsby C Jones P Burton A Ford A Fowler H Harpham C James P Smith AWilby R 2007 A daily weather generator for use in climate change studiesEnviron Model Softw 22 (12) 1705e1719
Lagabrielle E Botta A Dareacute W David D Aubert S Fabricius C 2010 Modelingwith stakeholders to integrate biodiversity into land-use planning lessonslearned in Reacuteunion Island (Western Indian Ocean) Environ Model Softw 25(11) 1413e1427
Lambin EF Geist HJ (Eds) 2006 Land Use and Land Cover Change Local Pro-cesses and Global Impacts Springer
LaValle S Lesser E Shockley R Hopkins MS Kruschwitz N 2011 Big data analytics and the path from insights to value MIT Sloan Manag Rev 52 (2)21e32
Lei Z Pijanowski BC Alexandridis KT Olson J 2005 Distributed modelingarchitecture of a multi-agent-based behavioral economic landscape (MABEL)model Simulation 81 (7) 503e515
Loepfe L Martiacutenez-Vilalta J Pintildeol J 2011 An integrative model of humanin1047298uenced 1047297re regimes and landscape dynamics Environ Model Softw 26 (8)1028e1040
Lynch C 2008 Big data how do your data grow Nature 455 (7209) 28e
29Mas JF Puig H Palacio JL Sosa AA 2004 Modeling deforestation using GIS andarti1047297cial neural networks Environ Model Softw 19 (5) 461e471
MEA Millennium Ecosystem Assessment 2005 Ecosystems and Human Well-being Current State and Trends Island Press Washington DC
Merritt WS Letcher RA Jakeman AJ 2003 A review of erosion and sedimenttransport models Environ Model Softw 18 (8e9) 761e799
Mishra V Cherkauer K Niyogi D Ming L Pijanowski B Ray D Bowling L2010 Regional scale assessment of land useland cover and climatic changes onsurface hydrologic processes Int J Climatol 30 2025e2044
Moore N Torbick N Lofgren B Wang J Pijanowski B Andresen J Kim DOlson J 2010 Adapting MODIS-derived LAI and fractional cover into theRegional Atmospheric Modeling System (RAMS) in East Africa Int J Climatol30 (3) 1954e1969
Moore N Alargaswamy G Pijanowski B Thornton P Lofgren B Olson JAndresen J Yanda P Qi J 2011 East African food security as in1047298uenced byfuture climate change and land use change at local to regional scales ClimChange httpdxdoiorg101007s10584-011-0116-7
Olson J Alagarswamy G Andresen J Campbell D Davis A Ge J Huebner M
Lofgren B Lusch D Moore N Pijanowski B Qi J Thornton P Torbick NWang J 2008 Integrating diverse methods to understand climate-land in-teractions in east Africa GeoForum 39 (2) 898e911
Pekin BK Pijanowski BC 2012 Global land use intensity and the endangermentstatus of mammal species Divers Distrib 18 (9) 909e918
Peralta J Li X Gutierrez G Sanchis A 2010 July Time series forecasting byevolving arti1047297cial neural networks using genetic algorithms and differentialevolution Neural Netw e IJCNN 1e8
Peacuterez-Vega A Mas JF Ligmann A 2012 Comparing two approaches to land usecover change modeling and their implications for the assessment of biodiver-sity loss in a deciduous tropical forest Environ Model Softw 29 11e23
Pickett ST Burch Jr WR Dalton SE Foresman TW Grove JM Rowntree R1997 A conceptual framework for the study of human ecosystems in urbanareas Urban Ecosyst 1 (4) 185e199
Pielke RA 2005 Land use and climate change Science 310 (5754) 1625e1626Pijanowski BC Long DT Gage SH Cooper WE 1997 June A Land Trans-
formation Model Conceptual Elements Spatial Object Class Hierarchies GISCommand Syntax and an Application for Michiganrsquos Saginaw Bay Watershed InSubmitted to the Land Use Modeling Workshop USGS EROS Data Center
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 267
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1919
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1619
ability to assess impacts at the necessary scale Land change models
are often used to assess how human use of the land may impact
ecosystem health It is well known that land use cover change
impacts ecosystem processes at a variety of spatial scales ( Reid
et al 2010 GLP 2005 Lambin and Geist 2006) Some of the
most frequently cited ecosystem impacts include how land use
change at large extents affect the total amount of carbon
sequestered in aboveground plants and soils in a region (eg Dixon
et al 1994 Post and Kwon 2000 Cox et al 2000 Vleeshouwers
and Verhagen 2002 Guo and Gifford 2002) how patterns and
amounts of certain land covers (eg forests urban) affect invasive
species spread and distributions (eg Sharov et al 1999 Fei et al
2008) how land surface properties feedback to the atmosphere
through alterations of water and energy 1047298
uxes (eg Dale 1997
Fig 13 LTM 2050 urban change forecasts for different regions (For interpretation of the references to color in this 1047297gure legend the reader is referred to the web version of this
article)
BC Pijanowski et al Environmental Modelling amp Software 51 (2014) 250e 268 265
8122019 A Big Data Urban Growth big dataSimulation at a National Scale
httpslidepdfcomreaderfulla-big-data-urban-growth-big-datasimulation-at-a-national-scale 1719
Pielke 2005 Bonan 2008 Pijanowski et al 2011) how certain
land uses such as urban and agriculture increase nutrients and
pollutants to surface and ground water bodies (Pijanowski et al
2002b Tang et al 2005ab) and how land use patterns affect
biodiversity of terrestrial (Pekin and Pijanowski 2012) and aquatic
ecosystems such as freshwater 1047297sh organisms (Wiley et al 2010)
In all cases more urban decreases ecosystem health (cf Pickett
et al 1997 Reid et al 2010 Grimm and Redman 2004 Kaye
et al 2006)
Assessment of land use change impacts has often occurred by
coupling land change models to other environmental models For
example the LTM has been coupled to the Regional Atmospheric
Modeling Systems (RAMS) to assess how land use change might
impact precipitation patterns at subcontinental scales in East Africa
(Moore et al 2010) coupled to the Variable Impact Calculator (VIC)
model in the Great Lakes basin (Yang 2011) to the Long-Term Hy-
drologic Impact Assessment (L-THIA) model assess how land use
change might impact overland 1047298ow patterns in large regional wa-
tersheds and how nutrient1047298uxes and pollutants from urban change
would impact stream ecosystem health in large watersheds (Tang
et al 2005ab) The next step in our development will be to
couple the output of this model to a variety of environmental
impact models that are spatially explicit We intend to conduct thatwork in the HPC environment using the principles that we outline
above
The LTM-HPC model presented here can also be modified to address other land change transitions. For example, the LTM-HPC can be configured to simulate multiple transitions at a time; this might include the loss of urban along with urban gain (which is simulated here), or the numerous land transitions common to many areas of the world, namely the loss of natural lands like forests to agriculture, the shift of agriculture to urban, the loss of forests to urban, and the transition of recently disturbed areas (e.g., shrubland) to more mature natural lands like forests. To accomplish multiple transitions, a variety of rules would need to be explored further to determine how they should be applied to the model. It is also quite possible that such a large simulation domain would be heterogeneous; several transition rules may need to be applied in the same simulation, with rules assigned to areas based on another, higher-level rule. The LTM-HPC could also be configured to simulate subclasses of land use, following Dietzel and Clarke (2006). For example, within the urban class, parking lots in the United States cover large extents but are individually small areas (Davis et al., 2010a,b); such an application could require the LTM-HPC because fine resolutions would be needed. Likewise, simulating crop cover types annually at a national scale (cf. Plourde et al., 2013) could produce a considerable amount of temporal information. At a global scale, we have found that subclasses of land use/cover influence species diversity patterns, especially for vertebrates that are rare or threatened (Pekin and Pijanowski, 2012), and so global-scale simulations are likely to need models like the LTM-HPC.
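To make the idea of a higher-level rule concrete, the following Python sketch shows one way region-specific transition rules could be applied to a gridded land cover map. It is illustrative only: the cover codes, the `region_rule_sets` mapping, and the `apply_transitions` helper are hypothetical names, not part of the LTM-HPC code base.

```python
# Illustrative sketch: a higher-level rule (the region id) selects which
# transition rules apply where; rule sets and names are hypothetical.
import numpy as np

FOREST, AGRICULTURE, URBAN, SHRUBLAND = 1, 2, 3, 4  # hypothetical cover codes

# Higher-level rule: each region id maps to the transitions allowed there.
region_rule_sets = {
    1: [(FOREST, AGRICULTURE), (AGRICULTURE, URBAN)],  # e.g., agricultural frontier
    2: [(FOREST, URBAN), (SHRUBLAND, FOREST)],         # e.g., forested exurban region
}

def apply_transitions(land_cover, regions, suitability, threshold=0.5):
    """Apply region-specific transition rules to a land cover grid.

    land_cover : 2-D array of cover codes
    regions    : 2-D array of region ids (the higher-level rule)
    suitability: dict mapping (from_code, to_code) -> 2-D array of ANN scores
    """
    out = land_cover.copy()
    for region_id, rules in region_rule_sets.items():
        in_region = regions == region_id
        for src, dst in rules:
            score = suitability[(src, dst)]
            change = in_region & (land_cover == src) & (score >= threshold)
            out[change] = dst
    return out
```

In such a scheme the per-region rule sets, rather than a single global rule, would determine which ANN suitability surfaces are consulted for each cell.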
The LTM-HPC could also support national or regional scale environmental programmatic assessments that are becoming more common and are supported by national government agencies. These include the 2013 United States National Climate Assessment Program (USGCRP, 2013), the National Ecological Observatory Network (NEON), supported in the United States by the National Science Foundation (Schimel et al., 2007; Kampe et al., 2010), and the Great Lakes Restoration Initiative, which seeks to develop State of the Lakes Ecosystem (SOLEC) metrics of ecosystem services (Bertram et al., 2003; WHCEC, 2010). In Europe, an EU15 set of land use forecasts has been used extensively to study the impacts of land use and climate change on this continent's ecosystem services (cf. Rounsevell et al., 2006).
5.4. Calibration and validation of big data simulations
We presented a preliminary assessment of the model goodness of fit for the LTM-HPC simulations (e.g., Fig. 12). A rigorous assessment would require more effort placed on (1) fine resolution accuracy, (2) a quantification of the variability of fine resolution accuracy across the entire simulation, (3) errors associated with forecasting (i.e., temporal measures of model goodness of fit), (4) the relative cost of an error (i.e., whether an error of location is important to the application), and measures of data input quality. We were able to show that at 3 km scales the error of location varied considerably across the simulation domain. Errors of quantity were greater in the eastern portion of the United States, and patterns of location error differed from those of quantity error (Fig. 12). The location of errors could also be important if it affects the policies or outcomes of an environmental assessment. If policies are being explored to determine the impact of land use change in stream riparian zones, model location accuracy needs to be good along streams. If environmental impacts are being assessed, then covariates such as soil, which tend to be spatially heterogeneous, need to be taken into consideration. Current model goodness of fit metrics have not been designed for large big data simulations such as the one presented here; thus more research in this area is needed to make a full assessment of how well a model like this performs.
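As an illustration of how such coarse-scale error measures can be computed, the sketch below aggregates 30 m observed and simulated urban-change maps into 3 km blocks (100 x 100 cells) and reports a per-block quantity error. This is a minimal example under our own assumptions, not the exact procedure behind Fig. 12; `observed` and `simulated` are assumed to be binary change arrays of equal shape.

```python
# Minimal sketch: aggregate 30 m binary change maps to 3 km blocks and
# compute per-block quantity error (simulated minus observed change cells).
import numpy as np

def block_quantity_error(observed, simulated, block=100):
    """Return a coarse grid of (simulated - observed) change-cell counts.

    block=100 corresponds to 3 km blocks when cells are 30 m.
    """
    rows = (observed.shape[0] // block) * block
    cols = (observed.shape[1] // block) * block
    obs = observed[:rows, :cols].reshape(rows // block, block, cols // block, block)
    sim = simulated[:rows, :cols].reshape(rows // block, block, cols // block, block)
    obs_counts = obs.sum(axis=(1, 3))
    sim_counts = sim.sum(axis=(1, 3))
    return sim_counts - obs_counts  # positive values indicate over-prediction

# Example with synthetic data:
# rng = np.random.default_rng(0)
# observed = (rng.random((3000, 3000)) < 0.02).astype(np.uint8)
# simulated = (rng.random((3000, 3000)) < 0.02).astype(np.uint8)
# err = block_quantity_error(observed, simulated)
```

Mapping such a block-level error surface across the simulation domain is one way to visualize where quantity errors, as opposed to location errors, dominate.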
6. Conclusions
This paper presents the application of the LTM-HPC at multiple scales using quantity drivers (a fine-scale urban land use change model applied across the conterminous USA) and introduces a new version of the LTM with substantially augmented functionality. We described a parallel implementation of the data and modeling process on a cluster of multi-core processors, using the HPC as a data-parallel programming framework. We focused on efficiently handling the challenges raised by the nature of large datasets and showed how they can be addressed effectively within the computational framework by optimizing the computation to adapt to the nature of the data. We significantly enhanced the training and testing runs of the LTM and enabled application of the model at regional to continental scales. Future research will also be able to use the new information generated by the LTM-HPC to address questions related to how urban patterns relate to the process of urban land use change. Because we were able to preserve the high resolution of the land use data (30 m resolution), the LTM-HPC provided the capability of visualizing alternative future scenarios at a detailed scale, which helped to engage urban planners in the scenario development process. We believe this project represents an important advancement in computational modeling of urban growth patterns. In terms of simulation modeling, we have presented several new advancements in the LTM model's performance and capabilities. More importantly, however, this project represents a successful broad-scale modeling framework that has direct applications to land use management.
Finally, we found that the LTM-HPC has some significant advantages over the single workstation version of the LTM. These include:

(1) Automated data preparation: data can now be clipped and converted to ASCII format automatically at the state, county, or any other division, using a unique identifier for each unit, within the Python environment (see the sketch after this list).

(2) Better memory usage: the source code for the model in the C environment has been changed, making calculations performed by the LTM-HPC completely independent of the size of the ASCII files by reading each line separately into an array.

(3) Ability to conduct simultaneous analyses: the LTM was not designed to be used for different regions at the same time; the LTM-HPC now uses a unique code for each region in XML format and can repeat all of the processes simultaneously for different regions.

(4) Increased processing speed: the previous version of the LTM had many disconnected steps (Pijanowski et al., 2002a) which were carried out sequentially using different DOS-level commands; all XML files are now uploaded into the HPC environment and all modeling steps are automatically processed.
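The data preparation referenced in item (1) could look roughly like the following hedged sketch. It uses GDAL/OGR-style calls from Python as a stand-in for the ArcGIS-based scripts actually used; the file names, the `GEOID` attribute, the county identifiers, and the `prepare_unit` helper are illustrative assumptions, not the LTM-HPC code itself.

```python
# Hedged sketch of per-unit data preparation: clip a national 30 m raster to
# one reporting unit (e.g., a county) and write an Arc/Info ASCII grid.
# Names, paths, and the 'GEOID' field are illustrative only.
from multiprocessing import Pool
from osgeo import gdal

NATIONAL_RASTER = "national_urban_30m.tif"   # assumed input raster
COUNTY_SHAPEFILE = "counties.shp"            # assumed clipping polygons

def prepare_unit(geoid):
    """Clip the national raster to one county and export it as ASCII."""
    clipped = gdal.Warp(
        "", NATIONAL_RASTER, format="MEM",
        cutlineDSName=COUNTY_SHAPEFILE,
        cutlineWhere=f"GEOID = '{geoid}'",
        cropToCutline=True, dstNodata=-9999)
    gdal.Translate(f"ltm_input_{geoid}.asc", clipped, format="AAIGrid")
    return geoid

if __name__ == "__main__":
    geoids = ["18157", "26065", "55025"]     # hypothetical county identifiers
    with Pool() as pool:                      # units are independent, so their
        pool.map(prepare_unit, geoids)        # preparation can run in parallel
```

Because each unit is processed independently, the same pattern scales naturally to the thousands of counties in a national run, whether dispatched locally or as separate HPC jobs.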
References
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2012. Neural computing modeling of the reference crop evapotranspiration. Environ. Model. Softw. 29, 61-73.
Anselme, B., Bousquet, F., Lyet, A., Etienne, M., Fady, B., Le Page, C., 2010. Modeling of spatial dynamics and biodiversity conservation on Lure mountain (France). Environ. Model. Softw. 25 (11), 1385-1398.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Environ. Model. Softw. 40, 1-20.
Bertram, P., Stadler-Salt, N., Horvatin, P., Shear, H., 2003. Bi-national assessment of the Great Lakes: SOLEC partnerships. Environ. Monit. Assess. 81 (1-3), 27-33.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C.M., 2005. Neural Networks for Pattern Recognition. Oxford University Press. ISBN 0-19-853864-2.
Bonan, G.B., 2008. Forests and climate change: forcings, feedbacks and the climate benefits of forests. Science 320 (5882), 1444-1449.
Boutt, D.F., Hyndman, D.W., Pijanowski, B.C., Long, D.T., 2001. Identifying potential land use-derived solute sources to stream baseflow using ground water models and GIS. Ground Water 39 (1), 24-34.
Burton, A., Kilsby, C., Fowler, H., Cowpertwait, P., O'Connell, P., 2008. RainSim: a spatial-temporal stochastic rainfall modeling system. Environ. Model. Softw. 23 (12), 1356-1369.
Buyya, R. (Ed.), 1999. High Performance Cluster Computing: Architectures and Systems, vol. 1. Prentice Hall, Englewood Cliffs, NJ.
Carpani, M., Bergez, J.E., Monod, H., 2012. Sensitivity analysis of a hierarchical qualitative model for sustainability assessment of cropping systems. Environ. Model. Softw. 27-28, 15-22.
Chapman, T., 1998. Stochastic modelling of daily rainfall: the impact of adjoining wet days on the distribution of rainfall amounts. Environ. Model. Softw. 13 (3-4), 317-324.
Cheung, A.L., Reeves, A.P., 1992. High performance computing on a cluster of workstations. HPDC 1992, 152-160.
Clarke, K.C., Gazulis, N., Dietzel, C., Goldstein, N.C., 2007. A decade of SLEUTHing: lessons learned from applications of a cellular automaton land use change model. In: Classics from IJGIS: Twenty Years of the International Journal of Geographical Information Systems and Science, pp. 413-425.
Cox, P.M., Betts, R.A., Jones, C.D., Spall, S.A., Totterdell, I.J., 2000. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408 (6809), 184-187.
Dale, V.H., 1997. The relationship between land-use change and climate change. Ecol. Appl. 7 (3), 753-769.
Davis, A.Y., Pijanowski, B.C., Robinson, K.D., Kidwell, P.B., 2010a. Estimating parking lot footprints in the Upper Great Lakes region of the USA. Landsc. Urban Plan. 96 (2), 68-77.
Davis, A.Y., Pijanowski, B.C., Robinson, K., Engel, B., 2010b. The environmental and economic costs of sprawling parking lots in the United States. Land Use Policy 27 (2), 255-261.
Denoeux, T., Lengellé, R., 1993. Initializing back propagation networks with prototypes. Neural Netw. 6, 351-363.
Dietzel, C., Clarke, K., 2006. The effect of disaggregating land use categories in cellular automata during model calibration and forecasting. Comput. Environ. Urban Syst. 30 (1), 78-101.
Dietzel, C., Clarke, K.C., 2007. Toward optimal calibration of the SLEUTH land use change model. Trans. GIS 11 (1), 29-45.
Dixon, R.K., Winjum, J.K., Andrasko, K.J., Lee, J.J., Schroeder, P.E., 1994. Integrated land-use systems: assessment of promising agroforest and alternative land-use practices to enhance carbon conservation and sequestration. Clim. Change 27 (1), 71-92.
Dlamini, W., 2008. A Bayesian belief network analysis of factors influencing wildfire occurrence in Swaziland. Environ. Model. Softw. 25 (2), 199-208.
ESRI, 2011. ArcGIS 10 Software.
Fei, S., Kong, N., Stinger, J., Bowker, D., 2008. Invasion pattern of exotic plants in forest ecosystems. In: Ravinder, K., Jose, S., Singh, H., Batish, D. (Eds.), Invasive Plants and Forest Ecosystems. CRC Press, Boca Raton, FL, pp. 59-70.
Fitzpatrick, M., Long, D., Pijanowski, B., 2007. Biogeochemical fingerprints of land use in a regional watershed. Appl. Biogeochem. 22, 1825-1840.
Foster, I., Kesselman, C., 1997. Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11 (2), 115-128.
Foster, D.R., Hall, B., Barry, S., Clayden, S., Parshall, T., 2002. Cultural, environmental and historical controls of vegetation patterns and the modern conservation setting on the island of Martha's Vineyard, USA. J. Biogeogr. 29, 1381-1400.
GLP, 2005. Science Plan and Implementation Strategy. IGBP Report No. 53/IHDP Report No. 19. IGBP Secretariat, Stockholm, 64 pp.
Grimm, N.B., Redman, C.L., 2004. Approaches to the study of urban ecosystems: the case of Central Arizona-Phoenix. Urban Ecosyst. 7 (3), 199-213.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8 (4), 345-360.
Herold, M., Goldstein, N.C., Clarke, K.C., 2003. The spatiotemporal form of urban growth: measurement, analysis and modeling. Remote Sens. Environ. 86 (3), 286-302.
Herold, M., Couclelis, H., Clarke, K.C., 2005. The role of spatial metrics in the analysis and modeling of urban land use change. Comput. Environ. Urban Syst. 29 (4), 369-399.
Hey, A.J., 2009. The Fourth Paradigm: Data-intensive Scientific Discovery.
Jacobs, A., 2009. The pathologies of big data. Commun. ACM 52 (8), 36-44.
Kampe, T.U., Johnson, B.R., Kuester, M., Keller, M., 2010. NEON: the first continental-scale ecological observatory with airborne remote sensing of vegetation canopy biochemistry and structure. J. Appl. Remote Sens. 4 (1), 043510.
Kaye, J.P., Groffman, P.M., Grimm, N.B., Baker, L.A., Pouyat, R.V., 2006. A distinct urban biogeochemistry. Trends Ecol. Evol. 21 (4), 192-199.
Kilsby, C., Jones, P., Burton, A., Ford, A., Fowler, H., Harpham, C., James, P., Smith, A., Wilby, R., 2007. A daily weather generator for use in climate change studies. Environ. Model. Softw. 22 (12), 1705-1719.
Lagabrielle, E., Botta, A., Daré, W., David, D., Aubert, S., Fabricius, C., 2010. Modeling with stakeholders to integrate biodiversity into land-use planning: lessons learned in Réunion Island (Western Indian Ocean). Environ. Model. Softw. 25 (11), 1413-1427.
Lambin, E.F., Geist, H.J. (Eds.), 2006. Land Use and Land Cover Change: Local Processes and Global Impacts. Springer.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N., 2011. Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52 (2), 21-32.
Lei, Z., Pijanowski, B.C., Alexandridis, K.T., Olson, J., 2005. Distributed modeling architecture of a multi-agent-based behavioral economic landscape (MABEL) model. Simulation 81 (7), 503-515.
Loepfe, L., Martínez-Vilalta, J., Piñol, J., 2011. An integrative model of human-influenced fire regimes and landscape dynamics. Environ. Model. Softw. 26 (8), 1028-1040.
Lynch, C., 2008. Big data: how do your data grow? Nature 455 (7209), 28-29.
Mas, J.F., Puig, H., Palacio, J.L., Sosa, A.A., 2004. Modeling deforestation using GIS and artificial neural networks. Environ. Model. Softw. 19 (5), 461-471.
MEA (Millennium Ecosystem Assessment), 2005. Ecosystems and Human Well-being: Current State and Trends. Island Press, Washington, DC.
Merritt, W.S., Letcher, R.A., Jakeman, A.J., 2003. A review of erosion and sediment transport models. Environ. Model. Softw. 18 (8-9), 761-799.
Mishra, V., Cherkauer, K., Niyogi, D., Ming, L., Pijanowski, B., Ray, D., Bowling, L., 2010. Regional scale assessment of land use/land cover and climatic changes on surface hydrologic processes. Int. J. Climatol. 30, 2025-2044.
Moore, N., Torbick, N., Lofgren, B., Wang, J., Pijanowski, B., Andresen, J., Kim, D., Olson, J., 2010. Adapting MODIS-derived LAI and fractional cover into the Regional Atmospheric Modeling System (RAMS) in East Africa. Int. J. Climatol. 30 (3), 1954-1969.
Moore, N., Alagarswamy, G., Pijanowski, B., Thornton, P., Lofgren, B., Olson, J., Andresen, J., Yanda, P., Qi, J., 2011. East African food security as influenced by future climate change and land use change at local to regional scales. Clim. Change. http://dx.doi.org/10.1007/s10584-011-0116-7.
Olson, J., Alagarswamy, G., Andresen, J., Campbell, D., Davis, A., Ge, J., Huebner, M., Lofgren, B., Lusch, D., Moore, N., Pijanowski, B., Qi, J., Thornton, P., Torbick, N., Wang, J., 2008. Integrating diverse methods to understand climate-land interactions in East Africa. GeoForum 39 (2), 898-911.
Pekin, B.K., Pijanowski, B.C., 2012. Global land use intensity and the endangerment status of mammal species. Divers. Distrib. 18 (9), 909-918.
Peralta, J., Li, X., Gutierrez, G., Sanchis, A., 2010 (July). Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Neural Networks (IJCNN), pp. 1-8.
Pérez-Vega, A., Mas, J.F., Ligmann, A., 2012. Comparing two approaches to land use/cover change modeling and their implications for the assessment of biodiversity loss in a deciduous tropical forest. Environ. Model. Softw. 29, 11-23.
Pickett, S.T., Burch Jr., W.R., Dalton, S.E., Foresman, T.W., Grove, J.M., Rowntree, R., 1997. A conceptual framework for the study of human ecosystems in urban areas. Urban Ecosyst. 1 (4), 185-199.
Pielke, R.A., 2005. Land use and climate change. Science 310 (5754), 1625-1626.
Pijanowski, B.C., Long, D.T., Gage, S.H., Cooper, W.E., 1997 (June). A Land Transformation Model: Conceptual Elements, Spatial Object Class Hierarchies, GIS Command Syntax and an Application for Michigan's Saginaw Bay Watershed. Submitted to the Land Use Modeling Workshop, USGS EROS Data Center.