Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf ·...

15
Incorporating GIS Building Data and Census Housing Statistics for Sub-Block-Level Population Estimation Shuo-Sheng Wu Texas State University—San Marcos and U.S. Geological Survey Le Wang University of Buffalo, State University of New York Xiaomin Qiu Missouri State University This article presents a deterministic model for sub-block-level population estimation based on the total building volumes derived from geographic information system (GIS) building data and three census block- level housing statistics. To assess the model, we generated artificial blocks by aggregating census block areas and calculating the respective housing statistics. We then applied the model to estimate populations for sub- artificial-block areas and assessed the estimates with census populations of the areas. Our analyses indicate that the average percent error of population estimation for sub-artificial-block areas is comparable to those for sub-census-block areas of the same size relative to associated blocks. The smaller the sub-block-level areas, the higher the population estimation errors. For example, the average percent error for residential areas is approximately 0.11 percent for 100 percent block areas and 35 percent for 5 percent block areas. Key Words: block population, dasymetric mapping, population estimation, population interpolation, sub-block. En este art´ ıculo se presenta un modelo determinista para el c´ alculo de la poblaci ´ on a nivel de sub-bloque con base en el volumen total de edificios derivado de datos sobre edificios obtenidos con el sistema de informaci ´ on geogr´ afica (geographic information system, GIS) y tres estad´ ısticas sobre vivienda a nivel de bloque del censo. Para evaluar el modelo, generamos bloques artificiales agregando ´ areas de bloques del censo y calculando las estad´ ısticas de vivienda respectivas. Entonces aplicamos el modelo para calcular las poblaciones de las ´ areas de los sub-bloques artificiales y evaluamos los c´ alculos con las poblaciones de las ´ areas del censo. Nuestro an´ alisis indica que el porcentaje medio de error del c ´ alculo de la poblaci ´ on en ´ areas de sub-bloques artificiales es comparable con el de las ´ areas de sub-bloques del censo del mismo tama ˜ no en relaci ´ on con los bloques asociados. Cuanto m ´ as peque ˜ nas sean las ´ areas a nivel de sub-bloque, m ´ as grandes ser ´ an los errores en el c ´ alculo de la poblaci ´ on. Por ejemplo, el porcentaje medio de error en ´ areas residenciales es de aproximadamente 0.11 por ciento en ´ areas con bloques de 100 por ciento, y un 35 por ciento en ´ areas con bloques de cinco por ciento. Palabras clave: poblaci ´ on en bloques, mapeo dasim ´ etrico, c´ alculo de la poblaci ´ on, interpolaci ´ on de la poblaci ´ on, sub-bloque. The Professional Geographer, 60(1) 2008, pages 121–135 C Copyright 2008 by Association of American Geographers. Initial submission, July 2006; revised submissions, November 2006 and April 2007; final acceptance, July 2007. Published by Taylor & Francis, LLC.

Transcript of Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf ·...

Page 1: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

Incorporating GIS Building Data and Census Housing

Statistics for Sub-Block-Level Population Estimation

Shuo-Sheng WuTexas State University—San Marcos and U.S. Geological Survey

Le WangUniversity of Buffalo, State University of New York

Xiaomin QiuMissouri State University

This article presents a deterministic model for sub-block-level population estimation based on the totalbuilding volumes derived from geographic information system (GIS) building data and three census block-level housing statistics. To assess the model, we generated artificial blocks by aggregating census block areasand calculating the respective housing statistics. We then applied the model to estimate populations for sub-artificial-block areas and assessed the estimates with census populations of the areas. Our analyses indicatethat the average percent error of population estimation for sub-artificial-block areas is comparable to thosefor sub-census-block areas of the same size relative to associated blocks. The smaller the sub-block-level areas,the higher the population estimation errors. For example, the average percent error for residential areas isapproximately 0.11 percent for 100 percent block areas and 35 percent for 5 percent block areas. Key Words:block population, dasymetric mapping, population estimation, population interpolation, sub-block.

En este artıculo se presenta un modelo determinista para el calculo de la poblacion a nivel de sub-bloque conbase en el volumen total de edificios derivado de datos sobre edificios obtenidos con el sistema de informaciongeografica (geographic information system, GIS) y tres estadısticas sobre vivienda a nivel de bloque del censo.Para evaluar el modelo, generamos bloques artificiales agregando areas de bloques del censo y calculando lasestadısticas de vivienda respectivas. Entonces aplicamos el modelo para calcular las poblaciones de las areasde los sub-bloques artificiales y evaluamos los calculos con las poblaciones de las areas del censo. Nuestroanalisis indica que el porcentaje medio de error del calculo de la poblacion en areas de sub-bloques artificialeses comparable con el de las areas de sub-bloques del censo del mismo tamano en relacion con los bloquesasociados. Cuanto mas pequenas sean las areas a nivel de sub-bloque, mas grandes seran los errores en el calculode la poblacion. Por ejemplo, el porcentaje medio de error en areas residenciales es de aproximadamente 0.11por ciento en areas con bloques de 100 por ciento, y un 35 por ciento en areas con bloques de cinco por ciento.Palabras clave: poblacion en bloques, mapeo dasimetrico, calculo de la poblacion, interpolacion de lapoblacion, sub-bloque.

The Professional Geographer, 60(1) 2008, pages 121–135 C© Copyright 2008 by Association of American Geographers.Initial submission, July 2006; revised submissions, November 2006 and April 2007; final acceptance, July 2007.

Published by Taylor & Francis, LLC.

Page 2: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

122 Volume 60, Number 1, February 2008

I n the United States, the most fine-grainedcensus population data available to the pub-

lic is at the block level. According to the U.S.Census Bureau (2003), “census blocks are areasbounded on all sides by visible features, suchas streets, roads, streams, and railroad tracks,and by invisible boundaries, such as city, town,township, and county limits, property lines,and short, imaginary extensions of streets androads.” In fact, the sizes of census blocks varygreatly. Using the city of Austin, Texas, as anexample, the boundary on one side of a censusblock can range from fifty meters in downtownto 10 km in the suburb; the population of acensus block can range from zero to 3,300.

For various purposes, people might need toestimate population for areas not coincidingwith census block boundaries or for areassmaller than a census block. For example,floodplains usually do not share the sameboundaries as those of census blocks, yet localgovernments might need to estimate flood-plain populations for flood hazard planning.Redevelopment subdivisions might not havethe same boundaries as those drawn for censusblocks, yet city planners and developers mightneed to estimate the number of local residentsfor planning and resource management pur-poses. Transportation engineers might need toestimate populations within a half-mile bufferof a proposed railroad or highway to assessthe potential impact. Demographers mightneed to estimate sub-block-level population insuburban areas where census blocks are usuallylarge for studying urban sprawl.

This article presents a deterministic modelfor sub-block-level population estimation. Themodel can estimate population for an arbitraryarea based on total building volumes withinthe area and three housing statistics for thearea. Using building volumes derived fromgeographic information system (GIS) build-ing data and housing statistics derived fromthe U.S. Census 2000 block-level data forour case study area in Austin, Texas, we ap-plied a simulation approach to assess the modelfor sub-block-level population estimation. Thesimulation approach generates artificial blocksby aggregating census block areas, calculateshousing statistics for the artificial blocks, esti-mates populations for sub-artificial-block areasbased on the deterministic model, and com-pares the estimates with census populations of

the areas. Using the simulation approach, wealso assessed whether the average percent errorof population estimation for sub-block-level ar-eas of the same size relative to associated blocksvaries with the block size, so that we can ap-ply the error statistics based on artificial blocksto those based on census blocks. Furthermore,we assessed how the rescaling for block pop-ulation preservation improves sub-block-levelestimates, and how incorporating land use in-formation affects the estimation accuracy. Theresults of this study have implications for re-searchers and practitioners who need popula-tion estimates at finer spatial resolution thanthe census block as well as accuracy assessmentfor the estimates.

Review of Past Population

Estimation Studies

Past population estimation studies can begrouped into three categories depending onthe data required for input in the estimation.The first category estimates areal populationfrom census zone-based population data us-ing certain mathematical interpolation func-tions. The second category infers populationfrom population-relevant physical or socioeco-nomic variables. The third category disaggre-gates census unit populations into zones that arepartitioned by population-relevant variables.The first category of studies requires only cen-sus population data as the input. The secondcategory of studies requires only population-relevant variables as the input. The last cate-gory of studies requires both census populationdata and population-relevant variables as theinput, and therefore the resulting estimates aregenerally more reliable.

Mathematical interpolation of censuspopulation can be divided into point-basedapproaches and area-based approaches (Lam1983). In point-based approaches, a controlpoint is assigned to represent each zone-basedcensus unit and its associated population.Then, using a certain mathematical functionof interpolation, a grid map of population isgenerated with grid point values estimatedfrom the control points (e.g., Martin 1989,1996; Bracken 1991). In contrast, area-basedapproaches directly use census zones as theunit of operation and transform the original

Page 3: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 123

zone-based population data to a representationof fine grids based on certain mathematicalfunctions (e.g., Tobler 1979; Rase 2001).

Population counts can be inferred from re-lated variables. Depending on the scale of esti-mation, past studies have estimated populationsfrom urban areas (Tobler 1969; Lo and Welch1977; Prosperie and Eyton 2000), land use ar-eas (Kraus, Senger, and Ryerson 1974; We-ber 1994; Lo 2003), dwelling unit counts (Hsu1971; Lo and Chan 1980; Lo 1989), image pixelstatistics (Webster 1996; Harvey 2002a; Liu,Clarke, and Herold 2006), and other relevantphysical or socioeconomic variables (Green andMonier 1959; Dobson et al. 2000; Liu andClarke 2002).

Census unit populations can be disaggre-gated into homogeneous zones delineated byspatial variables that are related to populationdistribution. This approach can be referred toas the dasymetric mapping method (Robin-son et al. 1995). The most commonly usedvariable for census population disaggregationis land use and land cover data (e.g., Yuan,Smith, and Limp 1997; Mennis 2003; Holt,Lo, and Hodler 2004). Other variables thathave been used include topography (Wright1936), election district demographic statistics(e.g., Flowerdew and Green 1989, 1991), roadnetworks (e.g., Xie 1995; Hawley and Moeller-ing 2005; Reibel and Bufalino 2005), remotesensing image spectral and textural statistics(Harvey 2002b; Liu, Clarke, and Herold 2006;Wu, Qiu, and Wang 2006), and other relevantphysical or socioeconomic variables (e.g., Dob-son et al. 2000; Liu and Clarke 2002).

To disaggregate census unit populationbased on relevant variables, researchers needto establish a mathematical relationship be-tween population counts and the variables. Forexample, when disaggregating census popula-tion based on the land use variable, researchershave to determine the population density foreach land use class or the population densityratio between land use classes before redis-tributing census unit population to differentland use zones. The mathematical relationshipbetween population counts and relevant vari-ables can be established from sampling (e.g.,Mennis 2003), from regression analysis (e.g.,Yuan, Smith, and Limp 1997; Wu, Qiu, andWang 2006), or based on domain knowledgeof researchers (e.g., Eicher and Brewer 2001).

Methods for Population Estimation

and Assessment

This study presents a model to estimate pop-ulation for small, sub-block areas based on thebuilding volume variable derived from build-ing footprint GIS data. The model can beapplied under the context of the second andthird categories of population estimation re-viewed previously. We evaluated both the orig-inal model estimates and the rescaled modelestimates (for census population preservation)that correspond to the two categories of pop-ulation estimation, respectively. Compared toother population-relevant variables, buildingvolumes have a straightforward and meaningfulrelationship with population counts, and build-ing footprints provide a direct and accuraterepresentation of where people are. Further-more, to connect building volumes to popu-lation counts, we incorporated three housingstatistics that are available from the census dataand present a deterministic model:

Pop = BdV/HuSpace∗OccRate∗HdSize, (1)

where Pop = population (the number of peo-ple), BdV = building volumes (e.g., cubic feet),HuSpace = average space per housing unit(e.g., average cubic feet per housing unit),OccRate = housing unit occupancy rate (per-centage), and HdSize = average household size(average number of persons per household).Equation (1) states that when the total buildingvolume within an area is divided by the aver-age space per housing unit of the area, the de-rived figure is the total number of housing unitswithin the area. Then, when the total numberof housing units is multiplied by the occupancyrate of the area, the derived figure is the totalnumber of households within the area. Further,when the total number of households is multi-plied by the average household size of the area,the derived figure is the total population withinthe area.

The deterministic model by nature is capableof estimating population for an arbitrary areabased on the total building volume and housingstatistics of the area. To estimate population forsub-block areas, we used housing statistics atthe block level so that the relationship betweenbuilding volumes and population counts canbe locally specified for each census block. For

Page 4: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

124 Volume 60, Number 1, February 2008

block-level population estimation, the modelshould provide a very close estimate of theactual block populations. However, when themodel is applied to estimate sub-block-levelpopulation, there are likely higher errors dueto the heterogeneous housing statistics withinblocks.

To assess the deterministic model, we firstused the model to estimate populations for720 residential census blocks and assessed theestimates as a benchmark of accuracy. We thenadopted a simulation approach to infer how ac-curately the model would estimate populationfor sub-census-block areas. The simulation ap-proach first generated artificial blocks by com-bining multiple census-block areas and derivingtheir respective housing statistics. Figure 1 il-

lustrates this approach, in which every twentyblock areas are aggregated as artificial blocks.The average household size for artificial blockswas calculated by dividing the total populationswithin the artificial block by the total numberof households. The household occupancy ratewas calculated by dividing the total number ofhouseholds within the artificial block by the to-tal number of housing units. The average spaceper housing unit was calculated in a similar fash-ion so that the housing statistic is the actualcensus figure.

After artificial blocks were generated, thesimulation approach estimated population forsub-artificial-block areas based on the deter-ministic model. Then the estimates were com-pared with census population of the areas for

Figure 1 Artificial blocks by every twenty-census-block aggregation.

Page 5: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 125

accuracy assessment. It is worth noting thatcensus-defined blocks and the artificially gen-erated blocks are in essence the same entitywith the same attributes of respective block-level housing statistics and the perceived mean-ing of homogeneity when used as a basis to es-timate their subarea populations.

To assess how the accuracy of sub-artificial-block estimates varies with the artificial blocksize, and also to make a connection of theaccuracy assessment from artificial blocks tocensus blocks for extrapolating the trends ofaccuracy and error statistics, we comparedthe accuracy statistics for sub-artificial-blockareas of the same size relative to artificialblocks (e.g., 50 percent artificial-block areas)based on different artificial-block sizes. Fur-thermore, we assessed how the rescaling ofsub-(artificial-)block estimates for preserving(artificial-)block populations affects the estima-tion accuracy by applying a single scaling factorto all sub-(artificial-)block estimates within thesame (artificial) block. To test if available landuse information can improve sub-block-levelpopulation estimation, we assessed sub-blockestimates for mixed-land-use blocks and com-pared them with those for residential blocks.

Compared to past population estimationstudies, this study is unique in three aspects.First, we inferred areal population from build-ing footprints and associated building volumes.A spatial unit of building area is a fine spa-tial unit that allows demarcation of popula-tion concentration areas at the sub-block level.A building is also an integral unit for count-ing population because people live and workin buildings. In addition, for the same type ofbuildings or buildings of the same land use,population counts are proportional to buildingvolumes. Although the commonly used spectraland textural statistics from high resolution (e.g.,1-m resolution) remote sensing images offer anopportunity to infer population counts at a finespatial scale, relevant studies (Liu, Clarke, andHerold 2006; Wu, Qiu, and Wang 2006) havenot shown results as satisfactory as this study.

Second, this study incorporated census hous-ing statistics to establish a mathematical rela-tionship between population counts and thebuilding volume variable. In contrast to sam-pling, regression analysis, or relying on do-main knowledge to establish a mathematicalrelationship between population counts and

population-relevant variables (reviewed in theprevious section), incorporating census-block-level statistics allows us to specify the relation-ship between population counts and buildingvolumes locally by each census block instead ofglobally across the entire data set. In addition,census data are readily available to the publicand are deemed quite reliable.

Finally, past population estimation stud-ies usually build population estimation mod-els from higher level census population data(which are aggregated from lower level censusdata) and then assess the models from lowerlevel census data. For example, population es-timation models might be built from census-tract-level data, then census-block-group-leveldata are used to verify the models (Eicher andBrewer 2001; Lo 2003). In this study, ourpopulation estimation model (the determinis-tic model) is based on the most fine-grainedcensus data at the census-block level for thepurpose of estimating sub-block-level popula-tions. Then we adopted a simulation approachto assess the sub-block-level estimates. Popu-lation estimation models based on block-levelstatistics are more accurate than those based onhigher level statistics because block-level statis-tics are closer to sub-block-level statistics thanstatistics from block groups or census tracts.

Study Area and Data Source

We selected an area of approximately 6 × 14 kmin the north central part of the city of Austin,the capital of Texas, as our study area. The city’sthree main thoroughfares, IH-35, MoPac, andHighway 183 run through it (Figure 2). Thearea has a total of 1,337 census blocks with avariety of residential and nonresidential landuse, old and new neighborhoods, and housingpatterns (Figure 3).

Data sets we used in this study include cen-sus geographic and demographic data, build-ing footprints data, aerial photographs, eleva-tion data, and land use data. The data are allfrom the year 2000. The census data were ob-tained from the U.S. Census Bureau’s Ameri-can FactFinder Web site (U.S. Census Bureau2006). The other three data sets were obtainedfrom the City of Austin Neighborhood Plan-ning and Zoning Department (NPZD), eitherdirectly downloaded from their File Transfer

Page 6: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

126 Volume 60, Number 1, February 2008

Figure 2 Study area in Austin, Texas.

Protocol (FTP) server (City of Austin 2005a)or acquired through personal contact.

In addition to census block geographies andpopulations, three block-level housing statisticswere directly downloaded or computed fromrelevant census statistics and GIS data: housingunit occupancy rate, average household size,and average space per housing unit.

The building footprints are in vectorpolygon format. NPZD itself only has build-ing footprints for the years 1997 and 2003available. Nevertheless, we generated building

footprints for the year 2000 for the core partof the city by comparing data sets of the twoavailable years and referencing with aerialphotographs from the year 2000. Specifically,for building footprints that do not changebetween the two years, we assume that theyalso exist for the year 2000. For the buildingfootprints that are inconsistent between thetwo years’ data sets, we visually referencedwith the high-spatial-resolution (0.61 m) aerialphotographs to decide which year’s data setto follow. For those inconsistent building

Page 7: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 127

Figure 3 Land use in the study area.

footprints that we had to visually check, aerialphotographs always matched either one of thetwo building data sets.

The building footprint data contain the av-erage altitude information for individual build-ing roofs. We inferred the building height fromthe elevation data of the ground surface. Thebuilding footprint data and the elevation datawere both generated by Analytical Surveys In-corporated (ASI), which contracted with thecity. ASI first manually digitized building foot-prints from high-resolution aerial photographs.

Then, by referencing with digital terrains gen-erated from remote sensing light detectionand ranging (LIDAR) data (0.61-m spatialresolution), ASI estimated the altitude for in-dividual building roofs as well as the groundsurface. The elevation data of the ground sur-face is in 0.61-m (2-feet) contour line format.We transferred it to grid format and estimatedthe average ground surface elevation for indi-vidual building footprints. Then, by subtractingground surface elevation from building roof al-titude, we estimated the height for individual

Page 8: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

128 Volume 60, Number 1, February 2008

buildings. Individual building volumes werethen derived by multiplying the building foot-print area with the building height.

It is worth noting that building volumes canalso be inferred from appraisal district tax par-cel data that contain information about the totalacreage of building areas categorized by theirfloor numbers. However, the building area in-formation is summarized by the entire par-cel. Because many multifamily land use parcelsare actually entire census blocks, the buildinginformation from appraisal districts is not suit-able for sub-block-level population estimation.Our land use data are in vector polygon format,generated and updated by the NPZD based on avariety of sources, including historical land usedata, Travis Central Appraisal District (TCAD)tax parcel data, the city parcels database, naturalpreserves GIS data, aerial photographs, build-ing footprint data, and field check information(City of Austin 2005b). The land use data havea spatial unit of tax parcels and are consideredquite accurate and reliable.

Assessing Block-Level Population

Estimation

We first assessed how the deterministic modelestimates populations for census blocks withinresidential land use areas, specifically single-family and multifamily land use areas, whichmake up more than 92 percent of the totalresidential land use areas within the city limit(City of Austin 2005a). By referencing the landuse data, we identified 650 single-family blockswithin the study area. We further picked 600single-family blocks that are more connected ingeographical boundaries, so that nearby blockscan be more intuitively aggregated into arti-ficial blocks in the simulation analysis. As formultifamily blocks, only thirty-four of them arefound in the study area. To increase the num-ber of samples, we searched the entire Austinarea and selected a total of 120 blocks that areentirely or mostly within multifamily land use.

After 720 residential blocks were selected,we calculated the total building volumes for in-dividual blocks by overlaying with the buildingfootprint data. The deterministic model (Equa-tion [1]) was then applied to estimate popula-tions for individual sample blocks. We com-pared the model estimates with the actual blockpopulations from the census and calculated the

average percent error as:

average percent error

= 1m

m∑

i=1

|Pi − Yi |Yi

× 100% (2)

where Pi is the model-estimated population forthe ith census block, Yi is the reported censuspopulation for the ith census block, and m isthe number of census blocks under investiga-tion. The average percent error gives the aver-age percent of the original census block popu-lation that is underestimated or overestimated.A smaller average percent error indicates moreaccurate estimates from the model. Because theerror statistic is simple and straightforward, it issuitable for us to compare estimation accuracyunder different contexts regarding block size,rescaling adjustment, and land use information.

We calculated the average percent error ofpopulation estimation for the 600 single-familyblocks and the 120 multifamily blocks, respec-tively (Table 1). The results are quite satisfac-tory, with the average percent error less than0.15 percent for both land use types. Multifam-ily blocks have higher estimation errors thansingle-family blocks. By examining the stan-dard deviation of census-block-level popula-tion, we observe that the multifamily blockshave a more varied population distribution thanthat of single-family blocks, which might causemore uncertainty and errors in estimating pop-ulation for multifamily blocks.

Assessing Sub-Block-Level

Population Estimation

For sub-block-level population estimation, thedeterministic model is likely to produce higher

Table 1 The average percent error of populationestimates and the standard deviation of censuspopulation for 600 single-family blocks and 120multifamily blocks

600 single- 120 multi-family blocks family blocks

Average percent error ofpopulation estimates(%)

0.10 0.14

Standard deviation ofcensus population(persons)

34 573

Page 9: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 129

errors than for block-level estimation becausehousing statistics are not uniform withinindividual blocks and block-level housingstatistics cannot be well applied to sub-blockareas. Given that sub-block-level populationsare unavailable to assess sub-block estimatesfrom the deterministic model, we adopted asimulation approach for the sub-block-levelestimate assessment. Specifically, we firstgenerated artificial blocks by aggregating everytwenty neighboring or nearby census-blockareas. The three housing statistics of Hu-Space, OccRate, and HdSize for the artificialblocks were also derived from the census data.Then, within each artificial block, we aggre-gated census-block areas of every two to everynineteen blocks, respectively, as sub-artificial-block areas. Populations for the sub-artificial-block areas were estimated based on thedeterministic model. Finally, model estimatesfor sub-artificial-block areas were comparedwith census populations of the areas for accu-racy assessment. We performed the simulationprocesses for the 600 single-family blocks,the 120 multifamily blocks, and the combined720 residential blocks, respectively, and fordifferent sizes of sub-artificial-block areas. Theaverage percent error for the same size of sub-artificial-block areas relative to the associatedartificial blocks are graphed in Figures 4 and

5, in which sub-artificial-block areas are rep-resented as percentages of associated artificialblocks. For example, a sub-artificial-block areafrom an aggregation of ten census-block areasis 50 percent of its associated artificial block,which is an aggregation of twenty census-blockareas. Figures 4 and 5 show that the averagepercent error has an overall increasing trendwith decreasing sub-artificial-block areas.Multifamily areas have higher errors anduncertainties (represented by error bounds)of population estimation than those forsingle-family areas. For example, populationestimation for the 50 percent artificial-blockareas has approximately 18 percent averagepercent error, with 9 percent variation formultifamily areas, and approximately 8 percentwith 4 percent variation for single-family areas.The higher errors for multifamily areas aredue to the more heterogeneous populationand housing characteristics than those forsingle-family areas.

Assessing Sub-Block Estimates at

Different Block Sizes

To investigate how the average percent errorof population estimates for the same size ofsub-artificial-block areas (relative to artificial

Figure 4 The average percent error of population estimates for sub-artificial-block areas of single-family(Sf), multifamily (Mf), and combined residential (Sf + Mf) land use.

Page 10: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

130 Volume 60, Number 1, February 2008

Figure 5 Upper bound (UB), middle bound (MB), and lower bound (LB) of the average percent error ofpopulation estimates for sub-artificial-block areas of single-family (SF) and multifamily (MF) land use.

blocks) vary with the artificial-block size,we generated artificial blocks from the 720residential blocks based on every twenty to twocensus-block aggregations (twenty-block ag-gregations, nineteen-block aggregations, . . . ,two-block aggregations) and assessed popula-tion estimates for the 50 percent artificial-blockareas for all aggregation schemes. A graph ofthe average percent error against the artificial-block size is graphed in Figure 6, in which theartificial-block size is represented as the num-ber of aggregated block areas. Figure 6 showsthat the average percent error of populationestimation for the 50 percent artificial-blockareas does not increase or decrease consistentlywith the artificial-block size. It indicates thatpopulation estimation errors for the same size

of sub-artificial-block areas would be similar(with a variation of 3 percent) regardlessof the artificial-block size. Therefore, whenextrapolating the constant error trend tosmaller artificial blocks, such as one censusblock, we can logically infer the populationestimation errors for sub-census-block areas.In other words, the derived error graphs forsub-artificial-block areas (Figures 4 and 5) canbe applied for sub-census-block areas.

Assessing Effects of Block Population

Preservation on Sub-Block Estimates

In the context of census-unit population dis-aggregation, rescaling subunit estimates is a

The average percent error of popu-lation estimates for the 50 percentartificial-block areas of different arti-ficial block sizes (Bk = blocks). Fig-

ure 6

Page 11: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 131

standard population estimation procedure, inwhich individual census-unit populations arepreserved; that is, the summed total of rescaledsubunit estimates within a census unit is equalto the original census-unit population. Therescaled subunit estimates are more reliablebecause of the population preservation. Toinvestigate whether and how (artificial-)blockpopulation preservation improves populationestimation for different sizes of sub-(artificial-)block areas, we compared the estimation errorsbefore and after the rescaling. Block popula-tion preservation can be achieved by applying asingle scaling factor to all sub-block estimateswithin the same block. A scaling factor is there-fore the ratio between the “true” populationof a block and the summed total of the sub-block estimates. From another point of view, itis a transfer coefficient or correction factor fora sub-block estimate that is multiplied to yieldan adjusted sub-block estimate. After the rescal-ing procedure, population estimates for blockareas will be the accurate estimates, whereasestimates for sub-block areas will still haveerrors if the sub-block areas do not have thesame housing statistics (and associated scalingfactors) as those of the associated blocks, whichis the likely situation.

Previously we estimated sub-block-levelpopulation for residential areas and assessed theestimates (Figure 4). We then rescaled the sub-block estimates and calculated the error statis-tic. Specifically, we first summed up sub-blockestimates to respective block boundaries. Wethen divided the summed figures by the actualblock populations (from the census) to obtain a

single scaling factor for all sub-block estimateswithin each block. The scaling factors were ap-plied to upscale or downscale sub-block esti-mates, and the rescaled estimates were com-pared to actual populations of the sub-blockareas to derive error statistics. Finally, we com-pared the error statistics of rescaled sub-blockestimates with those of original sub-block esti-mates. The results show that the rescaling pro-cedure improves population estimation for allsizes of sub-block areas and not in a consistenttrend (Figure 7). The rescaling particularly im-proves population estimation for the 100 per-cent block areas due to the fact that the rescaledestimates for the 100 percent block areas arenow the accurate estimates.

Assessing Sub-Block Estimates

When Land Use Information Is Not

Available

Residential land use information is not alwaysavailable when estimating sub-block-level pop-ulations. To assess the sub-block estimates ofmixed land use, we applied our simulationprocedures to a total of 1,320 census blocksthat contain residential and nonresidential landuse in our study area. Specifically, we firstgenerated artificial blocks by aggregating ev-ery twenty neighboring census blocks and cal-culating the housing statistics for the artifi-cial blocks. Populations for sub-artificial-blockareas of different sizes were then modeled,rescaled, and assessed using the procedures aspreviously described. We compared the errorstatistics for mixed-land-use areas with those

Figure 7 The improvement of average percent error of population estimation for sub-artificial-blockareas after the rescaling for preserving artificial-block populations.

Page 12: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

132 Volume 60, Number 1, February 2008

Figure 8 Average percent error of population estimation for sub-artificial-block areas of mixed land useand residential land use.

for residential-land-use areas. The results showthat mixed-land-use areas have higher sub-block-level population estimation errors, par-ticularly for small sub-block areas (Figure 8).The reason for higher errors in mixed-land-useareas is their relatively heterogeneous popula-tion and housing characteristics. From anotherpoint of view, some of the buildings in mixed-land-use areas are nonresidential buildings thatdo not contain residents, leading to inaccuratepopulation estimation.

Discussion

The block-level estimates in this study haveaverage percent errors less than 0.15 percent.Compared to a relevant study by Wu, Qiu, andWang (2006) that inferred block-level popu-lation from image textural statistics and landuse information with resulting average per-cent errors larger than 11 percent, this studyhas great improvements. As a matter of fact,Wu, Qiu, and Wang (2006) obtained rela-tively high regression coefficients (R2 ≥ 0.67)between population and pixel statistics of high-resolution images to use for population esti-mation compared to other similar studies (e.g,Liu, Clarke, and Herold 2006). The compar-ison indicates that inferring fine-scale popu-lation based on building volumes and cen-

sus housing statistics will be more accuratethan estimates based on (conventional) vari-ables of image pixel statistics or land useinformation.

A common conclusion drawn from past pop-ulation estimation studies is that population es-timation for small areas often has higher errorsthan for large areas (Lo 1995; Harvey 2002a).This study also has similar findings. The de-terministic model estimates block-level popu-lations with a high degree of accuracy, but theestimation errors become higher for smallersub-block areas. For example, the average per-cent error of population estimation for residen-tial areas is approximately 0.11 percent for 100percent block areas, 15 percent for 50 percentblock areas, and 35 percent for 5 percent blockareas (Figure 4). A potential way to improvethe model estimates for small sub-block areas isto obtain local, sub-block-level housing statis-tics to use in the model. They can be obtainedthrough field surveys.

There are a variety of housing patterns andcharacteristics in different cities and differ-ent regions. For example, a new developmentin a fast-growing city might contain manylarge multifloored apartments arranged rela-tively densely, and the occupancy rate is rel-atively low. On the other hand, an old resi-dential neighborhood in an old city might havesmall houses with large yard spaces, and the

Page 13: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 133

occupancy rate is relatively high. Neverthe-less, the presented deterministic model is a ro-bust model that can be applied to varied U.S.cities because it makes use of census-block-levelhousing statistics.

Sub-block-level population estimation willbe more accurate when local census blocksare at sufficient spatial resolution to capturethe variation of local housing patterns. For ex-ample, cities with well-defined zoning regula-tions generally have more homogeneous andlarge-patch housing patterns, and census blockswithin the city limits will be relatively small andable to capture the variation of local housingpatterns. As a result, sub-block-level populationestimates based on block-level housing statis-tics will be relatively accurate. On the otherhand, in fast-growing suburban areas, censusblocks are generally large compared to housingpattern patches, and sub-block-level estimatesbased on block-level housing statistics will havehigher errors. Because sub-block populationestimation errors will vary between cities andbetween different regions of a city dependingon the homogeneity of local land use, housing,and population distribution patterns, the errorgraphs for sub-block-level population estimatesderived in this study might not be applicable toother cities and regions. Researchers will needto recalculate the error graphs of sub-block esti-mates using the proposed simulation approach.

A limitation of the presented fine-scale pop-ulation estimation model is that it relies onbuilding footprints and associated building vol-umes for model input. In the past, no effectiveway of automatically extracting building areasand heights existed. Researchers mainly relyon manually identifying and counting dwellingunits from high-spatial-resolution aerial pho-tographs, even though visual interpretation islaborious and time consuming. The buildingfootprints and building volume data used inthis study also involve intensive human inter-pretation and manual work in the derivationprocess. With the advance of very high-spatial-resolution satellite images, such as IKONOSand QuickBird, and the improvement of fea-ture extraction techniques, automatic extrac-tion of dwelling units from satellite images hasbecome possible ( Jin and Davis 2005; Kim,Lee, and Kim 2006). Another prospect for au-tomatic building extraction has come with theadvancement of three-dimensional object ex-

traction techniques from LIDAR data (Chen2007). Building footprints and volumes ex-tracted from LIDAR have shown great accu-racy improvements in recent years (Forlani etal. 2006; Zhang, Yan, and Chen 2006). Withthese new remote sensing data and buildingextraction techniques, building footprints andbuilding volumes will become more availablefor population estimation.

Another limitation regarding data source isthat the model relies on existing census hous-ing statistics. In other words, if the model isto be applied for noncensus years, an assump-tion regarding the timeliness of input data mustbe made or additional up-to-date data must beacquired. For example, if we want to infer fine-scale populations for the year 2006 using thedeterministic model, we can either use hous-ing statistics from the Census 2000 data orcollect up-to-date housing statistics for modelinput.

There could be several sources of error in thisstudy due to data quality and accuracy issues,such as spatial misalignment between censusblock geographies and building footprint data,census data miscounting, and houses unoccu-pied or otherwise under construction. How-ever, when the sub-block-level estimates areconstrained by census block totals, all errors arealso constrained within block-level estimates.Therefore, errors due to data quality and ac-curacy issues will only have an impact on pop-ulation mapping and estimation for sub-blockareas.

Conclusions

This article presents a deterministic model forsub-block-level population estimation based onGIS building volume data and three censusblock-level housing statistics, including the av-erage space per housing unit, the housing unitoccupancy rate, and the average household size.Model estimates for sub-block-level popula-tions are assessed using a simulation approachby generating artificial blocks that are arealaggregations of census blocks and have cor-responding housing statistics. The simulation-based assessment further shows that the aver-age percent error of population estimation forsub-block areas of the same size relative to theassociated blocks is constant regardless of the

Page 14: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

134 Volume 60, Number 1, February 2008

block size. Therefore, it is reasonable to in-fer the error statistic based on (larger) artifi-cial blocks to those based on (smaller) censusblocks. The results show that the smaller thesub-block areas, the higher the estimation er-rors. For example, the average percent errorfor residential land use areas is approximately0.11 percent for the 100 percent block areas, 15percent for 50 percent block areas, and 35 per-cent for 5 percent block areas. Furthermore,our analyses show that the rescaling of sub-block estimates for block population preser-vation improves population estimates for allsub-block areas, and the improvements are notrelated to the sizes of sub-block areas. Popula-tion estimates have higher errors for multifam-ily areas than for single-family areas, and formixed-land-use areas than for residential areas,particularly for small sub-block areas, due tothe more heterogeneous housing characteris-tics and population distributions of multifamilyand mixed-land-use areas, respectively. As a re-sult, detailed land use information will helpwith more accurate and reliable sub-block-levelpopulation estimation. �

Literature Cited

Bracken, I. 1991. A surface model approach to smallarea population estimation. Town Planning Review62 (2): 225–37.

Chen, Q. 2007. Airborne Lidar data processing andinformation extraction. Photogrammetric Engineer-ing and Remote Sensing 73 (2): 109–12.

City of Austin. 2005a. City of Austin GISdata sets. ftp://coageoid01.ci.austin.tx.us/GIS-Data/Regional/coa gis.html (last accessed 6 June2006).

———. 2005b. Land use survey methodology.http://www.ci.austin.tx.us/landuse/survey.htm(last accessed 6 June 2006).

Dobson, J. E., E. A. Bright, P. R. Coleman, R. C.Durfee, and B. A. Worley. 2000. LandScan: Aglobal population database for estimating popu-lations at risk. Photogrammetric Engineering and Re-mote Sensing 66 (7): 849–57.

Eicher, C. L., and C. A. Brewer. 2001. Dasymetricmapping and areal interpolation: Implementationand evaluation. Cartography and Geographic Infor-mation Science 28 (2): 125–38.

Flowerdew, R., and M. Green. 1989. Statistical meth-ods for inference between incompatible zonal sys-tems. In Accuracy of spatial databases, ed. M. Good-child and S. Gopal, 239–47. London: Taylor &Francis.

———. 1991. Data integration: Statistical methodsfor transferring data between zonal systems. InHandling geographical information: Methodology andpotential applications, ed. I. Masser and M. Blake-more, 38–54. New York: Wiley.

Forlani, G., C. Nardinocchi, M. Scaioni, and P. Zin-garetti. 2006. Complete classification of raw LI-DAR data and 3D reconstruction of buildings. Pat-tern Analysis and Applications 8 (4): 357–74.

Green, N. E., and R. B. Monier. 1959. Aerial pho-tographic interpretation of the human ecology ofthe city. Photogrammetric Engineering 25:770–73.

Harvey, J. T. 2002a. Estimating census district pop-ulations from satellite imagery: Some approachesand limitations. International Journal of RemoteSensing 23 (10): 2071–95.

———. 2002b. Population estimation models basedon individual TM pixels. Photogrammetric Engi-neering and Remote Sensing 68 (11): 1181–92.

Hawley, K., and H. Moellering. 2005. A compara-tive analysis of areal interpolation methods. Car-tography and Geographic Information Science 32 (4):411–23.

Holt, J. B., C. P. Lo, and T. W. Hodler. 2004.Dasymetric estimation of population densityand areal interpolation of census data. Cartog-raphy and Geographic Information Science 31 (2):103–21.

Hsu, S. Y. 1971. Population estimation. Photogram-metric Engineering 37:449–54.

Jin, X., and C. H. Davis. 2005. Automated build-ing extraction from high-resolution satellite im-agery in urban areas using structural, contextual,and spectral information. Eurasip Journal on Ap-plied Signal Processing 2005 (14): 2196–206.

Kim, T., T. Lee, and K. Kim. 2006. Semiauto-matic building line extraction from Ikonos imagesthrough monoscopic line analysis. PhotogrammetricEngineering and Remote Sensing 72 (5):541–49.

Kraus, S. P., L. W. Senger, and J. M. Ryerson. 1974.Estimating population from photographically de-termined residential land use types. Remote Sensingof Environment 3 (1): 35–42.

Lam, N. 1983. Spatial interpolation methods: A re-view. The American Cartographer 10 (2): 129–49.

Liu, X., and K. C. Clarke. 2002. Estimation of res-idential population using high resolution satelliteimagery. In Proceedings of the 3rd Symposium in Re-mote Sensing of Urban Areas, June 11–13, 2002, ed.D. Maktav, C. Juergens, and F. Sunar-Erbek, 153–60. Istanbul, Turkey: Istanbul Technical Univer-sity Press.

Liu, X., K. C. Clarke, and M. Herold. 2006. Pop-ulation density and image texture: A comparisonstudy. Photogrammetric Engineering & Remote Sens-ing 72 (2): 187–96.

Lo, C. P. 1989. A raster approach to populationestimation using high-altitude aerial and space

Page 15: Incorporating GIS Building Data and Census Housing ...lewang/pdf/lewang_sample17.pdf · geogr´afica (geographic information system, GIS) y tres estad ´ısticas sobre vivienda a

GIS Building Data and Housing Statistics for Sub-Block-Level Population Estimation 135

photographs. Remote Sensing of Environment 27 (1):59–71.

———. 1995. Automated population and dwellingunit estimation from high-resolution satelliteimages—A GIS approach. International Journal ofRemote Sensing 16 (1): 17–34.

———. 2003. Zone-based estimation of popula-tion and housing units from satellite-generatedland use/land cover maps. In Remotely sensed cities,ed. V. Mesev, 157–80. New York: Taylor &Francis.

Lo, C. P. and H. F. Chan. 1980. Rural population es-timation from aerial photographs. PhotogrammetricEngineering and Remote Sensing 46 (3): 337–45.

Lo, C. P. and R. Welch. 1977. Chinese urban popula-tion estimates. Annals of the Association of AmericanGeographers 67 (2): 246–53.

Martin, D. 1989. Mapping population data from zonecentroid locations. Transactions of the Institute ofBritish Geographers 14 (1): 90–97.

———. 1996. An assessment of surface and zonalmodels of population. International Journal of Geo-graphical Information Systems 10 (8): 973–89.

Mennis, J. 2003. Generating surface models of pop-ulation using dasymetric mapping. The ProfessionalGeographer 55 (1): 31–42.

Prosperie, L., and R. Eyton. 2000. The relationshipbetween brightness values from a nighttime satel-lite image and Texas county population. The South-western Geographer 4:16–29.

Rase, W. 2001. Volume-preserving interpolation of asmooth surface from polygon-related data. Journalof Geographical Systems 3 (2): 199–213.

Reibel, M., and M. E. Bufalino. 2005. Street-weighted interpolation techniques for demo-graphic count estimation in incompatible zone sys-tems. Environment and Planning A 37 (1): 127–39.

Robinson, A., J. Morrison, P. Muehrcke, A. Kimer-ling, and S. Guptill. 1995. Elements of cartography.New York: Wiley.

Tobler, W. R. 1969. Satellite confirmation of settle-ment size coefficients. Area 1 (3): 30–34.

———. 1979. Smooth pyncophylactic interpolationfor geographical regions. Journal of the AmericanStatistical Association 74 (367): 519–30.

U.S. Census Bureau. 2003. 2000 Census of pop-ulation and housing selected appendixes. http://www.census.gov/prod/cen2000/phc-2-a.pdf (lastaccessed 15 October 2006).

———. 2006. The U.S. Census AmericanFactFinder. http://factfinder.census.gov/home/saff/main.html? lang=en (last accessed 6 June2006).

Weber, C. 1994. Per-zone classification of urban landuse cover for urban population estimation. In Envi-

ronmental remote sensing from regional to global scales,ed. G. M. Foody and P. J. Curran, 142–48. NewYork: Wiley.

Webster, C. J. 1996. Population and dwelling unit es-timation from space. Third World Planning Review18 (2): 155–76.

Wright, J. K. 1936. A method of mapping densities ofpopulation. The Geographical Review 26 (1): 103–10.

Wu, S., X. Qiu, and L. Wang. 2006. Using semi-variance image texture statistics model populationdensities. Cartography and Geographic InformationScience 33 (2): 127–40.

Xie, Y. 1995. The overlaid network algorithms forareal interpolation problem. Computers Environ-ment and Urban Systems 19 (4): 287–306.

Yuan, Y., R. M. Smith, and W. F. Limp. 1997. Re-modeling census population with spatial informa-tion from LandSat TM imagery. Computers, Envi-ronment and Urban Systems 21 (3–4): 245–58.

Zhang, K., J. Yan, and S. Chen. 2006. Automaticconstruction of building footprints from airborneLIDAR data. IEEE Transactions on Geoscience andRemote Sensing 44 (9): 2523–33.

SHUO-SHENG WU is a Program Faculty inthe Department of Geography at Texas StateUniversity—San Marcos, 601 University Drive,San Marcos, TX 78666, and a UCGIS Postdoc-toral Fellow at the U.S. Geological Survey, 1400Independence Road, Rolla, MO 65401. E-mail:[email protected]. His research interests include GIS,remote sensing, and spatial statistics applications inland use classification, population estimation, hazardrisk assessment, and watershed pollution modeling,as well as scale and resolution issues in spatial dataanalysis.

LE WANG is an Assistant Professor in the Depart-ment of Geography at the University of Buffalo,State University of New York, Wilkeson Quad,Buffalo, NY 14261. E-mail: [email protected] research interests include development ofnew methods for remote sensing, mangrove forestcharacterization, invasive species modeling, andurban population estimation.

XIAOMIN QIU is an Assistant Professor in theDepartment of Geography, Geology, and Plan-ning at Missouri State University, 901 S. Na-tional Avenue, Springfield, MO 65897. E-mail:[email protected]. Her research interests in-clude GIScience education, mapping, geovisualiza-tion, spatial cognition, applications of GIS and re-mote sensing, and distance learning.