Post on 27-Jul-2015
Smart Urban Planning Support through Web Data Science on Open and
Enterprise Data
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano
The 24th International World Wide Web Conference
Florence, Italy
18 – 22 May 2015
Web Data Science meets Smart Cities
19th May 2015
Digital information about cities
• Large number of data sources available on the web (open data):
  • Urban planning (land cover, public registers)
  • Demographics and statistics about the municipality
• User-generated information:
  • Volunteered geographic information and crowdsourced information (OpenStreetMap)
  • Location-based social networks (Foursquare check-ins and geolocated information)
• Closed data sources produced and maintained by enterprises:
  • Phone activity data
The cost of data management (collection, cleansing, maintenance) varies widely depending on the origin of the data.
Research goal
Long term goal:
• Can we predict (generate or update) a costly dataset from a set of cheap information sources?
Cheap datasets -> predict or update -> Expensive datasets
Our case study
• Data collection
  • Available datasets about Milano
  • Problem of spatial granularities and pre-processing of the datasets
• Data processing
  • Definition of input/output
• Predictive analysis
  • Statistical learning
  • Machine learning
• Results evaluation
Milano datasets
Demographics:
• Population density
• Spatial resolution: census area
• Source: Milano open data
Points of interest (POIs):
• Transport, schools, sports facilities, amenity places, shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official) and OpenStreetMap (user generated)
Milano datasets
Land use cover:
• Type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined)
• CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html#
• 5 types selected (those that best characterise a metropolitan area such as Milan):
  1. Residential
  2. Agricultural
  3. Commercial/industrial
  4. Parks and green areas
  5. Sport centres
• Spatial resolution: building level
• Source: Lombardy region open data
Milano datasets
Call data records:
• 5 phone activities:
  • Incoming SMS
  • Outgoing SMS
  • Incoming calls
  • Outgoing calls
  • Internet
• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)
• Summarizing structure: a footprint for each cell (average activity over all the days, distinguishing between weekdays and weekend days)
• Spatial resolution: grid of 3538 square cells of 250 m
• Source: Telecom Italia – provided for their Big Data Challenge: http://theodi.fbk.eu/openbigdata/
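The per-cell footprint described on this slide could be computed along the following lines. This is only a minimal sketch in plain Python: the record layout `(cell_id, timestamp, value)` and the function names are assumptions made for illustration, not the actual format of the Telecom Italia files or the authors' code. It handles one activity at a time and treats slots with no record as zero activity.

```python
from collections import defaultdict
from datetime import datetime

SLOTS_PER_DAY = 144  # one value every 10 minutes


def slot_of(ts: datetime) -> int:
    """Index (0..143) of the 10-minute slot that contains ts."""
    return (ts.hour * 60 + ts.minute) // 10


def footprint(records):
    """Average activity per cell, per day type and per 10-minute slot.

    records: iterable of (cell_id, timestamp, value) tuples (hypothetical layout).
    Returns {cell_id: {"weekday": [144 means], "weekend": [144 means]}}.
    """
    sums = defaultdict(
        lambda: {"weekday": [0.0] * SLOTS_PER_DAY, "weekend": [0.0] * SLOTS_PER_DAY}
    )
    days = defaultdict(lambda: {"weekday": set(), "weekend": set()})
    for cell_id, ts, value in records:
        # Monday is 0, so 5 and 6 are Saturday and Sunday.
        kind = "weekend" if ts.weekday() >= 5 else "weekday"
        sums[cell_id][kind][slot_of(ts)] += value
        days[cell_id][kind].add(ts.date())
    result = {}
    for cell_id, by_kind in sums.items():
        result[cell_id] = {}
        for kind, totals in by_kind.items():
            # Divide by the number of distinct days observed for this day type.
            n_days = len(days[cell_id][kind]) or 1
            result[cell_id][kind] = [total / n_days for total in totals]
    return result
```

Calling `footprint` once per activity (incoming SMS, outgoing SMS, and so on) would give the five footprints per cell mentioned above.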
Pre-processing
Make the spatial resolution uniform so that the datasets are comparable.
Spatial resolution used: grid of 3538 square cells of 250 m
Layers overlaid and intersected using QGIS.
New datasets generated:
• Presence/absence of POIs in each cell
• Weighted sum of population density in each cell
• Percentage share of each land use over each cell area
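The slides report that this step was done in QGIS. Purely as an illustration of the area-weighting idea behind the last two datasets, here is a small sketch in plain Python. It assumes, for simplicity, that census areas and land-use parcels are axis-aligned rectangles `(xmin, ymin, xmax, ymax)` in metres; real geometries are arbitrary polygons, and all function names are hypothetical.

```python
def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def rect_area(r):
    """Area of an axis-aligned rectangle."""
    return (r[2] - r[0]) * (r[3] - r[1])


def weighted_density(cell, census_areas):
    """Sum of census densities, each weighted by the share of the cell it covers.

    census_areas: list of (rectangle, population_density) pairs.
    """
    cell_area = rect_area(cell)
    return sum(
        density * overlap_area(cell, rect) / cell_area
        for rect, density in census_areas
    )


def land_use_shares(cell, parcels):
    """Percentage of the cell area covered by each land-use type.

    parcels: list of (rectangle, land_use_label) pairs.
    """
    cell_area = rect_area(cell)
    shares = {}
    for rect, land_use in parcels:
        share = 100.0 * overlap_area(cell, rect) / cell_area
        shares[land_use] = shares.get(land_use, 0.0) + share
    return shares
```

For a 250 m cell split evenly between two census areas with densities 100 and 200, `weighted_density` returns 150.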
Selection of input/output variables
Predictive models (regression)
INPUT
• Telecom data
  • Means of each phone activity (10 values)
  • Hour-by-hour means of all the activities (24 values)
• POIs
  • School
  • Transport
  • Shop
  • Food
  • Sport
  • ...
OUTPUT
• Land use density:
  • Residential
  • Agricultural
  • Commercial
  • Green area
  • Sport facilities
• Population density
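As a rough illustration of how the 10 + 24 telecom predictors for one cell might be derived from its footprints, here is a sketch in plain Python. The dictionary keys are hypothetical, and the exact pooling used for the 24 hourly values is not stated in the slides; this version assumes a plain mean over all activities, both day types and the six 10-minute slots of each hour.

```python
from statistics import mean

# Hypothetical keys; each footprint is a list of 144 ten-minute averages.
ACTIVITIES = ["sms_in", "sms_out", "call_in", "call_out", "internet"]
DAY_TYPES = ["weekday", "weekend"]
SLOTS_PER_HOUR = 6


def activity_means(footprints):
    """10 values: daily mean of each activity, for weekdays and for weekends."""
    return [mean(footprints[a][d]) for a in ACTIVITIES for d in DAY_TYPES]


def hourly_means(footprints):
    """24 values: for each hour, mean over all activities, day types and slots (assumed pooling)."""
    values = []
    for hour in range(24):
        start = hour * SLOTS_PER_HOUR
        pooled = [
            v
            for a in ACTIVITIES
            for d in DAY_TYPES
            for v in footprints[a][d][start:start + SLOTS_PER_HOUR]
        ]
        values.append(mean(pooled))
    return values


def telecom_predictors(footprints):
    """Concatenate both groups into one 34-value predictor vector for a cell."""
    return activity_means(footprints) + hourly_means(footprints)
```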
Aims of the experiments
1. Comparing different regression algorithms
   1. Statistical learning approach -> Multiple Linear Regression (MLR)
   2. Machine learning approach -> Random Forest (RF)
2. Evaluating how the number of predictors impacts model performance
   1. All the predictors
   2. Manual selection of a subset of predictors
   3. Automatic selection of predictors by AIC (Akaike information criterion)
Tests performed
5 tests combining the different algorithms and inputs
        All predictors   Manual selection   AIC selection
RF      x                x
MLR     x                x                  x
Methodology of the experiments
• Dividing dataset into training (90%) and test (10%) sets
• Training the model using 10-fold cross-validation to avoid overfitting
• Calculating the adjusted R^2 index to measure the goodness of fit of the model (percentage of variance explained)
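The three steps above can be sketched in plain Python as follows. This is an illustrative outline, not the authors' code: the function names are hypothetical, and the model fitting itself (MLR or RF) is left out. The adjusted R^2 uses the standard definition 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors.

```python
import random


def train_test_split(n, test_fraction=0.1, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    for i in range(k):
        validation = indices[i::k]  # every k-th index, starting at position i
        held_out = set(validation)
        yield [j for j in indices if j not in held_out], validation


def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2: R^2 penalised for the number of predictors in the model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
```

With the 3538 cells of the grid, `train_test_split(3538)` leaves 3184 cells for training and 354 for testing; `k_folds` is then applied to the training indices only.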
Results
1) Different output results: some variables are predicted better than others
2) Models comparison: RF always equals or outperforms MLR (the data do not follow a linear distribution but a more complex one)
3) Number of predictors: RF with manual selection is usually better than RF with all predictors, and MLR with AIC selection is better than the other MLR models. The higher the number of variables included in the model, the higher the risk of overfitting (larger difference between the R^2 of the training and test sets)
[Charts: MLR – all, MLR – manual selection, MLR – AIC selection, RF – all, RF – manual selection]
Predictors importance calculated by RF-all
Adj R-square    RF - all          RF - manual selection
                Train    Test     Train    Test
population      0.668    0.623    0.604    0.591    (worse results in RF-manual selection)
residential     0.633    0.588    0.623    0.614    (better results in RF-manual selection)
• 7 vars in the top 10 out of the manually selected
• 2 vars in the top 10 out of the manually selected
Variable selection is an essential step in optimizing a predictive model
Conclusions
• Encouraging results in employing open and enterprise datasets in regression models
• Good results in predicting population, residential and agricultural areas -> explained variability reaching 62%
• There is a relation between land use/population and the diverse and heterogeneous datasets used as predictors (POIs and phone activity)
• Choosing the best predictors is an "art". A lot of relevant data about cities is available. A preprocessing phase is essential to select only the most informative and discriminative variables.
Future work
• Improvements on input variables: preprocessing predictors to extract more discriminative information from the data (changing the POI data from presence/absence to distance from the closest POI)
• Improvements on output variables: definition of new outputs that are easier to predict experimentally (dense residential, sparse residential, agricultural, industrial/commercial, parks and natural areas). Problems in predicting specific land uses (parks, sport centres) -> other kinds of input data may be required.
• Improvements on predictive algorithms: better results using Support Vector Machine (SVM) -> the urban environment is so complex that it cannot be modelled using linear models
• Reproducibility of our solution on different scenarios: comparable results obtained on other European cities (Barcelona, Muenchen and Brussels) -> the methodology proposed is successful.
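The first item above proposes replacing the presence/absence flag with the distance from the closest POI. One possible way to compute such a predictor from lat-long points is the haversine great-circle distance, sketched below in plain Python. The function names and the choice of the cell centre as reference point are assumptions made for illustration; the slides do not specify how the distance would be measured.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat-long points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def distance_to_closest_poi(cell_centre, pois):
    """Distance in metres from a cell centre (lat, lon) to the nearest POI in a non-empty list."""
    return min(haversine_m(*cell_centre, *poi) for poi in pois)
```

Computing this once per cell and per POI category would yield a continuous predictor in place of each binary presence/absence column.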
Thank you! Any questions?
Gloria Re Calegari and Irene Celino
CEFRIEL – Politecnico di Milano