Smart Urban Planning Support through Web Data Science on Open and Enterprise Data

Gloria Re Calegari and Irene Celino

CEFRIEL – Politecnico di Milano

The 24th International World Wide Web Conference
Florence, Italy

18 – 22 May 2015

Web Data Science meets Smart Cities
19th May 2015

Digital information about cities

• A large number of data sources are available on the web (open data):
  • Urban planning (land cover, public registers)
  • Demographics and statistics about the municipality

• User-generated information:
  • Volunteered geographic information and crowdsourced information (OpenStreetMap)
  • Location-based social networks (Foursquare check-ins and geolocated information)

• Closed data sources produced and maintained by enterprises:
  • Phone activity data

The cost of data management (collection, cleansing, maintenance) varies widely depending on the origin of the data.

Research goal

Long term goal:

• Can we predict (generate or update) a costly dataset from a set of cheap information sources?

Cheap datasets -> (predict or update) -> Expensive datasets

Our case study

• Data collection
  • Available datasets about Milano
  • Problem of spatial granularities and pre-processing of the datasets

• Data processing
  • Definition of input/output

• Predictive analysis
  • Statistical learning
  • Machine learning

• Results evaluation

Milano datasets

Demographics:
• Population density
• Spatial resolution: census area
• Source: Milano open data

Points of interest (POIs):
• Transport, schools, sports facilities, amenities, shops ...
• Spatial resolution: lat-long points
• Source: Milano open data (official) and OpenStreetMap (user generated)

Milano datasets

Land use cover:
• Type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined)
• CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html#
• 5 types selected (those that best characterise a metropolitan area such as Milan):
  1. Residential
  2. Agricultural
  3. Commercial/industrial
  4. Parks and green areas
  5. Sport centres
• Spatial resolution: building level
• Source: Lombardy region open data

Milano datasets

Call data records:
• 5 phone activities
  • Incoming SMS
  • Outgoing SMS
  • Incoming calls
  • Outgoing calls
  • Internet
• Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013)
• Summarizing structure: a footprint for each cell (average activity over all the days, distinguishing between weekdays and weekend days)
• Spatial resolution: grid of 3538 square cells of 250 m
• Source: Telecom Italia – provided for their Big Data Challenge: http://theodi.fbk.eu/openbigdata/
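The footprint described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the input layout (a list of `(timestamp, value)` pairs for one cell and one activity) and the function name `build_footprint` are assumptions, and the averaging assumes every recorded day has a value for every 10-minute slot.

```python
from datetime import datetime, timedelta

SLOTS_PER_DAY = 144  # one value every 10 minutes, as stated in the slides


def build_footprint(records):
    """Average activity per 10-minute slot, split into weekdays and weekend days.

    `records` is an iterable of (timestamp, value) pairs for one cell and one
    phone activity (hypothetical layout chosen for this sketch).
    """
    sums = {"week": [0.0] * SLOTS_PER_DAY, "weekend": [0.0] * SLOTS_PER_DAY}
    days = {"week": set(), "weekend": set()}
    for ts, value in records:
        kind = "weekend" if ts.weekday() >= 5 else "week"  # 5 = Sat, 6 = Sun
        slot = (ts.hour * 60 + ts.minute) // 10
        sums[kind][slot] += value
        days[kind].add(ts.date())
    # Divide each slot total by the number of distinct days of that kind.
    return {
        kind: [s / len(days[kind]) if days[kind] else 0.0 for s in sums[kind]]
        for kind in sums
    }


# Toy usage: constant activity 2.0 on a Monday and 4.0 on a Saturday.
monday = datetime(2013, 11, 4)
saturday = datetime(2013, 11, 9)
toy = [(monday + timedelta(minutes=10 * i), 2.0) for i in range(SLOTS_PER_DAY)]
toy += [(saturday + timedelta(minutes=10 * i), 4.0) for i in range(SLOTS_PER_DAY)]
footprint = build_footprint(toy)
```

Each cell then carries two 144-value curves per activity, one for weekdays and one for weekend days.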

Pre-processing

Make the spatial resolution uniform so that the datasets are comparable.

Spatial resolution used: grid of 3538 square cells of 250 m

Layers overlaid and intersected using QGIS.

New datasets generated:
• Presence/absence of POIs in each cell
• Weighted sum of population density in each cell
• Percentage share of each land use over each cell area
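The slides perform this step in QGIS; purely as an illustration, the POI presence/absence layer could be derived along these lines. The sketch assumes POI coordinates have already been projected to metres and that the grid is axis-aligned and row-major; `cell_index`, `poi_presence`, and the toy coordinates are hypothetical.

```python
CELL_SIZE_M = 250.0  # side of a square grid cell, as in the slides


def cell_index(x, y, x_min, y_min, n_cols, n_rows):
    """Return the row-major cell index of a projected point, or None if outside."""
    col = int((x - x_min) // CELL_SIZE_M)
    row = int((y - y_min) // CELL_SIZE_M)
    if 0 <= col < n_cols and 0 <= row < n_rows:
        return row * n_cols + col
    return None


def poi_presence(pois, x_min, y_min, n_cols, n_rows):
    """Build {category: [0/1 per cell]} from POIs given as (category, x, y)."""
    presence = {}
    for category, x, y in pois:
        idx = cell_index(x, y, x_min, y_min, n_cols, n_rows)
        if idx is None:
            continue  # POI falls outside the grid
        flags = presence.setdefault(category, [0] * (n_cols * n_rows))
        flags[idx] = 1
    return presence


# Toy usage on a 2 x 2 grid anchored at the origin (made-up coordinates).
toy_pois = [("school", 10.0, 10.0), ("school", 300.0, 260.0), ("shop", 600.0, 0.0)]
presence = poi_presence(toy_pois, x_min=0.0, y_min=0.0, n_cols=2, n_rows=2)
```

The population and land-use layers need area-weighted intersections rather than point lookups, which is where a GIS tool is the more natural choice.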

Selection of input/output variables

Inputs (predictors):
• Telecom data
  • Means of each phone activity (10 values)
  • Hour-by-hour means of all the activities (24 values)
• POIs: school, transport, shop, food, sport, ...

Predictive models (regression)

Outputs:
• Land use density: residential, agricultural, commercial, green area, sport facilities
• Population density
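One plausible way to reduce the footprints to the 10 + 24 telecom predictors is sketched below. The slides do not say how the hourly values are aggregated across activities and day types, so the plain average used here is an assumption, as are the function name and the dictionary layout.

```python
def telecom_features(footprints):
    """Summarise footprints into per-activity means and hour-by-hour means.

    `footprints` maps (activity, day_type) to a list of 144 ten-minute values
    (hypothetical layout). Returns (activity_means, hourly_means).
    """
    # One mean per (activity, day type): 5 activities x 2 day types = 10 values.
    activity_means = {key: sum(v) / len(v) for key, v in footprints.items()}
    # One mean per hour over all activities and day types: 24 values.
    hourly_means = []
    for hour in range(24):
        values = [
            x for v in footprints.values() for x in v[hour * 6:(hour + 1) * 6]
        ]
        hourly_means.append(sum(values) / len(values))
    return activity_means, hourly_means


# Toy usage: each curve is constant, with levels 0.0 to 9.0.
activities = ["sms_in", "sms_out", "call_in", "call_out", "internet"]
keys = [(a, d) for a in activities for d in ("week", "weekend")]
toy = {key: [float(i)] * 144 for i, key in enumerate(keys)}
activity_means, hourly_means = telecom_features(toy)
```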

Aims of the experiments

1. Comparing different regression algorithms
   1. Statistical learning approach -> Multiple Linear Regression (MLR)
   2. Machine learning approach -> Random Forest (RF)

2. Evaluating how the number of predictors impacts model performance
   1. All the predictors
   2. Manual selection of a subset of predictors
   3. Automatic selection of predictors by AIC (Akaike information criterion)
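As a reminder of what AIC-based selection optimises: for a least-squares model with Gaussian errors, AIC can be written (up to an additive constant) as n·ln(RSS/n) + 2k, where k is the number of estimated parameters, and the candidate with the lowest value is preferred. The sketch below only compares candidates that have already been fitted; the slides do not describe the search procedure used, and the RSS figures are made up.

```python
import math


def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def best_by_aic(candidates, n_obs):
    """Pick the candidate with the lowest AIC.

    `candidates` maps a model name to (residual sum of squares, parameter count).
    """
    scores = {
        name: gaussian_aic(rss, n_obs, k) for name, (rss, k) in candidates.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical numbers: the full model fits slightly better but pays a larger
# penalty for its extra parameters.
candidates = {"all predictors": (40.0, 35), "subset": (42.0, 9)}
best, scores = best_by_aic(candidates, n_obs=100)
```

The 2k term is what lets a smaller model win even when its residual error is slightly higher.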

Tests performed

5 tests combining the different algorithms and inputs

        All predictors   Manual selection   AIC selection

RF            x                 x

MLR           x                 x                 x

Methodology of the experiments

• Dividing the dataset into training (90%) and test (10%) sets

• Training the model using 10-fold cross-validation to limit overfitting

• Calculating the adjusted R^2 to measure the goodness of fit of the model (share of variance explained)
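The three steps above can be sketched in plain Python as follows. This is a generic illustration under stated assumptions, not the authors' pipeline: the helper names, the random seed, and the use of cell indices as the unit being split are all hypothetical. The adjusted R^2 uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1) for n observations and p predictors.

```python
import random


def train_test_split(indices, test_share=0.10, seed=0):
    """Shuffle the indices and hold out `test_share` of them as the test set."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield rest, held_out


def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 for n observations and `n_predictors` predictors."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Usage on the 3538 grid cells: 90/10 split, then 10 folds on the training part.
train, test = train_test_split(range(3538))
folds = list(k_folds(train))
```

Each model would be fitted on the training part of every fold and scored on the held-out part, with the 10% test set kept aside for the final comparison.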

Results

1) Different output results: some variables are predicted better than others

2) Models comparison: RF always equals or outperforms MLR (the data does not follow a linear relationship but a more complex one)

3) Number of predictors: RF with manual selection is usually better than RF-all, and MLR with AIC selection is better than the other MLR models. The higher the number of variables included in the model, the higher the risk of overfitting (larger difference between the R^2 of the training and test sets)

[Figure panels: MLR – all, MLR – manual selection, MLR – AIC selection, RF – all, RF – manual selection]

Adjusted R^2     RF – all           RF – manual selection
                 Train    Test      Train    Test
population       0.668    0.623     0.604    0.591
residential      0.633    0.588     0.623    0.614

Worse results in RF – manual selection (population: test 0.591 vs 0.623)

Predictor importance calculated by RF – all

7 vars in the top 10 out of the manually selected

2 vars in the top 10 out of the manually selected

Variable selection is an essential step in optimizing a predictive model

Better results in RF – manual selection (residential: test 0.614 vs 0.588)
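The overfitting point can be checked directly on the reported figures by taking the train-minus-test gap for each model. The snippet below only re-uses the four rows of the table above; the dictionary layout is an illustrative choice.

```python
# Adjusted R^2 values as reported in the slides: (train, test).
reported = {
    ("population", "RF-all"): (0.668, 0.623),
    ("population", "RF-manual"): (0.604, 0.591),
    ("residential", "RF-all"): (0.633, 0.588),
    ("residential", "RF-manual"): (0.623, 0.614),
}

# A larger train-test gap suggests more overfitting.
gaps = {key: round(train - test, 3) for key, (train, test) in reported.items()}

for (target, model), gap in gaps.items():
    print(f"{target:12s} {model:10s} gap = {gap:.3f}")
```

For both targets the gap shrinks when moving from all predictors to the manual subset (0.045 to 0.013 for population, 0.045 to 0.009 for residential), which is consistent with the claim that more variables increase the risk of overfitting.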

Conclusions

• Encouraging results in employing open and enterprise datasets in regression models

• Good results in predicting population, residential and agricultural areas -> explained variability reaching 62%

• There is a relation between land use/population and the diverse, heterogeneous datasets used as predictors (POIs and phone activity)

• Choosing the best predictors is an "art". A lot of relevant data about cities is available. A preprocessing phase is essential to select only the most informative and discriminative variables.

Future work

• Improvements on input variables: preprocessing predictors to extract more discriminative information from the data (changing the POI data from presence/absence to distance from the closest POI)

• Improvements on output variables: definition of new outputs that are easier to predict experimentally (dense residential, sparse residential, agricultural, industrial/commercial, parks and natural areas). Problems in predicting specific land uses (parks, sport centres) -> other kinds of input data may be required.

• Improvements on predictive algorithms: better results using Support Vector Machines (SVM) -> the urban environment is so complex that it cannot be modelled using linear models

• Reproducibility of our solution in different scenarios: comparable results obtained on other European cities (Barcelona, Munich and Brussels) -> the proposed methodology is successful.
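The first item above, replacing POI presence/absence with the distance to the closest POI, could look roughly like this. It is a brute-force sketch that assumes projected coordinates in metres and uses cell centres as reference points; both choices, and the function name, are assumptions rather than anything stated in the slides.

```python
import math


def nearest_poi_distance(cell_centres, pois):
    """Distance from each cell centre to its closest POI (same units as input).

    Returns math.inf for every cell when there are no POIs.
    """
    return [
        min((math.hypot(cx - px, cy - py) for px, py in pois), default=math.inf)
        for cx, cy in cell_centres
    ]


# Toy usage: two 250 m cell centres and two POIs (made-up coordinates).
centres = [(125.0, 125.0), (375.0, 125.0)]
schools = [(125.0, 125.0), (375.0, 425.0)]
distances = nearest_poi_distance(centres, schools)
```

Unlike a 0/1 flag, the distance still carries information for cells that contain no POI of a given category. For a grid of a few thousand cells the brute-force loop is adequate; a spatial index would be the usual choice at larger scale.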

Thank you! Any questions?

Gloria Re Calegari and Irene Celino

CEFRIEL – Politecnico di Milano
