The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs Connect
Transcript of The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs Connect
Big Data for Social Good
Nuria Oliver, PhD
Scientific Director
Telefonica R&D
7+ billion mobile phones worldwide
97% of world’s population (ITU)
Mobile penetration of 120% to 89% of population (ITU)
Emerging and developed regions
More time spent on our phones than watching TV or with our
partner (US and UK)
Mobile Network Infrastructure
● A cell tower (also known as BTS or base station) provides an approximate location
● Spacing between towers is 500m-1km in urban areas, 2-3 km in rural areas
● Each tower has a “Voronoi cell”; this is the area to which the tower provides best coverage i.e. tower is closest
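The Voronoi-cell idea above can be illustrated with a minimal sketch: a point belongs to the cell of whichever tower is closest to it. The tower ids and planar coordinates below are hypothetical, and real deployments would use geographic coordinates and actual coverage maps rather than straight-line distance.

```python
import math

# Hypothetical tower positions as planar (x, y) coordinates in metres.
TOWERS = {
    "BTS_A": (0.0, 0.0),
    "BTS_B": (1000.0, 0.0),
    "BTS_C": (0.0, 2500.0),
}


def nearest_tower(point, towers=TOWERS):
    """Return the id of the tower closest to `point`.

    By definition, `point` lies in the Voronoi cell of that tower.
    """
    return min(towers, key=lambda tower_id: math.dist(point, towers[tower_id]))


if __name__ == "__main__":
    for location in [(400.0, 100.0), (700.0, 100.0), (0.0, 2000.0)]:
        print(location, "->", nearest_tower(location))
```

Because a CDR only stores the tower id, the location of a phone is known only up to the size of the cell, which is why the tower spacing quoted above bounds the spatial precision.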
Typical Mobile Data
• CDR
• SMS
Consumption          | Social Network           | Mobility
Call duration        | In/Out Degree            | Radius of gyration
N. Events            | Delta w.r.t. time window | Travelled distance
Lapse between events | Unique Calls per day     | Rate of popular antennas
Reciprocated events  | Unique SMS per day       | Regularity of popular antennas
…                    | …                        | …
HR_ORG TLFN_A TLFN_B CD_GEO_A CD_GEO_B DT_ORG CD_SNTD CD_ERB CD_CCC QT_DUR
20:05:31 XXX YYY 3 11 20140519 2 1562 568 33
… … … … … … … … … …
HR_ORG TLFN_A TLFN_B CD_GEO_A CD_GEO_B DT_ORG CD_SNTD QT_TRFG
15:53:54 XXX ZZZ 3 25 20140506 2 1
… … … … … … … …
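As an illustration of one of the mobility variables listed above, the radius of gyration is commonly defined as the root-mean-square distance of a user's visited positions from their centre of mass. The sketch below assumes each event has already been mapped to a hypothetical planar tower position in km; the slides do not specify the exact formula used.

```python
import math


def radius_of_gyration(positions):
    """Root-mean-square distance of visited positions from their centre of mass.

    `positions` holds one (x, y) pair per event, so frequently used
    towers weigh more than rarely used ones.
    """
    n = len(positions)
    if n == 0:
        raise ValueError("need at least one position")
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in positions) / n
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    # Hypothetical user: two events at one tower, two at another 5 km away.
    events = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
    print(radius_of_gyration(events))
```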
[Diagram: Cell Phone Data -> Behavioral Insights (Telefonica Research) -> tools for institutions & policy makers: Urban Planning Tools, Official Statistics Tools, Crisis Management Tools, Public Health Tools]
Big Data for Social Good @ Telefonica
Crime Prediction (70% accuracy)
Analysis of impact of floods
http://www.wired.co.uk/news/archive/2013-10/17/nuria-oliver
Natural Disasters
Flooding Through the Lens of Mobile Phone Activity
David Pastor-Escuredo (1), Alfredo Morales-Guzman (1), Yolanda Torres-Fernandez (1), Jean-Martin Bauer (2), Amit Wadhwa (2), Carlos Castro-Correa (3), Liudmyla Romanoff (4), Jong Gun Lee (4), Alex Rutherford (4), Vanessa Frias-Martinez (5), Nuria Oliver (6), Enrique Frias-Martinez (6), Miguel Luengo-Oroz (4)
1. Universidad Politécnica de Madrid
2. Vulnerability Analysis and Mapping, World Food Programme
3. Coordinación de Estrategia Digital Nacional, Presidencia de la República de México
4. United Nations Global Pulse
5. University of Maryland
6. Telefonica Research
UN Global Pulse is an innovation initiative of the United Nations Secretary-
General, driving a big data revolution for development.
Its vision is to see big data used safely and responsibly for public good; its
mission to accelerate discovery, development and adoption of big data
and real-time analytics for sustainable development and humanitarian
action.
Tabasco Floods
• Tabasco state frequently suffers flooding
• 800mm of rain 31st October – 3rd November 2009 (4x
normal)
• 214,000 people affected
• State of emergency declared
Reconstructing the Floods
• We combined mobile data with NASA LANDSAT satellite imagery data, governmental surveys and census data
• NASA LANDSAT: geolocation of the most affected areas
• Census data: representativeness of mobile data
• Precipitation data and civil protection reports: timeline of people’s awareness and response
CDR Representativeness
● Mobile user representativeness validated against census data
● Population estimates based on CDRs are strongly correlated with official Census data and population statistics (R² = 0.97)
● Mobile data can serve as a proxy of population distribution in areas where other data sources are not available or reliable
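A minimal sketch of how such a representativeness check could be computed: the squared Pearson correlation between census population and CDR-based estimates per area. The numbers below are made up for illustration and are not the data behind the reported value; the slides do not state how R² was obtained.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-area values: census population vs. unique phones seen.
    census = [1000, 2000, 3000, 4000, 5000]
    phones = [110, 190, 320, 400, 480]
    print(round(r_squared(census, phones), 3))
```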
Event Detection
x(t) is the raw number of unique phones placing or receiving calls at each antenna per day
● Quantify the impact of floods by comparing the behavior of each BTS against its baseline activity using a z-score-like metric
● BTS with higher variations in the number of calls made during the floods were located in the most affected regions
● Special events (e.g. Christmas) led to similar changes in behavior w.r.t. baseline, but across the entire region as opposed to in specific locations
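The comparison against baseline can be sketched as a plain z-score per tower: how many baseline standard deviations a day's x(t) lies from the baseline mean. This is an assumption about the form of the "z-score like metric"; the slides do not give the exact definition, and the threshold of 3 below is a hypothetical choice.

```python
from statistics import mean, stdev


def activity_z_scores(baseline, observed):
    """z-score of each observed daily count relative to the baseline period."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation, needs >= 2 points
    if sigma == 0:
        raise ValueError("baseline has no variation")
    return [(x - mu) / sigma for x in observed]


def anomalous_days(baseline, observed, threshold=3.0):
    """Indices of observed days whose |z| exceeds `threshold` (hypothetical cut-off)."""
    scores = activity_z_scores(baseline, observed)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


if __name__ == "__main__":
    # Hypothetical unique-phone counts for one BTS.
    baseline_days = [100, 102, 98, 101, 99]
    flood_days = [101, 130, 97]
    print(anomalous_days(baseline_days, flood_days))
```

Running the same computation for every BTS and mapping the flagged towers would show whether the anomalies are localized (as for the floods) or region-wide (as for Christmas).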
Actionable Insights
● Patterns in BTS activity can be used to measure the impact of floods
● Behavioral insights are important for emergency services
● People communicated more as a result of the initial impacts of the floods, rather than following the recommendations of public protection
● Civil protection warnings did not seem to be an effective way to raise awareness
● Spikes in activity were observed only after the situation was critical
Pandemics
Vanessa Frias-Martinez, Enrique Frias-Martinez et al
Data Analysis
• Call Records from 1st Jan till 31st May 2009
• Compute mobility as the number of different BTS visited
• Stages
• Medical Alert - Stage 1 (17th-27th April)
• Closing Schools - Stage 2 (28th April-1st May)
• Suspension of Essential Activities - Stage 3 (1st May-6th May)
• Baselines
• same periods, different year (2008)
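The mobility measure and the baseline comparison described above can be sketched as follows. The record layout `(user, day, bts_id)` and the sample values are hypothetical; only the idea of counting different BTS visited per user in a stage, and comparing with the same period in 2008, comes from the slide.

```python
from collections import defaultdict
from datetime import date


def mean_distinct_bts(records, start, end):
    """Average number of different BTS visited per user between start and end (inclusive)."""
    visited = defaultdict(set)
    for user, day, bts_id in records:
        if start <= day <= end:
            visited[user].add(bts_id)
    if not visited:
        return 0.0
    return sum(len(towers) for towers in visited.values()) / len(visited)


def relative_change(baseline, observed):
    """Relative change of `observed` with respect to a non-zero `baseline`."""
    return (observed - baseline) / baseline


# Hypothetical call records: (user, day, bts_id).
RECORDS = [
    ("u1", date(2008, 4, 18), 1), ("u1", date(2008, 4, 19), 2),
    ("u1", date(2008, 4, 20), 3), ("u2", date(2008, 4, 21), 1),
    ("u1", date(2009, 4, 18), 1), ("u1", date(2009, 4, 19), 1),
    ("u2", date(2009, 4, 21), 1), ("u2", date(2009, 4, 22), 2),
]

if __name__ == "__main__":
    # Stage 1 (17th-27th April) in 2009 against the same period in 2008.
    base = mean_distinct_bts(RECORDS, date(2008, 4, 17), date(2008, 4, 27))
    stage1 = mean_distinct_bts(RECORDS, date(2009, 4, 17), date(2009, 4, 27))
    print(base, stage1, relative_change(base, stage1))
```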
[Chart labels: 80%, 55%, 0%]
Agent-based Disease Model
• Mobility Model
• Social Network Model
• Disease Model
Impact on disease propagation
[Chart labels: 10%, 40h]
Crime
Work with Bogomolov, A., Lepri, B., Staiano, J., Pianesi, F., Pentland, A.
Crime affects quality of life and economic development at both the national and local level
Several studies explore relationships between crime and socio-economic variables: education, income, unemployment, ethnicity, …
Several studies have shown significant concentrations of crime in small geographical areas: crime hotspots
Crime and Urban Environment
T1: Natural surveillance as key deterrent for crime: people moving around are eyes on the street (Jacobs, 1961)
  high diversity among the population and high number of visitors -> less crime
T2: Defensible space theory (Newman, 1972)
  high mix of people -> more crime
Crime Prediction
People-centric perspective vs Place-centric perspective
  people-centric perspective used for individual or collective criminal profiling
  place-centric perspective used for predicting crime hotspots
Our Approach
Data-driven and place-centric approach to crime prediction
Multimodal approach: people dynamics derived from mobile network data and demographics
European metropolis: London
Prediction of crime hotspots, not criminal profiling
Multimodal Approach: Data
Smartsteps Dataset: for each of the Smartsteps cells, a variety of demographic and human dynamics variables were computed every hour for 3 weeks (from December 9 to December 15, 2012 and from December 23, 2012 to January 5, 2013)
Criminal Cases Dataset: criminal cases for December 2012 and for January 2013
London Borough Profiles Dataset: open dataset containing 68 metrics about the population of a particular geographic area
SmartSteps
• Footfall count: Shows the trend in footfall in a specified area hourly, daily, weekly and monthly. Provides a basic profile of the crowd.
• Catchment area: Shows which postal sectors your customers are coming from by hour, day, week and month. Shows the “battleground” for two sites.
• Transport mode: Shows flows of crowds between any two points, segmented by road, air, train, etc.
SmartSteps Data
For each cell and for each hour the dataset contains:
  an estimation of how many people are in the cell
  the percentage of these people at home, at work or just visiting the cell
  the gender splits (male vs. female)
  the age splits (0-20 years, 21-30 years, 31-40 years, …)
Crime Data
Crime geolocation for 2 months (December 2012 – January 2013)
All reported crimes in the UK, specifying month and year but not the specific day/time
Median crime value (=5) used as threshold
Spatial granularity of borough profiles is at LSOA level: LSOAs are small geographical areas defined by the UK Office for National Statistics (mean population: 1500)
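A minimal sketch of how a median threshold turns crime counts into the two classes used later (high crime vs low crime). The area names and counts are hypothetical, and whether an area exactly at the median counts as high or low is an assumption (here: low).

```python
from statistics import median


def label_high_crime(crime_counts):
    """Map each area to True (high crime) if its count is above the median."""
    threshold = median(crime_counts.values())
    return {area: count > threshold for area, count in crime_counts.items()}


if __name__ == "__main__":
    # Hypothetical number of reported crimes per area; the median here is 5.
    counts = {"area_a": 2, "area_b": 5, "area_c": 9, "area_d": 7, "area_e": 1}
    print(label_high_crime(counts))
```

Splitting at the median yields two classes of roughly equal size, which keeps a majority-class baseline close to 50% accuracy.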
London Borough Profiles Data
68 metrics about the population of a specific geographical area: demographics, households, migrant population, employment, earnings, life expectancy, happiness levels, house prices, etc.
Spatial granularity of borough profiles is at LSOA level: LSOAs are small geographical areas defined by the UK Office for National Statistics (mean population: 1500)
Feature Extraction
From Smartsteps data we extract
  1st order features (mean, median, min., max., entropy, etc.)
  2nd order features on sliding windows of variable length (1 hour, 4 hours, 1 day, etc.) to account for temporal patterns
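The two feature families can be sketched on a single hourly series for one cell. The reading of "2nd order" used here (first-order statistics computed over sliding-window means) is an assumption, since the slides do not define it; the function names and sample values are hypothetical.

```python
from statistics import mean, median


def first_order_features(series):
    """Summary statistics of a raw series (entropy omitted for brevity)."""
    return {
        "mean": mean(series),
        "median": median(series),
        "min": min(series),
        "max": max(series),
    }


def sliding_window_means(series, window):
    """Mean of every full window of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]


def second_order_features(series, window):
    """First-order statistics of the sliding-window means (assumed definition)."""
    return first_order_features(sliding_window_means(series, window))


if __name__ == "__main__":
    # Hypothetical hourly footfall estimates for one cell.
    hourly_footfall = [10, 20, 30, 40, 35, 25]
    print(first_order_features(hourly_footfall))
    print(second_order_features(hourly_footfall, window=4))
```

Repeating this for every variable, cell and window length is how a pool of thousands of candidate features can arise from a few dozen raw variables.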
Feature Selection
Mean decrease in Gini coefficient of inequality: the feature with the maximum mean decrease in Gini coefficient is expected to have the maximum influence in minimizing the out-of-bag error
The feature selection process produced a reduced subset of 68 features (from an initial pool of about 6000 features)
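To make the selection criterion concrete, the sketch below computes the Gini decrease of a single candidate split, assuming "Gini" refers to the Gini impurity used when growing the trees of a random forest. A forest averages such decreases over all splits and trees that use a feature; that averaging is not reproduced here, and the sample values are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a list of binary labels (0 = low crime, 1 = high crime)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_high = sum(labels) / n
    return 1.0 - p_high ** 2 - (1.0 - p_high) ** 2


def gini_decrease(values, labels, split):
    """Impurity reduction obtained by splitting on `values <= split`."""
    left = [y for x, y in zip(values, labels) if x <= split]
    right = [y for x, y in zip(values, labels) if x > split]
    n = len(labels)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


if __name__ == "__main__":
    # Hypothetical feature values and crime labels for four areas.
    informative = [1.0, 2.0, 3.0, 4.0]
    labels = [0, 0, 1, 1]
    print(gini_decrease(informative, labels, split=2.5))
```

A feature whose splits separate the classes well produces large decreases; an uninformative one produces decreases near zero and would be dropped.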
Classification Approach
Binary classification task: high crime area vs low crime area
10-fold cross-validation approach
Classifier: Random Forest (RF)
RF outperforms logistic regression, support vector machines, neural networks and decision trees
Smartsteps-based classifier significantly outperforms the baseline majority classifier and the borough profiles-based classifier
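The evaluation protocol can be sketched without any particular classifier: split the areas into 10 folds, and in each round predict the held-out fold from the other nine. The sketch below does this for the majority baseline only; the Random Forest itself is not reproduced, the labels are hypothetical, and in practice the data would be shuffled before splitting.

```python
from collections import Counter


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def majority_baseline_accuracy(labels, k=10):
    """Cross-validated accuracy of always predicting the training majority class."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in test)
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical labels: 14 high-crime areas (1) and 6 low-crime areas (0).
    area_labels = [1] * 14 + [0] * 6
    print(majority_baseline_accuracy(area_labels))
```

Any classifier can be dropped into the same loop by fitting it on `train` and scoring it on `test`, which makes its accuracy directly comparable with the baseline.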
Experimental Results
~70% accuracy in predicting crime hotspots
[Figure panels: ground-truth, predictions]
Relevant Features
Features encoding daily dynamics have more predictive power than features extracted on a monthly basis
Relevance of high number of residents to predict crime areas
  increased ratio of residents -> more crime (in contrast with Newman’s thesis)
Entropy-based features are useful for predicting crime hotspots
  high diversity of functions (home vs work) and high diversity of people (gender and age) act as eyes on the street, decreasing crime (in line with Jacobs’ thesis)
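An entropy-based diversity feature of the kind mentioned above can be sketched as the Shannon entropy of a split, for example the shares of people at home, at work and visiting a cell. The choice of base-2 logarithm and the sample shares are assumptions; the slides do not specify how entropy was computed.

```python
import math


def shannon_entropy(shares):
    """Shannon entropy (in bits) of non-negative shares; they are normalized first."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    probabilities = [s / total for s in shares if s > 0]
    return -sum(p * math.log2(p) for p in probabilities)


if __name__ == "__main__":
    # Hypothetical (home, work, visiting) percentages for two cells.
    mixed_use_cell = [34, 33, 33]
    residential_cell = [90, 5, 5]
    print(shannon_entropy(mixed_use_cell), shannon_entropy(residential_cell))
```

Higher entropy means a more even mix of functions or people in the cell, which is the quantity the "eyes on the street" argument relates to lower crime.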
Relevant Features
Only 6 out of 68 features in the joint model are London Borough features, namely
  % working population claiming out of work benefits
  Largest migrant population
  % overseas nationals entering the UK
  % resident population born abroad
Implications
Our method captures the dynamics of a place rather than making extrapolations from previous crime histories. We can use it in areas where people are less inclined to report crimes
Our method provides new ways of describing geographical areas: novel risk-inducing or risk-reducing features of geographical areas
Relevant Publications
“A gender-centric analysis of calling behavior in a developing country using CDRs”, V. Frias-Martinez, E. Frias-Martinez, N. Oliver, Proceedings of AAAI, 2009
“Prediction of Socioeconomic Levels using Cell Phone Records”, V. Soto, V. Frias-Martinez, J. Virseda, E. Frias-Martinez, International Conference on User Modeling, Adaptation and Personalization, UMAP'11, Industrial Track, Girona, Spain, 2011
“An Agent-Based Model Of Epidemic Spread Using Human Mobility and Social Network Information”, V. Frias-Martinez et al, 3rd International Conference on Social Computing, SocialCom '11, Boston, USA, 2011
“Moves on the street: Predicting Crime Hotspots using aggregated anonymized data on people dynamics”, A. Bogomolov, B. Lepri, J. Staiano, E. Letouzé, N. Oliver, F. Pianesi, A. Pentland, Big Data Journal, 2015
“Once Upon a Crime: Towards Crime Prediction from Demographics and Mobile Data”, A. Bogomolov, B. Lepri, J. Staiano, N. Oliver, F. Pianesi, A. Pentland, 16th ACM International Conference on Multimodal Interaction, ICMI 2014
“Flooding through the Lens of Mobile Phone Activity”, D. Pastor-Escuredo, Y. Torres Fernandez, J.M. Bauer, A. Wadhwa, C. Castro-Correa, L. Romanoff, J.G. Lee, A. Rutherford, V. Frias-Martinez, N. Oliver, E. Frias-Martinez, M. Luengo-Oroz, Proceedings of IEEE Global Humanitarian Technology Conference, GHTC 2014, Silicon Valley, CA, Oct 2014
Talk at WIRED 2013, London, UK
http://www.wired.co.uk/news/archive/2013-10/17/nuria-oliver
Opportunities
Temporal and Spatial Granularity
• Big Data can be available in real time or, if not in real time, much more frequently than data is typically collected (e.g. every 5-10 years for census data);
• Some kinds of Big Data (e.g. data about the city, collected by sensors placed in the urban infrastructure) can be collected with significantly finer spatial granularity than with traditional methods;
Accuracy
• It could be argued that some kinds of data that are relevant for the public sector (e.g. migrations) can be collected more accurately by automatic means through Big Data platforms than by manual means, which are the current state of the art.
• In addition, given that there isn’t a human in the loop, the data is less prone to human errors and potential biases introduced by humans.
Cost and Effort
• Most of the Big Data that could be used for the public sector is data that has already been collected for other purposes. In addition, Big Data is typically collected by automatic means, which makes its collection very cost-efficient;
Challenges
Risk/Benefit Analysis
• Even when adopting all sorts of precautions, if by any chance there is an error in the data extraction that leads to a privacy leak, the consequences for the business could be devastating
• Hence, the potential benefit would have to be really large to compensate for the risk.
• As this is an emergent research area, the benefits are still to be defined
Regulatory/Social
• Lack of updated regulation
• Lack of clear guidelines regarding safe data handling, processing and sharing for humanitarian purposes
• Risk of potential unintended social consequences
• Risk of creating a digital divide: unbalanced access to data and/or expertise on how to analyze it and make sense of it
Internal Barriers
• Big Data for Development/Social Good projects are typically not part of
any business unit in the Telcos;
Technical
• Representativeness of the data, generalization
• Combination of data from multiple sources
• Real-time analysis and prediction
• Lack of ground truth intervention to validate
• Significant vs substantially significant
• Correlations vs causality
Privacy/Security
• Potential privacy risks need to be minimized and understood
• Control and transparency
• Security and traceability of the data
• Clear code of conduct and ethical principles when dealing with data
• Strict access control when appropriate
Framework for Data Sharing
• Remote Access
• Question and Answers
• Limited License
• Pre-computed Indicators & Synthetic Data
[Diagram axes. Data protected through: Security (secure access to the data), Anonymization (limited or aggregated data release). Development stage: Exploratory, Applicative. Users: low to medium number of users, medium to large number of users and open data]
What can we all do to responsibly turn this
opportunity into a reality?
@nuriaoliver