Telecom Italia Big Data Challenge
Transcript of Telecom Italia Big Data Challenge
Big Data Challenge
COMP 41700
Seminars in Data Science
Summary of the presentation:
• Short introduction to the Telecom Italia Big Data Challenge – Donagh
• Summary of Paper 1 and Paper 2 – Rajesh
• Other interesting insights we can draw from this dataset – Malika
A contest designed to stimulate the creation and development of innovative technological ideas in the Big Data field
history
• In early 2014 Telecom Italia ran the first edition, which was closed
• Its success meant that the next iteration was open
• Freely available for anyone to use.
• https://dandelion.eu/datamine/open-big-data/
data sets
• Geo-referenced (Milan and the Autonomous Province of Trento)
• Anonymised
• Millions of records
• November -> December 2013
• extracted from telecom records, energy, weather, public and private transport, social networks
Milano / Trentino
• Grid
grid
Milano datasets
Domain: Datasets
Telecommunications: SMS, Call, Internet; MI to Provinces; MI to MI
Weather: Weather Station Data; Precipitation
Environment: Air Quality
News: Milano Today
Social: Tweets
tweets
• username - anonymised
• entities
• language
• municipality
• Tweet time
• geometry
Paper 1 (Anatomy and efficiency of urban multimodal mobility)
Main Goal: To find the optimal time-respecting path between two geographic locations across the layers of a multi-modal network
Where l(a,b) is the length of the quickest (time-respecting and minimal) trip on the network, and d(a,b) is the Euclidean distance from the origin 'a' to the destination 'b'
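To make "time-respecting" concrete, here is a minimal sketch of an earliest-arrival search over a timetable. It is not the paper's method: the function name `earliest_arrival`, the tuple layout, and the toy timetable are all hypothetical, and the sketch assumes every connection has a positive duration and that changing vehicles at a stop takes no time.

```python
from math import inf

def earliest_arrival(connections, source, target, start_time):
    """Return the earliest arrival time at `target`, or None if unreachable.

    `connections` holds (dep_stop, arr_stop, dep_time, arr_time) tuples
    (a hypothetical layout chosen for this sketch).
    """
    best = {source: start_time}  # earliest known arrival time per stop
    # Scanning in order of departure time guarantees that, when a connection
    # is examined, the best arrival at its departure stop is already final
    # (this relies on the positive-duration assumption).
    for dep_stop, arr_stop, dep_time, arr_time in sorted(connections, key=lambda c: c[2]):
        # Time-respecting: we must already be at dep_stop when the vehicle leaves.
        if best.get(dep_stop, inf) <= dep_time and arr_time < best.get(arr_stop, inf):
            best[arr_stop] = arr_time
    return best.get(target)

# Toy timetable in minutes; stops and times are made up for illustration.
timetable = [
    ("A", "B", 0, 10),   # bus
    ("A", "C", 5, 30),   # slow direct bus
    ("B", "C", 8, 15),   # metro that leaves before the bus from A arrives
    ("B", "C", 12, 20),  # next metro, reachable after the bus
]
print(earliest_arrival(timetable, "A", "C", 0))
```

The 08 metro from B looks faster on paper but cannot be used, because the traveller only reaches B at minute 10. That is the difference between a time-respecting path and an ordinary shortest path on the static network.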
Rail then becomes dominant at 40 km, and air travel is dominant for trips of the order of 700 km. Other transportation modes play a secondary role, with peaks at 22 km for the Metro, 40 km for Ferries and 70 km for Coaches.
The bus system covers most of the short trips, whereas the advantage of using the Metro and Rail systems emerges progressively for longer distances.
The total number of stop events Omega grows proportionally with the urban area population P.
Where C(alpha) is the number of stop events in the layer 'alpha' and Delta-t is the duration of the time interval
Paper 2 (High resolution population estimates from telecommunications data)
Data Sources:
• Telecommunications (provided by Telecom Italia)
• Census data
• Satellite images (provided by Landsat)
Main Goal: Create high-resolution (235m × 235m) population estimates in time and space
Difficulties: Population counts can change rapidly, which makes it hard to acquire local census estimates in a timely and accurate manner. The correlation coefficient between call volume and the underlying population distribution varies with time.
Building map:
41% of the area on the map is directly generated.
To classify the remaining 59%, they train a random forest classifier using OpenStreetMap data as labeled training examples.
Population Distribution:
Population is distributed exponentially at the low end:
• 29% of grid squares have zero population
• 5% of grid squares have a population of 1
• 3% of grid squares have a population of 2, and so on
39% of grid squares have a population over 100; these then follow a normal distribution with a mean of 400 persons.
Telecommunications activity:
Activity is recorded in 10-minute intervals for each of the 235m × 235m grid cells.
Communication activity is approximately log-normal.
There are 5 types of communication activity: SMSIN, SMSOUT, CALLIN, CALLOUT, and INTERNET.
Elementary Model:
Previous research has suggested the following relation between population and telecommunication activity at location i:
(w stands for call volume, p stands for population)
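As an illustration only, the sketch below fits a power-law relation p_i = alpha * w_i^beta between call volume and population by least squares in log space. The power-law form is an assumption made for this example, not a formula taken from the slide, and the function name `fit_power_law` and the synthetic data are hypothetical.

```python
import math

def fit_power_law(w, p):
    """Fit p = alpha * w**beta by ordinary least squares on (log w, log p).

    ASSUMPTION: the power-law form is chosen for illustration only.
    Cells with zero call volume or zero population are dropped, because
    their logarithm is undefined.
    """
    pairs = [(math.log(wi), math.log(pi)) for wi, pi in zip(w, p) if wi > 0 and pi > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    beta = sxy / sxx                              # slope in log-log space
    alpha = math.exp(mean_y - beta * mean_x)      # intercept mapped back
    return alpha, beta

# Synthetic, noise-free data generated from p = 2 * w**0.5 (made up).
call_volume = [1.0, 10.0, 100.0, 1000.0]
population = [2.0 * wi ** 0.5 for wi in call_volume]
alpha, beta = fit_power_law(call_volume, population)
print(round(alpha, 3), round(beta, 3))
```

Dropping zero-valued cells matters for this dataset, since a large share of grid squares have zero population; a real analysis would need to decide how to treat them rather than silently discard them.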
Not Perfect:
The relationship between call volume and population in this region is much weaker below a threshold of 351 persons.
The main reason is that densely populated areas tend to have more cell towers, which makes the relationship easier to observe there.
Model(1):
Model(2):
Try to find the best hours of call volume data:
Each type correlates most strongly during the hour from 10 am to 11 am, and as with the total call volumes, CALLOUT has the greatest correlation, approximately 0.68. Thus we use CALLOUT from 10 am to 11 am for the wi in model(2).
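The hour selection described above can be sketched as a search for the hour whose per-cell volumes have the highest Pearson correlation with population. This is a minimal illustration, not the authors' code: `pearson`, `best_hour`, and the toy numbers are hypothetical, and the sketch assumes every hour has some variation across cells (otherwise the correlation is undefined).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def best_hour(volume_by_hour, population):
    """Return (hour, correlation) for the hour that tracks population best.

    `volume_by_hour` maps an hour of day to a list of per-cell volumes,
    aligned cell by cell with `population` (a hypothetical layout).
    """
    scores = {hour: pearson(vol, population) for hour, vol in volume_by_hour.items()}
    hour = max(scores, key=scores.get)
    return hour, scores[hour]

# Toy values for four grid cells; not taken from the dataset.
cell_population = [10, 20, 30, 40]
callout_by_hour = {
    3: [5, 1, 4, 2],
    10: [1, 2, 3, 4],
    20: [2, 2, 3, 5],
}
hour, score = best_hour(callout_by_hour, cell_population)
print(hour, round(score, 3))
```

Running the same search once per activity type (SMSIN, SMSOUT, CALLIN, CALLOUT, INTERNET) and comparing the best scores would mirror the comparison described on the slide.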
Where else can we use the Telecom Italia Dataset?
Analyzing cities using the space-time structure of mobile phone network
• Attempts to connect telecom usage data from Telecom Italia mobile to the geography of human activity
• Usage of telecom data to enhance the understanding of cities as spaces of flows
Using Telecom Dataset for social network analysis
Investigating social structures through the use of network and graph theories.
Anthropology, Biology, Communication Studies, etc.
Traffic monitoring in urban areas
• Use of Telecom data to track the dense regions
• Rerouting strategies
• Increase public transport in dense areas
• Provide more taxis in dense areas
Other Usages
• User localization
• Security
• Health care: tracking users' exercise
Thank you...
Special thanks to my team members: Hao Wu and He Ping