Big Data para comprender la Dinámica Humana (datamining.dc.uba.ar/datamining/files/Charlas_y...)
GRANDATA @ 2014 – All rights reserved.
Big Data for Understanding Human Dynamics (Big Data para comprender la Dinámica Humana)
Carlos Sarraute, Grandata Labs
Hablemos de Big Data, 26 November 2014
GRANDATA
Brief presentation
Grandata
● Founded in 2012.
● Leverages advanced research in Human Dynamics (the application of “big data” to social relationships and human behavior)
● to identify market trends and predict customer actions
● integrating first-party and telco partner data.
Grandata Labs
Research team of 5 researchers, based in Buenos Aires, Argentina.
Research interests:
● Study mobility patterns, social interactions and their correlations in dynamic and mobile social networks.
● Integrate categorically different social networks to enhance our understanding of users' social behavior and interactions
– e.g. the mobile phone social network and bank transaction (spending behavior) networks
Scientific collaborations
MIT Human Networks and Mobility
● Marta Gonzalez
MIT Human Dynamics
● Alex "Sandy" Pentland
INRIA
● Aline Viana, Eric Fleury
City College of New York
● Hernan Makse
Mobile Data Source
● Mobile phone company in Mexico with ~10% of market share (over 7 million users, logs include 90 million users).
● The raw data logs of calls between their clients and external users.
● Data collected over a three-month period.
● 2,185,852,564 calls
● 2,033,719,579 messages
● Each record contains:
– hashed id of caller and callee
– date, time and duration of call
– geo-location of the caller, if the caller is a client
● Age for a subset of users (ground truth)
Overview of this work
PHONE COMPANY
● Transfer CDRs + ground truth to GRANDATA (hashed ids)
GRANDATA (GranData Servers)
● Create graph from CDRs
● Analyse ground truth
● Reaction Diffusion Algorithm
● Topological metrics
● Selecting categories from probability vectors
Recognition
Observational Study
Analyzing the Ground Truth Data
We have over 500,000 users with known age and gender.
Figure: age population pyramid (bimodal distribution)
Characterization Variables
● Number of Calls
– incoming calls / outgoing calls
– weekdays (Monday to Friday) / weekend
– "daylight" (from 7 a.m. to 7 p.m.) / "night" (before 7 a.m. and after 7 p.m.)
● Duration of Calls
● Number of SMS
● Number of Contact Days
● In/Out-degree of the Social network
● Degree of the Social network
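A minimal sketch of how call-based variables like these could be computed for one user. The record layout `(caller, callee, start, duration)`, the function name `call_features`, and the reading of "contact days" as the number of distinct days with at least one call are assumptions made for illustration, not the pipeline used in the talk.

```python
from collections import defaultdict
from datetime import datetime

def call_features(records, user):
    """Count calls for `user`, split by direction, day type and time of day."""
    feats = defaultdict(int)
    contact_days = set()
    for caller, callee, start, duration in records:
        if user not in (caller, callee):
            continue
        direction = "out" if caller == user else "in"
        day_type = "weekday" if start.weekday() < 5 else "weekend"
        # "Daylight" is taken here as 7:00 <= hour < 19:00 (assumed boundary handling).
        period = "daylight" if 7 <= start.hour < 19 else "night"
        feats[f"calls_{direction}"] += 1
        feats[f"calls_{day_type}"] += 1
        feats[f"calls_{period}"] += 1
        feats[f"duration_{direction}"] += duration
        contact_days.add(start.date())
    feats["contact_days"] = len(contact_days)
    return dict(feats)

# Hypothetical records: (caller, callee, start time, duration)
records = [
    ("a", "b", datetime(2013, 4, 15, 12, 0, 44), 36),    # Monday, daylight
    ("b", "a", datetime(2013, 4, 15, 20, 35, 32), 38),   # Monday, night
    ("a", "c", datetime(2013, 4, 20, 16, 28, 59), 206),  # Saturday, daylight
]
features = call_features(records, "a")
```

The degree variables would come from the social graph rather than from per-call counts, so they are left out of this sketch.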
Differences between genders
| Variable | Female | Male |
|---|---|---|
| Total duration | 10038.75 | 10663.17 |
| Total duration outgoing | 6359.96 | 7239.53 |
| Total duration incoming | 3678.78 | 3423.64 |
p(M|F) < p(M) = 0.5683 < p(M|M)
p(F|M) < p(F) = 0.4317 < p(F|F)
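The inequalities above can be checked on any labelled edge list. The sketch below assumes that p(M|F) means "probability that the callee is male given that the caller is female"; the slides do not define the notation, so this reading, the function name `gender_conditionals`, and the toy data are all hypothetical.

```python
from collections import Counter

def gender_conditionals(edges, gender):
    """Estimate p(callee gender | caller gender) from directed edges.

    Returns a dict keyed by (callee_gender, caller_gender).
    """
    pair_counts = Counter()
    caller_counts = Counter()
    for caller, callee in edges:
        if caller in gender and callee in gender:  # keep only labelled pairs
            pair_counts[(gender[caller], gender[callee])] += 1
            caller_counts[gender[caller]] += 1
    return {
        (g_callee, g_caller): pair_counts[(g_caller, g_callee)] / caller_counts[g_caller]
        for g_caller in caller_counts
        for g_callee in ("F", "M")
    }

# Toy data, not taken from the study.
gender = {"u1": "F", "u2": "F", "u3": "M", "u4": "M"}
edges = [("u1", "u2"), ("u1", "u2"), ("u1", "u3"),
         ("u3", "u4"), ("u3", "u4"), ("u3", "u4"), ("u4", "u1")]
p = gender_conditionals(edges, gender)
homophilous = p[("M", "F")] < p[("M", "M")] and p[("F", "M")] < p[("F", "F")]
```

Gender homophily in this sense means each gender appears more often as callee when the caller has the same gender than when the caller has the other gender.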
Age homophily
Figure: age homophily in communications behavior (matrix M)
Partly due to the double peak in the age histogram
Random links matrix (R)
Difference (M - R)
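A rough sketch of the comparison between the communication matrix M, a random links matrix R, and their difference. The slides do not say how R was generated; shuffling ages among users while keeping the links fixed is only one possible null model, chosen here for illustration. All function names are hypothetical.

```python
import random
from collections import Counter

def age_matrix(edges, age):
    """Count links for each (caller age, callee age) pair."""
    return Counter((age[u], age[v]) for u, v in edges)

def random_links_matrix(edges, age, seed=0):
    """Assumed null model: same links, ages shuffled at random among users."""
    rng = random.Random(seed)
    users = sorted(age)
    shuffled = [age[u] for u in users]
    rng.shuffle(shuffled)
    return age_matrix(edges, dict(zip(users, shuffled)))

def difference(m, r):
    """Entrywise M - R over every age pair present in either matrix."""
    return {pair: m[pair] - r[pair] for pair in set(m) | set(r)}

# Toy data, not taken from the study.
age = {"a": 20, "b": 20, "c": 45, "d": 45}
edges = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]
M = age_matrix(edges, age)
R = random_links_matrix(edges, age)
D = difference(M, R)
```

Because M and R contain the same number of links, the entries of the difference sum to zero; positive entries mark age pairs that communicate more than the null model would suggest.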
Age homophily – number of links
Inflection point
Prediction Results
Gender prediction
● Tried: Naive Bayes, Logistic Regression, Linear SVM, Linear Discriminant Analysis and Quadratic Discriminant Analysis.
● Best results: Linear SVM, Logistic Regression
● Precision obtained:
| Population | 1 | 1/2 | 1/4 | 1/8 |
|---|---|---|---|---|
| Precision | 66.3 % | 72.9 % | 77.1 % | 81.4 % |
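One plausible reading of the "Population" row is that predictions are ranked by classifier confidence and precision is measured on the most confident fraction only. That reading is an assumption, as are the function name `precision_at_fraction` and the toy scores below.

```python
def precision_at_fraction(scored, q):
    """Precision over the top fraction q of predictions ranked by confidence.

    `scored` holds (confidence, predicted_label, true_label) tuples.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * q))]
    correct = sum(1 for _, predicted, truth in kept if predicted == truth)
    return correct / len(kept)

# Toy predictions, not taken from the study.
scored = [
    (0.95, "M", "M"), (0.90, "F", "F"), (0.80, "M", "M"), (0.75, "M", "F"),
    (0.65, "F", "F"), (0.60, "M", "F"), (0.55, "F", "M"), (0.51, "M", "M"),
]
by_fraction = {q: precision_at_fraction(scored, q) for q in (1, 1/2, 1/4, 1/8)}
```

Under this reading, restricting to a smaller and more confident population trades coverage for precision, which matches the rising trend in the table.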
Age prediction
● Tried: Multinomial Logistic Regression based on node features
● Problem: Doesn't harness the network topology
– In particular the strong age homophily
Building the Social Graph
Edge list: Caller id | Callee id
● ~250 million edges
● ~70 million users
Raw data CDRs (adapted for presentation):

| Hashed origin | Hashed target | DIRECTION | TIME | CITY | LATITUD | LONGITUD | OPERATOR | DATE |
|---|---|---|---|---|---|---|---|---|
| 725BB5BFC026CB1 | 0CD8324BF87BC979 | OUTGOING | 36 | Obregon | 19.35 | -99.21 | TELCEL | 15/04/2013 12:00:44 p.m. |
| CAAEBD085D13B86 | 82B005A384D23523E | OUTGOING | 38 | Obregon | 19.35 | -99.21 | TELCEL | 15/04/2013 08:35:32 p.m. |
| F49F7DE9DDECE07 | 304B6A2B8BC8BD6D | OUTGOING | 206 | Merida | 21.01 | -89.59 | IUSACELL | 15/04/2013 04:28:59 p.m. |
Figure: social graph with seeds shown in red
- Symmetrized links
- Weights are 0 or 1
- Components with no seeds are removed
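The three construction rules can be sketched as follows. This is an illustrative in-memory version with hypothetical names (`build_seeded_graph`, `calls`, `seeds`); a graph with ~250 million edges would need a different data structure and processing strategy.

```python
from collections import defaultdict

def build_seeded_graph(calls, seeds):
    """Undirected 0/1 graph from (caller, callee) pairs, keeping seeded components."""
    adjacency = defaultdict(set)
    for caller, callee in calls:
        if caller != callee:
            adjacency[caller].add(callee)  # symmetrize: store both directions
            adjacency[callee].add(caller)  # sets make every weight 0 or 1
    kept, visited = {}, set()
    for start in adjacency:
        if start in visited:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search of one connected component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        visited |= component
        if component & seeds:  # drop components without any seed
            for node in component:
                kept[node] = adjacency[node]
    return kept

# Toy data: repeated calls collapse to one link; the x-y component has no seed.
calls = [("a", "b"), ("a", "b"), ("b", "c"), ("x", "y")]
graph = build_seeded_graph(calls, seeds={"c"})
```

Removing seedless components makes sense for a propagation method, since nodes there could never receive label information from a seed.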
Reaction Diffusion Algorithm
Equation annotations: category probability, information flow, reactive term, diffusion term, graph Laplacian, tuning parameter, age category.
Age categories: [10-24, 25-34, 35-50, 50+]
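The equation itself did not survive in this transcript, so the following is only a generic sketch of a reaction-diffusion style label propagation, not the exact update rule from the talk. It assumes that each node holds a probability vector over the four age categories, that the reactive term pulls a node back toward its initial state (one-hot for seeds, uniform otherwise), that the diffusion term averages the neighbours' vectors, and that a tuning parameter `lam` balances the two. The names `reaction_diffusion`, `lam`, and `steps` are hypothetical.

```python
AGE_CATEGORIES = ["10-24", "25-34", "35-50", "50+"]

def reaction_diffusion(adjacency, seed_category, lam=0.5, steps=30):
    """Spread seed age categories over a graph; returns one probability vector per node."""
    k = len(AGE_CATEGORIES)
    # Initial state: one-hot vectors on seeds, uniform vectors elsewhere.
    initial = {
        node: [1.0 if seed_category[node] == c else 0.0 for c in range(k)]
        if node in seed_category else [1.0 / k] * k
        for node in adjacency
    }
    state = {node: vec[:] for node, vec in initial.items()}
    for _ in range(steps):
        updated = {}
        for node, neighbours in adjacency.items():
            if neighbours:
                mean = [sum(state[n][c] for n in neighbours) / len(neighbours)
                        for c in range(k)]
            else:
                mean = state[node]
            # Reactive term (weight lam) plus diffusion term (weight 1 - lam).
            updated[node] = [lam * initial[node][c] + (1 - lam) * mean[c]
                             for c in range(k)]
        state = updated
    return state

# Toy path graph a - b - c - d with seeds at both ends.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
state = reaction_diffusion(adjacency, seed_category={"a": 0, "d": 3})
best = {node: max(range(4), key=vec.__getitem__) for node, vec in state.items()}
```

Each update is a convex combination of probability vectors, so every node's vector keeps summing to 1. In the toy graph, the unlabelled node next to each seed leans toward that seed's category, which is how such a scheme would exploit age homophily.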
Selecting age categories from the probability state
● Maximum probability: for each node, select the category with the highest probability.
● Pyramid scaling: select category values using maximum probability, constrained to reproduce the population pyramid given by the seed nodes.
Figure panels: maximum probability, pyramid scaling
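The two selection rules can be contrasted with a short sketch. The slides do not specify how the pyramid constraint is enforced; the greedy quota-based assignment below is one simple heuristic assumed for illustration, and the names `max_probability`, `pyramid_scaling`, and `pyramid` are hypothetical.

```python
def max_probability(state):
    """Pick, for each node, the category index with the highest probability."""
    return {node: max(range(len(vec)), key=vec.__getitem__)
            for node, vec in state.items()}

def pyramid_scaling(state, pyramid):
    """Greedy assignment whose category sizes follow the seed pyramid.

    `pyramid` gives the target share of each category (shares sum to 1).
    """
    total = len(state)
    quota = [round(share * total) for share in pyramid]
    quota[quota.index(max(quota))] += total - sum(quota)  # absorb rounding error
    # Visit (probability, node, category) triples from most to least confident.
    candidates = sorted(
        ((vec[c], node, c) for node, vec in state.items() for c in range(len(vec))),
        reverse=True,
    )
    chosen = {}
    for _, node, category in candidates:
        if node not in chosen and quota[category] > 0:
            chosen[node] = category
            quota[category] -= 1
    return chosen

# Toy probability vectors over the 4 age categories.
state = {
    "n1": [0.7, 0.1, 0.1, 0.1],
    "n2": [0.6, 0.2, 0.1, 0.1],
    "n3": [0.5, 0.3, 0.1, 0.1],
    "n4": [0.1, 0.1, 0.2, 0.6],
}
by_max = max_probability(state)
by_pyramid = pyramid_scaling(state, pyramid=[0.5, 0.25, 0.0, 0.25])
```

In this toy case maximum probability puts three of four nodes in the first category, while the pyramid constraint caps that category at half the population and moves the least confident of the three to its second choice.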
Precision obtained for age prediction
| Population | Machine Learning | Reaction-Diffusion |
|---|---|---|
| q = 1 | 36.9 % | 43.4 % |
| q = 1/2 | 42.9 % | 47.2 % |
| q = 1/4 | 48.4 % | 56.1 % |
| q = 1/8 | 52.7 % | 62.3 % |

Predictions are generated among 4 age categories.
Conclusions
● First extensive study of social interactions in the country of Mexico focusing on gender and age, based on mobile phone usage.
● Gender homophily, and an asymmetry with respect to incoming and outgoing calls between men and women
● Strong age homophily
● Among standard Machine Learning tools, Logistic Regression and Linear SVM gave the best results
● A purely graph-based Reaction-Diffusion algorithm for age prediction
Future steps
● Analysis of the performance as a function of parameters and topological metrics
● Add mobility information
– Differences in mobility patterns between genders and age groups
● Apply this methodology to predict users' spending behavior