Data Warehousing for Social Network and Human Mobility...

38
Data Warehousing for Social Network and Human Mobility Research and Detecting Real-World Events in Space and Time Using Feature Clustering Alec Pawling Department of Computer Science and Engineering University of Notre Dame December 7, 2009 1 / 38

Transcript of Data Warehousing for Social Network and Human Mobility...

Page 1: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Data Warehousing for Social Network andHuman Mobility Research and Detecting

Real-World Events in Space and Time UsingFeature Clustering

Alec Pawling

Department of Computer Science and EngineeringUniversity of Notre Dame

December 7, 2009

1 / 38

Page 2: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview

1. Introduction

2. Background on Database Management Systems and DataWarehousing

3. Designing a Data Warehouse for Social Network and HumanMobility Research

4. Approach for Merging Noisy Data in a Scientific DataWarehouse

5. Background on Data Clustering

6. Online Cluster Analysis of Spatially Partitioned Phone UsageData

7. Feature Clustering for Data Steering in Dynamic Data DrivenApplication Systems

8. Summary

2 / 38

Page 3: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview: Dynamic Data Driven Application Systems1

Utilize simulations and models that can

I Incorporate newly available data online, and

I Steer the data collection mechanism to obtain the most usefuldata

Areas of application:

I Population movement forecasting (Schoenharl and Madey2008)

I Forecasting the spread of fire: wildfire (Mendel et al, 2007), inbuildings (Chaturvedi, et al., 2005)

I Weather (Ramakrishnan et al., 2007), hurricane (Allen, 2007)forecasting

I and many others.

1Portions of this work were funded by the National Science Foundation:DDDAS Program, Grant CNS-0540348.

3 / 38

Page 4: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview: Wireless Phone Based Emergency ResponseSystem

4 / 38

Page 5: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview: Road Map

In this presentation, we will cover two aspects of the WIPERsystem:

I The Historical Data SourceI Data partitioning aspect of the warehouse design.I Issues arising from the data merge.

I Detection and Alert SystemI Feature clustering approach for event detection.

5 / 38

Page 6: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

A Data Warehouse for Social Network and HumanMobility Research

6 / 38

Page 7: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview: Data Warehousing

Data warehouses store long term historical data organized in sucha way that it can be effectively used for analysis (traditionallydecision support in a business environment).

I Data residing in a warehouse are often from different sources(departments).

I The warehouse is non-volatile: once loaded, the data are notupdated.

7 / 38

Page 8: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview: The Phone Data

Two major components:I Usage data

I Call data: billing records of service usage initiated bycustomers of the company.

I Interconnect data: billing records of service usage initiated byusers that are not customers of the company and received bycustomers of the company.

I Call Record Data (CDR): usage records with towerinformation. Four types of records:

I MOC: record for a voice call from the originating towerI MTC: record for a voice call from the terminating towerI SOM: record for an SMS from the originating towerI STM: record for an SMS from the terminating tower

I Customer dataI Defines the set of “in-company” users.I Users that are not in the customer data are “out-of-company”.

8 / 38

Page 9: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Motivation

Data used for:

I Development of the Simulation and Prediction and Detectionand Alert Systems of WIPER.

I Social network and human mobility research.

Problems:

I Large (5 TB) set of flat files.

I Inefficient de-identification mechanism (phone numbersreplaced by strings)

I Data preparation is time consuming, error prone, and oftenredundant.

Goal: Develop a repository that allows efficient extraction ofrelevant subsets of the data.

9 / 38

Page 10: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Related Work:

Gray et al., 2005. Advocate for the use of database managementsystems instead of file formats traditionally used in scientificresearch.

Examples of scientific databases:

I Bioinformatics: INTERACT (Eilbeck, et al, 1999)

I Astronomy: Sloan Digital Sky Survey (O’Mullane, et al.,2005)

I High Energy Physics: BaBar (CERN) (Becla and Wang, 2005)

10 / 38

Page 11: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Warehouse Design

Design considerations: (Inmon, 2005)

I Granularity: Level of detail.

I Partitioning: Divide the data into manageable chunks.

11 / 38

Page 12: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Data Partitioning

Guided by the way in which the data has been historically used:groups of greatest interest are voice calls and “in-company” users.

I Onnela et al. 2007. Structure and Tie Strengths in MobileCommunication Networks. Proceedings of the NationalAcademy of Sciences.

I Gonzalez et al. 2008. Understanding Human MobilityPatterns. Nature.

I Wang et al. 2009. Understanding the Spreading Patterns ofMobile Phone Viruses. Science.

12 / 38

Page 13: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Data Partitioning

13 / 38

Page 14: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Usage Data Integration: Partition and Merge

14 / 38

Page 15: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Introduction of Duplicates by Merge

0.0000.1000.2000.3000.4000.5000.6000.7000.800

0 10 20 30 40 50 60

Pro

babi

lity

Intercall time (seconds)

Merged intercall time

0.0000.0050.0100.0150.0200.0250.0300.035

0 10 20 30 40 50 60

Pro

babi

lity

Intercall time for MOC records

0.0000.0050.0100.0150.0200.0250.030

0 10 20 30 40 50 60

Intercall time for MTC records

10-4

10-3

10-2

10-1

100

100 101

Intercall time (seconds)

Merged intercall time (log-log)

15 / 38

Page 16: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Cleaning

I Consider only records that haven’t been matched (initiallyrecords with a value in exactly one of the two antenna fields).

I Add perturbation to the start time of records having a value inonly the terminating antenna field and attempt to match withrecords having a value in only the originating antenna field.

I Sequence of perturbation: +1,−1, +2,−2, · · · + 60,−60.

I Fields matched at each step are removed from consideration:I Match records with smallest possible perturbation.I Match each record only once.

16 / 38

Page 17: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Results

0.0000.1000.2000.3000.4000.5000.6000.7000.800

0 10 20 30 40 50 60

Pro

babi

lity

Intercall time (seconds)

Merged intercall time

0.0000.0050.0100.0150.0200.0250.0300.035

0 10 20 30 40 50 60

Pro

babi

lity

Intercall time for MOC records

0.0000.0050.0100.0150.0200.0250.030

0 10 20 30 40 50 60

Intercall time for MTC records

0.0000.0050.0100.0150.0200.0250.0300.035

0 10 20 30 40 50 60

Intercall time (seconds)

Merged intercall time (corrected)

17 / 38

Page 18: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Results

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 10 20 30 40 50 60

Cum

ulat

ive

prob

abili

ty

Time difference

Merged (uncorrected)MOC

Merged (corrected)Call

MTC

18 / 38

Page 19: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Consequences

I In many cases, these duplicates can be ignored, e.g.I Building graphs with edges weighted by total duration or cost

(Onnela et al, 2007).I Generating user trajectories (Gonzalez et al, 2008).

I Can be problematic when generating time series of callactivity.

19 / 38

Page 20: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Summary

I Data warehouse contains 14 months of data.I Data loading process in highly automated:

I Two scripts: customer and usage dataI Each script:

I validates fieldsI loads the initial tablesI replaces user strings with integersI integrates the various record typesI usage script partitions the records

I Data extract from the warehouse is in use for a social networkstudy

I Comparable data extraction from text files takes weeks: codeis written to identify relevant users, replace user ID strings,filter data.

I One year of data can be extracted in one day using a singleSQL query.

20 / 38

Page 21: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Detecting Real-World Events in Space and TimeUsing Feature Clustering

21 / 38

Page 22: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Overview

22 / 38

Page 23: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Key Observation: Certain Events Cause Localized Changesin Call Activity

23 / 38

Page 24: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Clustering

Let D be an n ×m matrix:

I n is the number of observations

I m is the number of features per observation

I In this case, features are the call activity at each spatial area(postal code, tower). Data item is recorded every 10 minutes.

Goal: partition the n rows D into a natural grouping.

I Minimize intra-cluster distance.

I Maximize inter-cluster distance.

We want to cluster the time series, so we transpose D beforeclustering; this is called feature clustering.

24 / 38

Page 25: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Related Work

I Rodrigues et al. (2004). Stream feature clustering algorithm.I Correlation dissimilarity: sufficient statistics require quadratic

space with respect to the number of features.

I Aggarwal et al. (2003).I If entire history is used, stale data dominates results.

25 / 38

Page 26: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Approach

Dataset:

I Aggregate records based on spatial area: postal code or tower.

I Time series: number of calls in 10 minute intervals for eachspatial area.

Clustering:

I Single link clustering over a 1 day sliding window.

Event Detection:

I Outliers are detected by examining maximum distancebetween clusters.

I Baseline is computed using the maximum distance betweenclusters over the 2 months of data.

Two events:

I An explosion.

I Celebration following a sports victory.

26 / 38

Page 27: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Results: Bombing

50

100

150

200

250

300

-6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6Max

imum

dis

tanc

e be

twee

n cl

uste

rs

Time (hours)

µ[µ - σ, µ + σ]

Event

Figure: The maximum distance between clusters for a 12 hour windowaround the bombing with the mean and ± 1 standard deviation over 2months. The bombing occurs between time = 0 and time = 0.5(conflicting news reports)

27 / 38

Page 28: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Results: Bombing

1

6010

014

018

0

Cluster Dendrogram at Time = 0

Dis

tanc

e

1

6010

016

0

Cluster Dendrogram at Time = 1

Dis

tanc

e

1

5015

025

0 Cluster Dendrogram at Time = 2

Dis

tanc

e

28 / 38

Page 29: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Results: Celebration

20 40 60 80

100 120 140 160 180 200 220

-6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6Max

imum

dis

tanc

e be

twee

n cl

uste

rs

Time (hours)

µ[µ - σ, µ + σ]

Event

Figure: The maximum distance between clusters for a 12 hour windowaround the celebration with the mean and ± 1 standard deviation over 2months. Important events are: at approximately time −1 the sportingevent ends, at time = 0, the crowd has gathered to meet the team, andat approximately time = 1 the team leaves the celebration.

29 / 38

Page 30: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Results: Celebration

5 1 3 4 2

040

80

Cluster Dendrogram at Time = −1

Dis

tanc

e

5

1 2 3 4

050

100

Cluster Dendrogram at Time = 0

Dis

tanc

e

51 3 2 4

050

150

Cluster Dendrogram at Time = 1

Dis

tanc

e

30 / 38

Page 31: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Summary

I In cases were events cause localized changes in call activity,we can use feature clustering to detect the events and theirapproximate location.

I This achieves our goals of automating event detection andconstraining the area that must be simulated by theSimulation and Prediction System.

31 / 38

Page 32: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Summary

32 / 38

Page 33: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Summary of Research

Two components:I Historical Data Source. Data repository for:

I development of WIPER system componentsI social network and human mobility research

I Feature clustering algorithm for event detectionI In cases where event causes a localized change in call activity,

the event and it’s approximate location can be identified.

33 / 38

Page 34: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

First Author Publications

I Alec Pawling, Ping Yan, Julian Candia, Tim Schoenharl, andGreg Madey. “Anomaly Detection in Streaming Sensor Data,”In: Intelligent Techniques for Warehousing and Mining SensorNetwork Data. Alfredo Cuzzocrea, Ed., 2009.

I Alec Pawling and Greg Madey. “Feature Clustering for DataSteering in Dynamic Data Driven Application Systems.”Proceedings of the 9th International Conference onComputational Science, 2009.

I Alec Pawling, Tim Schoenharl, Ping Yan, and Greg Madey.“WIPER: An Emergency Response System.” In: Proceedingsof the 5th International ISCRAM Conference, 2008.

I Alec Pawling, Nitesh Chawla, and Greg Madey. “AnomalyDetection in a Mobile Communication Network.”Computational & Mathematical Organization Theory. 13:4,December, 2007.

34 / 38

Page 35: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

First Author Publications

I Alec Pawling, Nitesh Chawla, and Greg Madey. “AnomalyDetection in a Mobile Communication Network.” In:Proceedings of the Annual Conference of the North AmericanAssociation for Computational Social and OrganizationSciences. 2006. (Received Best Student Paper Award).

I Alec Pawling, Nitesh Chawla, and Amitabh Chaudary.“Evaluation of Summarization Schemes for Learning inStreams.” In: Proceedings of the 10th European Conferenceon Principles and Practice of Knowledge Discovery inDatabases.” 2006.

I Alec Pawling and Nitesh V. Chawla, and Amitabh Chaudhary.“Computing Information Gain in Data Streams.” In:Proceedings of the ICDM Workshop on Temporal DataMining: Algorithms, Theory, and Applications. 2005.

35 / 38

Page 36: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Other Publications

I Gregory R. Madey, Albert-Laszlo Barabasi, Nitesh V. Chawla,Marta Gonzalez, David Hachen, Brett Lantz, Alec Pawling,Timothy Schoenharl, Gabor Szabo, Pu Wang and Ping Yan,”Enhanced Situational Awareness: Application of DDDASConcepts to Emergency and Disaster Management ”, inInternational Conference on Computational Science, serialLecture Notes in Computer Science (LNCS 4487), Y. Shi, G.D. van Albada, J. Dongarra, and P. M. A. Sloot, Eds., May2007, pp. 1090-1097.

36 / 38

Page 37: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

Publication Plan

I Data warehouse implementation (Chapters 3, 4): Plannedsubmission to the International Journal of Data Warehousingand Mining, pending revision.

I Feature clustering approach for event detection (Chapter 7):Planned submission to IEEE Transactions on Patern Analysisand Machine Intelligence, pending revision.

37 / 38

Page 38: Data Warehousing for Social Network and Human Mobility ...dddas/Papers/Pawling_dissertation_slides_fin… · of the 5th International ISCRAM Conference, 2008. I Alec Pawling, Nitesh

AcknowledgmentsThis material is based upon work supported by the NationalScience Foundation under Grant Nos. CNS-0540348 andBCS-0826958, Northeastern University, and the Defense ThreatReduction Agency under Grant BRBAA07-J-2-0035.

I would also like to thank:I My advisor: Dr. Madey.I My committee: Dr. Chawla, Dr. Chaudhary, and Dr.

Poellabauer.I The outside chair: Dr. Hachen.I Colleagues in Dr. Madey’s Group: Tim Schoenharl, Ping Yan,

Ryan Bravo, and Ryan McCune.I Dr. Barabasi and the members of the Center for Complex

Network Research at Northheastern University. In particular, Iam grateful to Dr. Marta Gonzalez, Dr. Julian Candia, Dr.Sune Lehmann, Dr. Jim Bagrow, Nick Blumm and DashunWang.

38 / 38