Data Warehousing for Social Network and Human Mobility Research and Detecting Real-World Events in Space and Time Using Feature Clustering

Alec Pawling

Department of Computer Science and Engineering

University of Notre Dame

December 7, 2009

1 / 38

Overview

1. Introduction

2. Background on Database Management Systems and Data Warehousing

3. Designing a Data Warehouse for Social Network and Human Mobility Research

4. Approach for Merging Noisy Data in a Scientific Data Warehouse

5. Background on Data Clustering

6. Online Cluster Analysis of Spatially Partitioned Phone Usage Data

7. Feature Clustering for Data Steering in Dynamic Data Driven Application Systems

8. Summary

2 / 38

Overview: Dynamic Data Driven Application Systems [1]

Utilize simulations and models that can:

- Incorporate newly available data online, and
- Steer the data collection mechanism to obtain the most useful data

Areas of application:

- Population movement forecasting (Schoenharl and Madey, 2008)
- Forecasting the spread of fire: wildfire (Mendel et al., 2007), in buildings (Chaturvedi et al., 2005)
- Weather (Ramakrishnan et al., 2007), hurricane (Allen, 2007) forecasting
- and many others.

[1] Portions of this work were funded by the National Science Foundation: DDDAS Program, Grant CNS-0540348.

3 / 38

Overview: Wireless Phone Based Emergency Response System

4 / 38

Overview: Road Map

In this presentation, we will cover two aspects of the WIPER system:

- The Historical Data Source
  - Data partitioning aspect of the warehouse design.
  - Issues arising from the data merge.
- Detection and Alert System
  - Feature clustering approach for event detection.

5 / 38

A Data Warehouse for Social Network and Human Mobility Research

6 / 38

Overview: Data Warehousing

Data warehouses store long-term historical data organized in such a way that it can be effectively used for analysis (traditionally decision support in a business environment).

- Data residing in a warehouse are often from different sources (departments).
- The warehouse is non-volatile: once loaded, the data are not updated.

7 / 38

Overview: The Phone Data

Two major components:

- Usage data
  - Call data: billing records of service usage initiated by customers of the company.
  - Interconnect data: billing records of service usage initiated by users that are not customers of the company and received by customers of the company.
  - Call Record Data (CDR): usage records with tower information. Four types of records:
    - MOC: record for a voice call from the originating tower
    - MTC: record for a voice call from the terminating tower
    - SOM: record for an SMS from the originating tower
    - STM: record for an SMS from the terminating tower
- Customer data
  - Defines the set of “in-company” users.
  - Users that are not in the customer data are “out-of-company”.

8 / 38

Motivation

Data used for:

- Development of the Simulation and Prediction and Detection and Alert Systems of WIPER.
- Social network and human mobility research.

Problems:

- Large (5 TB) set of flat files.
- Inefficient de-identification mechanism (phone numbers replaced by strings).
- Data preparation is time consuming, error prone, and often redundant.

Goal: Develop a repository that allows efficient extraction of relevant subsets of the data.

9 / 38

Related Work:

Gray et al., 2005. Advocate for the use of database management systems instead of the file formats traditionally used in scientific research.

Examples of scientific databases:

- Bioinformatics: INTERACT (Eilbeck et al., 1999)
- Astronomy: Sloan Digital Sky Survey (O’Mullane et al., 2005)
- High Energy Physics: BaBar (CERN) (Becla and Wang, 2005)

10 / 38

Warehouse Design

Design considerations (Inmon, 2005):

- Granularity: Level of detail.
- Partitioning: Divide the data into manageable chunks.

11 / 38

Data Partitioning

Guided by the way in which the data has been historically used: groups of greatest interest are voice calls and “in-company” users.

- Onnela et al. 2007. Structure and Tie Strengths in Mobile Communication Networks. Proceedings of the National Academy of Sciences.
- Gonzalez et al. 2008. Understanding Human Mobility Patterns. Nature.
- Wang et al. 2009. Understanding the Spreading Patterns of Mobile Phone Viruses. Science.

12 / 38

Data Partitioning

13 / 38

Usage Data Integration: Partition and Merge

14 / 38

Introduction of Duplicates by Merge

[Figure: four panels plotting probability against intercall time (seconds, 0 to 60): merged intercall time, intercall time for MOC records, intercall time for MTC records, and merged intercall time on a log-log scale.]
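One way a merge could introduce duplicates can be sketched as follows. The sketch assumes that the originating (MOC) and terminating (MTC) towers log the same call with slightly different start times, so that a merge on an exact key keeps both records and produces very short intercall times. The record layout, the sample values, and the function names are hypothetical and are not taken from the warehouse itself.

```python
from collections import defaultdict

# Hypothetical record layout: (caller, callee, start_time_seconds, record_type).
moc = [("a", "b", 100, "MOC"), ("a", "c", 400, "MOC")]
mtc = [("a", "b", 102, "MTC"), ("a", "c", 400, "MTC")]


def exact_merge(moc_records, mtc_records):
    """Merge on an exact (caller, callee, start) key; unmatched records are kept."""
    merged = {}
    for caller, callee, start, _ in moc_records + mtc_records:
        merged[(caller, callee, start)] = (caller, callee, start)
    return sorted(merged.values(), key=lambda r: r[2])


def intercall_times(records):
    """Seconds between consecutive calls made by each caller."""
    by_caller = defaultdict(list)
    for caller, _, start in records:
        by_caller[caller].append(start)
    gaps = []
    for starts in by_caller.values():
        starts.sort()
        gaps.extend(b - a for a, b in zip(starts, starts[1:]))
    return gaps


merged = exact_merge(moc, mtc)   # the a->b call survives twice (t=100 and t=102)
gaps = intercall_times(merged)   # the duplicate shows up as a 2-second gap
```

In this toy input the call whose two records agree on the start time collapses to one row, while the call with a 2-second disagreement is kept twice.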

15 / 38

Cleaning

- Consider only records that haven’t been matched (initially, records with a value in exactly one of the two antenna fields).
- Add a perturbation to the start time of records having a value in only the terminating antenna field and attempt to match them with records having a value in only the originating antenna field.
- Sequence of perturbations: +1, −1, +2, −2, ..., +60, −60.
- Records matched at each step are removed from consideration:
  - Match records with the smallest possible perturbation.
  - Match each record only once.
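The matching procedure above can be sketched in a few lines. This is a minimal illustration, not the warehouse's actual implementation: records are assumed to be hypothetical `(caller, callee, start_time)` tuples, and `match_records` and `perturbation_sequence` are names introduced here for the example.

```python
def perturbation_sequence(max_shift=60):
    """Yield +1, -1, +2, -2, ..., +max_shift, -max_shift."""
    for k in range(1, max_shift + 1):
        yield k
        yield -k


def match_records(orig_only, term_only, max_shift=60):
    """Pair terminating-only records with originating-only records.

    Each record is a hypothetical (caller, callee, start_time) tuple.
    Returns (pairs, unmatched_orig, unmatched_term).
    """
    # Index originating-only records by key; a list per key handles repeats.
    open_orig = {}
    for rec in orig_only:
        open_orig.setdefault(rec, []).append(rec)
    open_term = list(term_only)
    pairs = []
    for shift in perturbation_sequence(max_shift):
        still_open = []
        for caller, callee, start in open_term:
            candidates = open_orig.get((caller, callee, start + shift))
            if candidates:
                # Shifts are tried in order of increasing size, so the first
                # hit uses the smallest perturbation; pop() ensures that each
                # record is matched only once.
                pairs.append((candidates.pop(), (caller, callee, start), shift))
            else:
                still_open.append((caller, callee, start))
        open_term = still_open
    unmatched_orig = [r for recs in open_orig.values() for r in recs]
    return pairs, unmatched_orig, open_term


orig_only = [("a", "b", 100), ("a", "c", 500)]
term_only = [("a", "b", 102), ("a", "c", 600), ("a", "b", 99)]
pairs, unmatched_orig, unmatched_term = match_records(orig_only, term_only)
```

In this example the terminating record at time 99 claims the originating record at time 100 with a shift of +1, so the record at time 102, which would need a shift of −2, stays unmatched.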

16 / 38

Results

[Figure: four panels plotting probability against intercall time (seconds, 0 to 60): merged intercall time, intercall time for MOC records, intercall time for MTC records, and merged intercall time (corrected).]

17 / 38

Results

[Figure: cumulative probability (0 to 1) against time difference (0 to 60); series: Merged (uncorrected), Merged (corrected), MOC, MTC, Call.]

18 / 38

Consequences

- In many cases, these duplicates can be ignored, e.g.
  - Building graphs with edges weighted by total duration or cost (Onnela et al., 2007).
  - Generating user trajectories (Gonzalez et al., 2008).
- Can be problematic when generating time series of call activity.

19 / 38

Summary

- Data warehouse contains 14 months of data.
- Data loading process is highly automated:
  - Two scripts: customer and usage data
  - Each script:
    - validates fields
    - loads the initial tables
    - replaces user strings with integers
    - integrates the various record types
    - usage script partitions the records
- Data extract from the warehouse is in use for a social network study.
- Comparable data extraction from text files takes weeks: code is written to identify relevant users, replace user ID strings, and filter data.
- One year of data can be extracted in one day using a single SQL query.
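To make the load and extract steps concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for the warehouse DBMS. The schema, the table and column names, and the sample rows are all hypothetical. It shows user strings being replaced by integer keys at load time, and a single query extracting voice calls between “in-company” users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical schema; table and column names are illustrative only.
cur.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_string TEXT UNIQUE);
    CREATE TABLE customers (user_id INTEGER PRIMARY KEY);
    CREATE TABLE usage (caller INTEGER, callee INTEGER, start_time INTEGER,
                        kind TEXT);
""")

raw_usage = [  # de-identified strings as they might appear in the flat files
    ("x9f2", "k31a", 100, "voice"),
    ("k31a", "x9f2", 250, "sms"),
    ("x9f2", "zz07", 300, "voice"),
]
raw_customers = ["x9f2", "k31a"]


def user_id(user_string):
    """Return the integer key for a user string, creating it if needed."""
    cur.execute("INSERT OR IGNORE INTO users (user_string) VALUES (?)",
                (user_string,))
    cur.execute("SELECT user_id FROM users WHERE user_string = ?",
                (user_string,))
    return cur.fetchone()[0]


for caller, callee, start, kind in raw_usage:
    cur.execute("INSERT INTO usage VALUES (?, ?, ?, ?)",
                (user_id(caller), user_id(callee), start, kind))
for s in raw_customers:
    cur.execute("INSERT INTO customers VALUES (?)", (user_id(s),))

# A single query extracts the voice calls between in-company users.
rows = cur.execute("""
    SELECT u.caller, u.callee, u.start_time
    FROM usage AS u
    JOIN customers AS c1 ON c1.user_id = u.caller
    JOIN customers AS c2 ON c2.user_id = u.callee
    WHERE u.kind = 'voice'
    ORDER BY u.start_time
""").fetchall()
```

Joining on integer keys rather than on long strings is one plausible reason the extract is faster than the text-file approach, although the slides do not break down where the time goes.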

20 / 38

Detecting Real-World Events in Space and Time Using Feature Clustering

21 / 38

Overview

22 / 38

Key Observation: Certain Events Cause Localized Changes in Call Activity

23 / 38

Clustering

Let D be an n × m matrix:

- n is the number of observations
- m is the number of features per observation
- In this case, features are the call activity at each spatial area (postal code, tower). A data item is recorded every 10 minutes.

Goal: partition the n rows of D into a natural grouping.

- Minimize intra-cluster distance.
- Maximize inter-cluster distance.

We want to cluster the time series, so we transpose D before clustering; this is called feature clustering.
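A small worked example may help. The matrix values below are invented, and Euclidean distance is an assumption made only for illustration; the point is that after the transpose, each row to be clustered is the time series of one spatial area.

```python
# Rows are observations (10-minute intervals); columns are features (areas).
# The counts are made up for illustration.
D = [
    [5, 50, 7],
    [6, 52, 30],
    [5, 49, 31],
    [7, 51, 8],
]


def transpose(matrix):
    """Swap rows and columns so that each row is one feature's time series."""
    return [list(col) for col in zip(*matrix)]


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


series = transpose(D)  # 3 rows (areas), each with 4 observations
pairwise = {
    (i, j): euclidean(series[i], series[j])
    for i in range(len(series))
    for j in range(i + 1, len(series))
}
```

A clustering algorithm applied to `series` then groups spatial areas with similar call activity, rather than grouping time intervals.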

24 / 38

Related Work

- Rodrigues et al. (2004). Stream feature clustering algorithm.
  - Correlation dissimilarity: sufficient statistics require quadratic space with respect to the number of features.
- Aggarwal et al. (2003).
  - If entire history is used, stale data dominates results.

25 / 38

Approach

Dataset:

- Aggregate records based on spatial area: postal code or tower.
- Time series: number of calls in 10 minute intervals for each spatial area.

Clustering:

- Single link clustering over a 1 day sliding window.

Event Detection:

- Outliers are detected by examining maximum distance between clusters.
- Baseline is computed using the maximum distance between clusters over the 2 months of data.

Two events:

- An explosion.
- Celebration following a sports victory.
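The approach above can be sketched with a naive single link clustering in pure Python. Several choices here are assumptions made for illustration and are not confirmed by the slides: Euclidean distance between windows, taking the "maximum distance between clusters" to be the largest merge distance in the dendrogram, and flagging a window when its score exceeds the baseline mean by more than `k` standard deviations. The data are invented, and a 4-interval window stands in for the 1 day window.

```python
from statistics import mean, stdev


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def max_merge_distance(series):
    """Run single link agglomerative clustering; return the largest merge distance."""
    clusters = [[s] for s in series]
    largest = 0.0
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        largest = max(largest, d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return largest


def window_scores(counts_by_area, window):
    """Score every sliding window of call counts (one list per spatial area)."""
    n = len(next(iter(counts_by_area.values())))
    scores = []
    for end in range(window, n + 1):
        series = [c[end - window:end] for c in counts_by_area.values()]
        scores.append(max_merge_distance(series))
    return scores


def flag_outliers(scores, baseline, k=1.0):
    """Indices of windows scoring above baseline mean + k standard deviations."""
    threshold = mean(baseline) + k * stdev(baseline)
    return [i for i, s in enumerate(scores) if s > threshold]


quiet = {"A": [10, 11, 10, 12, 10, 11, 10, 12],
         "B": [12, 12, 11, 13, 12, 12, 11, 14],
         "C": [11, 10, 11, 11, 11, 10, 11, 11]}
event = {"A": [10, 11, 10, 12, 10, 11, 10, 12, 10, 11, 10, 12],
         "B": [12, 12, 11, 13, 12, 12, 11, 13, 12, 12, 11, 13],
         "C": [11, 10, 11, 11, 11, 10, 11, 11, 11, 60, 11, 11]}

baseline = window_scores(quiet, window=4)
flagged = flag_outliers(window_scores(event, window=4), baseline)
```

Only the windows that contain the spike in area C are flagged. The area that joins the dendrogram last in a flagged window gives an approximate location for the event.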

26 / 38

Results: Bombing

[Plot: maximum distance between clusters against time (hours, −6 to 6); legend: µ, [µ − σ, µ + σ], Event.]

Figure: The maximum distance between clusters for a 12 hour window around the bombing, with the mean and ± 1 standard deviation over 2 months. The bombing occurs between time = 0 and time = 0.5 (conflicting news reports).

27 / 38

Results: Bombing

[Figure: cluster dendrograms (distance on the vertical axis) at time = 0, time = 1, and time = 2.]

28 / 38

Results: Celebration

[Plot: maximum distance between clusters against time (hours, −6 to 6); legend: µ, [µ − σ, µ + σ], Event.]

Figure: The maximum distance between clusters for a 12 hour window around the celebration, with the mean and ± 1 standard deviation over 2 months. Important events are: at approximately time = −1 the sporting event ends, at time = 0 the crowd has gathered to meet the team, and at approximately time = 1 the team leaves the celebration.

29 / 38

Results: Celebration

[Figure: cluster dendrograms (distance on the vertical axis, leaves labeled 1 to 5) at time = −1, time = 0, and time = 1.]

30 / 38

Summary

- In cases where events cause localized changes in call activity, we can use feature clustering to detect the events and their approximate location.
- This achieves our goals of automating event detection and constraining the area that must be simulated by the Simulation and Prediction System.

31 / 38

Summary

32 / 38

Summary of Research

Two components:

- Historical Data Source. Data repository for:
  - development of WIPER system components
  - social network and human mobility research
- Feature clustering algorithm for event detection
  - In cases where an event causes a localized change in call activity, the event and its approximate location can be identified.

33 / 38

First Author Publications

- Alec Pawling, Ping Yan, Julian Candia, Tim Schoenharl, and Greg Madey. “Anomaly Detection in Streaming Sensor Data.” In: Intelligent Techniques for Warehousing and Mining Sensor Network Data. Alfredo Cuzzocrea, Ed., 2009.
- Alec Pawling and Greg Madey. “Feature Clustering for Data Steering in Dynamic Data Driven Application Systems.” Proceedings of the 9th International Conference on Computational Science, 2009.
- Alec Pawling, Tim Schoenharl, Ping Yan, and Greg Madey. “WIPER: An Emergency Response System.” In: Proceedings of the 5th International ISCRAM Conference, 2008.
- Alec Pawling, Nitesh Chawla, and Greg Madey. “Anomaly Detection in a Mobile Communication Network.” Computational & Mathematical Organization Theory. 13:4, December, 2007.

34 / 38

First Author Publications

- Alec Pawling, Nitesh Chawla, and Greg Madey. “Anomaly Detection in a Mobile Communication Network.” In: Proceedings of the Annual Conference of the North American Association for Computational Social and Organization Sciences. 2006. (Received Best Student Paper Award).
- Alec Pawling, Nitesh Chawla, and Amitabh Chaudhary. “Evaluation of Summarization Schemes for Learning in Streams.” In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2006.
- Alec Pawling, Nitesh V. Chawla, and Amitabh Chaudhary. “Computing Information Gain in Data Streams.” In: Proceedings of the ICDM Workshop on Temporal Data Mining: Algorithms, Theory, and Applications. 2005.

35 / 38

Other Publications

- Gregory R. Madey, Albert-Laszlo Barabasi, Nitesh V. Chawla, Marta Gonzalez, David Hachen, Brett Lantz, Alec Pawling, Timothy Schoenharl, Gabor Szabo, Pu Wang, and Ping Yan. “Enhanced Situational Awareness: Application of DDDAS Concepts to Emergency and Disaster Management.” In: International Conference on Computational Science, Lecture Notes in Computer Science (LNCS 4487), Y. Shi, G. D. van Albada, J. Dongarra, and P. M. A. Sloot, Eds., May 2007, pp. 1090-1097.

36 / 38

Publication Plan

- Data warehouse implementation (Chapters 3, 4): Planned submission to the International Journal of Data Warehousing and Mining, pending revision.
- Feature clustering approach for event detection (Chapter 7): Planned submission to IEEE Transactions on Pattern Analysis and Machine Intelligence, pending revision.

37 / 38

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. CNS-0540348 and BCS-0826958, Northeastern University, and the Defense Threat Reduction Agency under Grant BRBAA07-J-2-0035.

I would also like to thank:

- My advisor: Dr. Madey.
- My committee: Dr. Chawla, Dr. Chaudhary, and Dr. Poellabauer.
- The outside chair: Dr. Hachen.
- Colleagues in Dr. Madey’s Group: Tim Schoenharl, Ping Yan, Ryan Bravo, and Ryan McCune.
- Dr. Barabasi and the members of the Center for Complex Network Research at Northeastern University. In particular, I am grateful to Dr. Marta Gonzalez, Dr. Julian Candia, Dr. Sune Lehmann, Dr. Jim Bagrow, Nick Blumm, and Dashun Wang.

38 / 38