NETWORK MONITORING AND DATA ANALYSIS IN ......wireless network forensics. Then, I explore how WiFi...

NETWORK MONITORING AND DATA ANALYSIS INWIRELESS NETWORKS

By

Yongjie Cai

A Dissertation Proposal Submitted to the Graduate

Faculty in Computer Science

in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY,

The Graduate Center, City University of New York

2014

Doctoral Committee:

Prof. Ping Ji, City University of New YorkProf. Spiros Bakiras, City University of New YorkProf. Changhe Yuan, City University of New YorkProf. Hanghang Tong, Arizona State University

c� Copyright 2014

by

Yongjie Cai

All Rights Reserved

ii

CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Network Measurement in Security Monitoring of WiFi Networks . . . . . 31.1.1 Security Monitoring for Wireless Network Forensics . . . . . . . 31.1.2 Wireless Monitoring for Home Security . . . . . . . . . . . . . . 4

1.2 Data Analysis: Pattern Discovery and Missing Value Recovery . . . . . . 51.2.1 Usage Pattern Discovery in Smartphone Apps . . . . . . . . . . . 51.2.2 Missing Value Recovery in Wireless Sensor Network . . . . . . . 6

1.3 Structure of This Proposal . . . . . . . . . . . . . . . . . . . . . . . . . 7

2. Security Monitoring in WiFi Networks . . . . . . . . . . . . . . . . . . . . . . 8

2.1 SMoWF: Security Monitoring for Wireless Network Forensics . . . . . . 82.1.1 Overview of SMoWF . . . . . . . . . . . . . . . . . . . . . . . . 92.1.2 Traffic Capture and Preprocess . . . . . . . . . . . . . . . . . . . 112.1.3 Evidence Preservation and Post-investigation . . . . . . . . . . . 122.1.4 Experiments and System Prototype . . . . . . . . . . . . . . . . 132.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Beagle: Physical Intruder Detection using WiFi networks . . . . . . . . . 182.2.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . 192.2.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2.3 MAC-based Detection . . . . . . . . . . . . . . . . . . . . . . . 212.2.4 RSSI-variance-based Detection . . . . . . . . . . . . . . . . . . 272.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3. Pattern Discovery and Missing Value Recovery in Data Analysis . . . . . . . . 30

3.1 Understanding the Dynamic Landscape of Smartphone App Usage . . . . 30

3.2 Fast Mining of a Network of Coevolving Time Series . . . . . . . . . . . 313.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 33

iii

3.2.3 Dynamic Contextual Matrix Factorization . . . . . . . . . . . . . 343.2.4 Preliminary Result . . . . . . . . . . . . . . . . . . . . . . . . . 423.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4. Proposed Work and Timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.1 Outline of Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2 Timeline for Completion . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

iv

LIST OF TABLES

2.1 Study of Popular Smartphones’ Probing Behaviors. . . . . . . . . . . . . . 23

3.1 Symbols and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

v

LIST OF FIGURES

1.1 Task Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 A typical wireless monitoring system . . . . . . . . . . . . . . . . . . . . . 9

2.2 Testbed and Walking Path. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Distribution of Observed 802.11 Packet Types. . . . . . . . . . . . . . . . . 14

2.4 Average Error Distances for Device Localization. . . . . . . . . . . . . . . 15

2.5 System Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.6 Active Scanning Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.7 Regular Used Smartphones’ Waiting Times in Worst Case Scenarios. . . . . 23

2.8 Study of signal strength from probe message. . . . . . . . . . . . . . . . . . 24

2.9 MAC-based Detection: Detecting intruders carrying WiFi devices. . . . . . 25

2.10 RSSI-variance-based Detection: Detecting intruders without carrying WiFidevices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.11 Effects of moving people on RSSI of Beacons (x axis is time in second, yaxis is RSSI). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.1 An illustrative example of a network of time series. (a) A simplified sensornetwork. (b) Measured temperature time series, each of which correspondsto a node with the same color of the sensor network in (a). The multiple timeseries in (b) are inter-connected with each other by its embedded network in(a). (Best viewed in color). . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2 Comparison between the baseline solutions (a and b) and the proposed for-mulation (c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3 Evaluation on missing value recovery in Motes Dataset (n=54,T=14400).Lower is better. DCMF outperforms all of the competitors. . . . . . . . . . 45

vi

ABSTRACT

Various wireless network technologies have been created to meet the ever-increasing de-

mand for wireless access to the Internet, such as wireless local area network, cellular

network and sensor network and many more. The communication devices have trans-

formed from large computational servers to small wireless hand-held devices, ranging

from laptops, tablets, smartphones to small sensors. The advances of these wireless net-

works (e.g., faster network speed) and their intensive usages result in an enormous growth

of network data in terms of volume, diversity, and complexity. All of these changes have

raised complicated network measurement and management issues.

In this proposal, I first investigate the impact of wireless local area network in home

and network security monitoring. Then I propose effective and efficient approaches in

analyzing network data, particularly those generated by smartphone apps and sensor net-

works.

vii

CHAPTER 1

Introduction

Various wireless network technologies have been created to meet the ever-increasing de-

mand for wireless access to the Internet, such as wireless local area network, cellular net-

work and sensor network and many more. The communication devices have transformed

from large computational servers to small wireless hand-held devices, ranging from lap-

tops, tablets, smartphones to small sensors. It is reported that wireless traffic accounted

for the majority of IP traffic at 44 percent in the last year. It is expected that traffic from

wireless and mobile devices will exceed traffic from wired devices by 2018, and by 2018,

wired devices will account for 39 percent of IP traffic, while WiFi and mobile devices

will account for 61 percent of IP traffic [1].

With easy deployment and low costs, WiFi networks provide Internet access at

homes, university and corporate campuses, and public places such as libraries, parks, and

coffee shops. While people enjoy the convenience of connecting electronic devices, such

as laptops, tablets, smartphones, video-game consoles and digital cameras to the Internet

via wireless access points, the wireless channels are prone to different types of attacks and

subject to performance problems such as congestions and high packet losses all the time.

Therefore, monitoring network activities in Wireless Networks to facilitate security and

network performance management is an important area of research. In my work, I will

explore the infrastructure and protocol design of network security monitoring for WiFi

networks.

Another type of fast evolving wireless networks is Cellular Network. With a rapid

pace, traditional landline telephones are being replaced by Voice over Internet Protocol

(VoIP) phones or cellphones. With the amazing smartphone technologies, the younger

generations have almost dumped landline phones (with either PSTN or VoIP service)

1

2

completely. Many statistics have shown that a significantly less number of households

own landline phones, and smartphones and other wireless enabled smart devices have

claimed a dominate role in remote voice communication [6]. It is also an irreversible

trend that the smartphones are replacing desktop computers or even laptops. We have seen

students doing their homework on smartphones, commercial stores swiping customers’

credit cards on smartphones and doctors writing prescriptions on smartphones. There-

fore, understanding how people use their smartphones and the applications is valuable for

social scientists to study human behaviors, and provides insight for app developers, data

service carriers and content providers from both technical and commercial perspectives.

In addition, a third type of wireless networks - Sensor Network has become an

important component in data communications as well. Research papers and practical

projects have been proposed and conducted in various types of application scenarios for

sensor networks. It ranges from networks of sensors with single and simple modalities

(e.g., temperature, air pressure, acoustic) to complicated networks of sensors that provide

data of mixed modalities and continuous readings. For example, sensor networks are

deployed to monitor the concentration of dangerous gases in the cities, detect forest fires,

monitor water quality in industries, measure temperature, humanity, and light both in

industries and residential houses, etc. [11]. These networks generate a large amount of

complex data in the format of coevolving multiple time series, which is very challenging

to manage. In this thesis, I will investigate the data characteristics of such complicated

sensor networks.

As Figure 1.1 shows, my thesis work focuses on the network measurement and

data analysis perspectives of three different types of wireless networks, which are tightly

related to people’s life.

3

People’s Life

WiFi Networks

Security Monitoring

Wireless Network ForensicsChapter 2.1

Home SecurityChapter 2.2

Cellular Networks(Smartphone Apps)

Usage Behavior StudyChapter 3.1

Wireless Sensor Networks

Time Series ModelingChapter 3.2

Figure 1.1: Task Breakdown

1.1 Network Measurement in Security Monitoring of WiFi Networks

The 802.11 based wireless network (known as WiFi) works as a ubiquitous com-

munication infrastructure because of its capabilities of low cost, easy deployment, high

network bandwidth and its inherent convenience for mobile devices. Many places such as

university campuses, corporate and government offices, public parks, or even entire urban

cities are blanketed with Internet access through numerous access points (APs).

In the first part of this proposal, I investigate the impact of wireless local area net-

work on security monitoring. I first identify the challenges in security monitoring resulted

from prevalent WiFi networks and propose a distributed security monitoring system for

wireless network forensics. Then, I explore how WiFi networks can be used to improve

home security.

1.1.1 Security Monitoring for Wireless Network Forensics

As wireless networks and mobile devices are claiming dominant roles in modern

communication technology, the manifestations of digital crimes have started integrating

wireless elements as well, which makes digital forensic investigation unprecedentedly

difficult. For example, a hacker may drive on the street, randomly pick an open WiFi

network, conveniently connect to the access point, upload or download malicious files

4

through the access point, then close the session and drive away. The whole process may

only take minutes to accomplish, and when the victim machine notices the attack, the best

point of interest that it can trace back is very likely only the benign access point, through

which the true attacker conducted the malicious activity. It is almost always certain that

the hacker will be cut loose.

In this work, a distributed Security Monitoring system for Wireless network Foren-

sics (SMoWF) is proposed to monitor Wireless LAN activities. Abstracts of network

traces are captured and selectively recorded at each monitoring point. Distributed mon-

itoring points collaborate to reconstruct the crime scene based on stored logs, and the

SMoWF system aims to answer an important investigation question: which device ap-

peared at where during what time.

1.1.2 Wireless Monitoring for Home Security

Home security is a big concern to everyone. Current home security systems rely on

many sensors [27, 29], such as door and window sensors, motion sensors, heat sensors,

etc. It is complex to install tens of such sensors. The cost to purchase and maintain these

sensors would be very high. Hence, many home owners are not willing to equip their

homes with security systems.

On the other hand, smartphones with WiFi turned on actively send probe request

frames with MAC addresses to search for available WiFi networks. MAC addresses are

already used to profile and track actual and potential shoppers and customers for stores

by commercial companies [2,7,10]. In addition, the signal propagations of wireless links

would be affected by people moving around, which motivates lots of research work on

detecting and tracking moving objects using wireless signals [18, 36, 47–49, 51].

Based on these two facts, I propose to design and develop a low-cost home secu-

rity system called Beagle to detect unexpected physical intruders, prevent property loss

and keep homes safe. Unlike traditional home security systems relying on placing sets of

5

sensors inside the house, Beagle is based on existing WiFi infrastructures with ubiquitous

WiFi signals penetrating the walls and thus extending our senses. With all the functional-

ities including monitoring, detection, and alarming built into one box, it is easy for users

to deploy and use the Beagle system.

1.2 Data Analysis: Pattern Discovery and Missing Value Recovery

Along with the fast advances of various wireless network technologies, the network

data have increased dramatically in terms of volume, diversity, and complexity. In the

second part of this proposal, I investigate data analysis approaches that are effective and

efficient in examining network data generated by smartphone apps and sensor networks.

1.2.1 Usage Pattern Discovery in Smartphone Apps

The use of smartphones has gone through a booming growth over the past few years

and become an integral part of people’s daily life. Through a plethora of smartphone apps,

people are indulging in entertainment, staying connected in social life, acquiring news

and information, conducting business activities, and managing nearly every aspects of

personal life — health, travel, education, finance, shopping, banking, dating, etc. Under-

standing how people use their smartphones and apps, the prevalence of various kinds of

smartphone apps, and their usage patterns and trends, is a great way for social scientists to

learn about the modern society [39]. It can also provide valuable insights for application

developers, service carriers and content providers to deliver a better user experience [38].

In this study, I analyze two-year long anonymized traffic data of smartphone us-

ages from both a major US-based cellular service provider and a large regional residential

DSL service provider. The analysis provides valuable insights on the usage patterns and

trend changes of a large variety of smartphone apps that people use nowadays. I compare

and contrast the usage patterns when smartphone users are mobile (accessing via cellular

network) and when they are at home (accessing through home WiFi). I also discover in-

6

teresting phenomena such as a version upgrade of an app resulting in a disruptive pattern

in traffic volume but not in its distinct user number. I further develop an automated alert-

ing system of such disruptive patterns for cellular service providers, allowing a timely

attention and response to the potential performance problems that they introduce.

1.2.2 Missing Value Recovery in Wireless Sensor Network

Wireless sensor networks have become an essential part of computer networks and

have been applied in many areas, ranging from environmental monitoring, industrial mon-

itoring, home security to physiological signaling in health care applications and many

more. These applications usually deploy a number of sensor nodes to collaboratively ac-

complish their tasks. The sensor nodes are prone to failures, and the topology of a sensor

network might change frequently. All of these characteristics of wireless sensor networks

make the analysis management of sensor data very challenging. In this work, I focus on

modeling data generated from wireless sensor networks, specifically, data in the format

of coevolving multiple time series with missing values.

In many scenarios, the multiple time series data is often accompanied by some con-

textual information in the form of networks. For example, in a smart building equipped

with various sensors, each sensor generates a time series by measuring a specific type

of room condition (e.g., temperature, humidity, etc.) over time. All these sensors or

time series are connected with each other by the underlying sensor network; in epilepsy

study, the various sensor measures collected from different brain regions of a patient are

essentially inter-connected by his or her brain graph; in scholarly article citation analy-

sis, the citations of different journals or conferences are correlated particularly for those

from similar domains. Such multiple time series, together with its embedded network are

referred as a network of coevolving time series.

In order to unveil the underlying patterns of a network of coevolving time series, in

particular to fill in the missing values of one or more input time series, I adapt the concept

7

of collaborative filtering that multiple coevolving time series and/or the contextual net-

work are determined by a few latent factors, and propose DCMF, a dynamic contextual

matrix factorization algorithm to simultaneously find the common latent factors in both

the input time series and its embedded network.

1.3 Structure of This Proposal

The rest of this proposal is organized as follows: I presents the studies of security

monitoring in WiFi Networks in Chapter 2. In Chapter 3, I study usage behaviors of

smartphone apps, and propose an effective and efficient algorithm to model coevolving

multiple time series generated by sensor networks. Then, I describes the proposed work

motivated by the preliminary results of Chapter 2 and Chapter 3, and presents the time-

line to complete the work in Chapter 4. Finally, I make conclusions of this proposal in

Chapter 5.

CHAPTER 2

Security Monitoring in WiFi Networks

2.1 SMoWF: Security Monitoring for Wireless Network Forensics

Cybercrime is an exploding security challenge in the current digital age, and has

been largely concerned by the public over the past several decades. With the escalating

deployment of WiFi networks, the accelerated usage of mobile devices, and the dynamic

physical and protocol characteristics of wireless communication, wireless links have be-

come an increasingly popular channel for cyber criminals to camouflage their true iden-

tities. For example, a malicious hacker may walk on the street, randomly pick an open

Access Point (AP) such as one from a coffee shop, conveniently connect to the AP, upload

and/or download malicious files or send commands through the AP to Zombie machines

compromised for a distributed attack, then close the communication session and take off

freely. The whole process may only take minutes to accomplish. No one would even

notice what has happened until the victim device(s) detects an intrusion. When an at-

tack has been identified, the investigators will easily run into the situation where the best

point-of-interest that can be traced back is the benign access point, through which the true

attacker conducted the malicious activity. Since most WiFi users do not regularly keep

security logs and monitor the activities of their wireless network, it is almost certain that

the hacker is off the hook.

An intuitive way is to ask users to keep SNMP logs in APs and record partial traffic

statistics such as application workloads, and user sessions by some traditional measure-

ment methods applied on the wired part of WiFi network traffic [14,23,42,45]. However,

only limited knowledge can be obtained in the wireless part (PHY/MAC layers under IP

layer) through such wired measurement efforts. Researchers also have proposed several

8

9

wireless monitoring infrastructure systems, primarily for improving wireless channel and

protocol performance. Figure 2.1 demonstrates a framework of a typical wireless mon-

itoring system, which consists of three main components: wireless network monitoring

points, a data repository and a centric processing engine. Each monitoring point gathers

network information by passively capturing traffic in the air, and transmits the gathered

raw data to the repository. The centric processing engine conducts network analysis and

reports events of interest to network operators. DAIR [12], Jigsaw [19] and Wit [33]

are three such systems built on this infrastructure. These two monitoring methods might

work in the organizations with network specialists, such as corporates and universities.

However, it is impractical to expect users to do similar network administrations at home.

!

Monitoring!Point! Monitoring!Point!

Repository!Centric!

Processing!Engine!

!

User!Interaction!

…!

Figure 2.1: A typical wireless monitoring system

In this work, I propose to design a Security Monitoring system for Wireless Net-

work Forensics (SMoWF), which aims to keep trace digests of WiFi activities and to an-

swer the important investigation question: which device appeared at where during what

time.

2.1.1 Overview of SMoWF

The emerging and increasing growth of WiFi wireless networking technology makes

it possible to connect to the Internet from anywhere at anytime. For example, in a wire-

less network measurement study [17], we conducted experiments around a three-block

metropolitan neighborhood of the upper-west side of Manhattan, and observed over 3000

access points in the neighborhood. The densely deployed WiFi networks are undoubtedly

10

making our life much easier and more enjoyable, but they also provide more opportunities

for malicious users to conduct criminal activities through mobile devices. We notice that

among our observed access points, about 30 percent provide unencrypted WiFi services.

In other words, these open networks can be easily compromised. In this work, I propose

a security monitoring infrastructure for wireless network forensics (SMoWF), which is

to build an intelligent monitoring system that can uncover malicious devices, track their

activities in Wireless LAN of metropolitan areas, and preserve digital evidence to facili-

tate future cyber crime investigation. The SMoWF system should be able to answer the

following questions:

• Whether or not a particular device was involved in a given malicious network ac-

tivity?

• Can this device be uniquely identified by the logs?

• Where was a particular device physically located during a given event?

Similar to Figure 2.1, the SMoWF system consists of a set of monitoring points that are

responsible to capture wireless network traffic. These monitoring points are distributed

through a wireless network and may be moved around to cover Wireless LANs as much

as possible. After the collection of raw traffic data, SMoWF parses raw data into human-

readable texts, eliminates irrelevant traffic types and extracts useful information for device

identification and localization. It also removes the payload (i.e., data part) of traffic pack-

ets to protect users’ privacy. SMoWF uses a central repository to store processed data as

digital evidence. Finally, it includes a post-investigation engine that helps investigators to

figure out what was going on when a criminal activity occurred. The post-investigation

engine retrieves relevant data from the evidence repository and is able to answer the afore-

mentioned questions.

11

2.1.2 Traffic Capture and Preprocess

Comparing to those monitoring systems deployed in buildings or universities [26]

[19], there are several challenges of wireless traffic monitoring in a metropolitan area:

1) the number of access points that are observable (i.e., accessible to malicious users) is

very large; 2) the locations and distributions of access points are unknown; 3) we do not

have permissions to administrate these access points. Therefore, the traditional ways of

obtaining traffic are not practical. We cannot configure all of these access points to log

their real-time traffic, nor can we deploy thousands of static stand-alone monitoring nodes

to cover the whole area.

The SMoWF system delegates the traffic capturing tasks to wireless monitoring

points, such as laptops, tablets or even smartphones, being either stationary or mobile.

These monitoring points passively capture nearby wireless network traffic, and periodi-

cally upload the encrypted or hashed traffic logs to a central repository. Particularly, in

our experiments, we use Kismet [4] installed on a MacBook Pro to gather raw wireless

network traffic. Kismet is a 802.11 wireless network sniffer working with any wireless

card supporting monitoring mode, and detects networks by passively collecting network

packets. It can provide GPS coordinates where packets are observed when integrated with

a GPS device. Kismet generates two important log files - pcapdump and gpsxml. All the

packet information above the MAC layer and the Per-Packet Information (PPI) header

that includes the transmitting channel, received signal and noise strength, are logged in

pcapdump files. GPS information such as coordinates and speed are recorded into gpsxml

files. Kismet gathers raw packets into pcapdump files and records monitoring locations

into gpsxml files. Then, tshark [9] is used to parse and extract selected fields of raw

packets into human-readable texts.

12

2.1.3 Evidence Preservation and Post-investigation

For evidence preservation, it is not efficient enough to reduce data storage size by

only extracting packet headers. The network traffic size can be huge compared to limited

storage capabilities. For instance, in Section 2.1.4, 362,305 packets (i.e., about 311MB

trace data) were collected by randomly walking around a three-block neighborhood for 12

trips lasting around four hours. These data came from only one single monitoring point.

If tens or hundreds of monitoring points participate, it can easily get to 10-100 GB traces

in one day. For instance, Jigsaw [19] collected 96GB 802.11 raw traces in one day using

192 radio monitors.

I make statistic analysis on the packet types among the collected trace. As the result

in Section 2.1.4 presents, half of the packets are beacon frames sent from access points.

The beacon frames are used to claim the existence of WiFi networks and are not related to

the associated clients. Therefore, the beacon packets can be filtered out. Other extracted

information is stored into a database in order to support efficient queries and conduct post

forensics investigation.

The post-investigation component aims to answer the questions described in Sec-

tion 2.1.1, which can be summarized as device identification and device localization prob-

lems. As a preliminary work for the SMoWF system, MAC addresses are used as the

unique identifiers of mobile devices. For device localization, I study and evaluate two

localization algorithms: the weighted centroid algorithm [20] and the log-distance path

loss modeling method [40]. The weighted centroid algorithm simply estimates the loca-

tion of a target device as the weighted sum of all locations where it is observed, described

in Equation 2.1. p is the estimated location of the target device, pi

is the ith location co-

ordinate where the device is observed, and the weight wi

is proportional to si

the signal

13

strength received from the target device at ith location.

p =

X

i

wi

pi

, wi

/ si

,X

i

wi

= 1. (2.1)

The log-distance path loss modeling method describes that the average received

signal strength decreases logarithmically with distance, shown in Equation 2.2.

si

= s0 � 10� log di

+X�

, (2.2)

where si

is the received signal strength at Position i. di

is the physical distance from

the target device with coordinates (x, y) to the monitoring point with coordinates (xi

, yi

).

The path loss exponent � indicates the loss rate of the received signal strength. s0 is the

received signal strength at a distance of one meter. To compensate for the random shadow-

ing effects in signal propagation, X�

is added as a zero-mean Gaussian distributed random

variable with standard deviation �. Theoretically, four monitoring points or monitoring

locations are required to determine four parameters (s0, �, x, y) of the target device so as

to know the location coordinate (x, y). However, in practice, the target device is observed

at more than four locations, which results in a set of over-determined equations. There are

several solutions to this problem. For example, Krishna [21] proposed to find solutions to

minimize the least mean absolute error of the equations. In the SMoWF system, to sim-

plify the implementation, the trust-region-reflective optimization approach implemented

in Matlab is used to minimize the least square error, which is defined as follows:

J =

X

i

(si

� s0 + 10� log di

)

2 (2.3)

2.1.4 Experiments and System Prototype

Trace Storage To better understand the wireless traffic, I conduct several war-

14

Figure 2.2: Testbed and Walking Path.

walking experiments in a three-block metropolitan area of the upper-west side in NYC

shown in Figure 2.2, which is around 260m*260m. I use a MacBook Pro laptop with

internal airport wireless card and equipped with a BU353 GPS receiver as a moving mon-

itoring point. Kismet, installed on the MacBook Pro, is configured to hop on channels

to cover the entire spectrum and log all received wireless packets. I walked around the

neighborhood for 12 trips along the path of A-H or H-A in a week of April of 2011. We

collected 362,305 packets around 311MB traces.

Figure 2.3: Distribution of Observed 802.11 Packet Types.

The distribution of packet types of these traces is shown in Figure 2.3. As can be

observed, half of the packets are the beacon frames. The beacon frames are sent from

access points and do not provide information about any mobile device. They make little

15

Figure 2.4: Average Error Distances for Device Localization.

contribution to forensics investigation. Therefore, the beacon packets can be left out or

only preserved periodically. On the other hand, when a user is uploading or downloading

an image or a document or a video, there are bunches of data packets in the session.

Several of these packets in the same session are enough for the identification and tracing

back processes. Flow or session identifying techniques (e.g. sequence number) can be

used to identify the start and the end of a session, then only selected packets are preserved

in the repository as digital evidence.

Device Localization To evaluate the two localization algorithms described in Sec-

tion 2.1.3, I conduct the following experiments. Two smartphones are used as target

devices and I choose five test locations where they are able to connect to nearby WiFi

networks. At the same time, I collect their communication traces using one monitoring

point. To get the ground truth of the test locations, I run the GPS receiver for 2 minutes

statically at these locations and average the coordinate readings as the ground truth. For

each test location, I use the Haversine formula [3] to calculate the error distances between

the ground truth and estimated coordinates of the two devices through two algorithms.

As Figure 2.1.4 shows, the average error distances for two devices are around 31 and

40 meters using the weighted centroid method, 44 and 87 meters using the log-distance

path loss modeling approach. Hence, the SMoWF system applies the simple but effective

16

weighted centroid algorithm.

System Prototype As described before, a MacBook Pro laptop with built-in airport

wireless card and an external BU353 GPS receiver is used as a moving monitoring point.

It runs Kismet, which is configured to hop on channels to cover the entire spectrum.

After trace collection, raw data are parsed and selected packet fields are extracted

into our database. These fields contain frame date and time, MAC addresses including

source address, destination address, BSSID, transmitter address, and receiver address,

data length, channel frequency, received signal strength, type and subtype of 802.11, IP

source and destination addresses, source and destination port of TCP and UDP. The pay-

loads of these packets are dropped. Notice that a packet does not always include all the

fields. For example, 802.11 Acknowledgement and Clear-To-Send packets only contain

receiver MAC address and no other MAC address. One packet only has source or desti-

nation port either from TCP or UDP. Neither TCP nor UDP information can be obtained

from encryption packets. Furthermore, the location information of the monitoring points

are extracted into the database, which contains date, time, source MAC address, signal,

noise, latitude, longitude, altitude, fix, speed, heading. Indices are created on date fields

to speed up queries.

A simple web user interface is developed to help investigators to trace back their

interested device. As shown in Figure 2.5, an investigator can enter a date and time

period of an interesting event, and/or a MAC address, IP, BSSID. The SMoWF system

then pulls out the records that are related to the input information from the database,

shown in Figure 2.5(a). Meanwhile, it estimates the locations of the tracked device with

five-minute window using the weighted centroid algorithm, shown as Figure 2.5(b). It

also generates a KML file that visualize the geo locations of the device in Google Earth,

shown in Figure 2.5(c). In this way, the investigators can easily track the locations of their

interested device.

17

(a)

(b) (c)

Figure 2.5: System Prototype

2.1.5 Summary

In this work, a wireless forensic monitoring system (SMoWF) is outlined, which

aims to establish a forensic database based on wireless trace digests, and to answer the

following investigation questions: 1. Was a particular device involved in a given malicious

network activity? 2. Can this device be uniquely identified by the logs? 3. Where a

particular device was physically located during a given event. I conducted research and

experiments for the following tasks: 1. Design network trace logging method that records

the abstract of useful fields of network packets. Here only abstracts of packets are kept for

privacy protection purpose. 2. Design a query/search system that allows users to conduct

forensic analysis activities based on monitored traces; 3. Study localization algorithms

that can provide the location information of a given device.

18

2.2 Beagle: Physical Intruder Detection using WiFi networks

Home security is a big concern to everyone. Current home security systems rely on

placing various types of sensors to monitor different information about a house [27, 29].

For example, door and window sensors are installed to detect the open or close status

of doors and windows; motion sensors are placed in critical locations to detect human

movements; other monitoring sensors such as fire and smoke sensors, water sensors, tem-

perature sensors, etc. are deployed to ensure the house being kept in a good condition.

However, the installations and maintenance of such sets of sensors are very complex and

time-consuming. In addition, the cost of these home security systems is very high. Many

house owners will not buy and use these systems.

On the other hand, the use of smartphones has become an integral part of people’s

daily life. Smartphones with WiFi turned on actively send probe packets to search for

available WiFi networks. The probe packets contain MAC addresses of smartphones,

which can be an unique identification of a user. Some marketing companies and location

analytics firms such as Euclid Analytics [2] and Turnstyle Solutions [10] use MAC ad-

dresses to profile and track actual and potential shoppers and customers for stores. 100

recycling bins were installed in the city of London before the 2012 Olympics to monitor

the phones of passers-by using the same technique [8]. Similarly, Presence Orb [7] ap-

plies it to detect if buses or bars are full. Furthermore, when a person moves around the

area of wireless networks, the signal propagations of the wireless links would be affected,

which leads to the the variation of measured received signal strength (RSS).

In this work, I utilize two aforementioned facts, and design a low-cost home security

system called Beagle to tackle one of the most important tasks in home security - physical

intruder detection, which aims to detect unexpected physical intruders, prevent property

loss and keep homes safe. Unlike traditional home security systems relying on placing a

bunch of sensors inside the house, Beagle is based on existing WiFi infrastructures with

19

ubiquitous WiFi signals penetrating the walls and thus extending our senses. With all

the functionalities (e.g., monitoring, detection, alarming) built into one box, it is easy for

users to deploy and use Beagle.

2.2.1 Design Considerations

In this section, I discuss some typical scenarios in physical intruder detections for

home security, and then describe how we can deal with these scenarios with the design of

Beagle. The detailed design is described later.

Let us consider the six following typical scenarios in physical intruder detections:

1. The owner is in the house;

2. The owner has a guest in the house;

3. An unknown person passes by the house within a short time;

4. An unknown person is in the neighbor’s house;

5. An unknown person is around the house for a relatively long time.

6. An unknown person is in the house;

Intuitively, the first three cases are normal (i.e., no intruders), and happen a lot in our

daily life. For example, in the third case, neighbors or simply strangers may pass by our

house many times, but the time they stay is very short. In addition, we won’t care about

whether there are thieves in the neighbor’s house, described as the fourth case. Hence, the

Beagle system should not alarm the user for the first four scenarios. On the other hand,

when some unknown person is lingering around a house (i.e., Scenario 5), the owner of

the house probably will want to know that. When an intruder or a stranger is in the house

(i.e. Scenario 6), the owner definitely would like to be alarmed.

With the ubiquity of 802.11 based wireless networks, localization using wireless

signals is becoming a popular research topic. Many methods are proposed to achieve

high accuracy in positioning static or/and moving subjects. However, these methods only

focus on answering questions such as whether there is a subject and where it is. They

20

do not consider whether this subject is expected or not. Hence, these methods cannot be

directly applied in our scenarios.

The Beagle system is based on the residential WiFi networks to detect unexpected

physical intruders and stay armed for home security. Generally, Beagle automatically

learns and maintains a whitelist that contains users’ identifications (i.e., MAC addresses

of users’ smartphones), which is used to differentiate an authorized or known person from

a suspicious or unknown person. It further infers whether an unknown person is expected

from the presence of the house owners. Also, Beagle sets up RSSI thresholds to infer

whether the person (i.e. the target device) is in the area of interests, such as users’ house

or neighbors’ houses. Finally, Beagle combines the effects of moving people on RSSI to

detect the intruder entering into the house.

2.2.2 System Overview

Based on the aforementioned design considerations, the Beagle system should be

able to a) differentiate between non-intruders (e.g., friends, people pass by) and the actual

intruders, b) detect the actual physical intruders; c) send alerts to the user instantly in

b)’s condition. To accomplish these tasks, the Beagle system is designed with three main

components: Packet Capture and Information Extraction, Intruder Detection, and Alert

System.

In Packet Capture and Information Extraction, Beagle monitors and collects WiFi

frames around a house by placing a WiFi monitoring point at an important location in the

house (the places that intruders will enter into, e.g., the living room, the hallway). Mean-

while, it extracts and analyzes certain significant fields such as device identifications (i.e.,

MAC addresses) and received signal strengths for intruder detection. In Intruder Detec-

tion component, Beagle checks and models some extracted fields, and triggers the alert

system when any intruder is detected. In Alert System, house owners will be informed

with instant message (e.g., gtalk in current implementation) upon the detection of any

21

intruder.

The Intruder Detection part is the key component of the Beagle system. As men-

tioned before, two facts are utilized to design and implement this component. Firstly,

by monitoring the probe packets sent from nearby smartphones, any suspicious MAC

address coming from a potential intruder can be detected, which derives a MAC-based

Detection approach. Secondly, it can be detected the intruder’s activities in the house

through the resulted abnormal RSS variances in WiFi networks, which derives the design

of an RSSI-variance-based Detection approach.

2.2.3 MAC-based Detection

The MAC-based detection method aims to detect intruders carrying WiFi devices.

As we described, a WiFi-enabled device actively broadcasts probe request to search avail-

able networks. The probe request packet contains the MAC addresses of the device, which

serves as an identification of the owner of the device. By looking at its received signal

strength indicator (RSSI), it is possible to infer the approximate distance of the target

device. To design the MAC-based detection method, two measurement-based studies are

conducted to answer the following critical questions:

• How can we capture the probe packets if any?

• How can we tell the target person is inside or outside?

a) Study of smartphones probing behaviors

Figure 2.6: Active Scanning Procedure

22

Figure 2.6 demonstrates a simplified version of active scanning procedure through

probe request frames in 802.11 protocols. Generally, upon the receipt of scanning request

from operating system, a mobile device starts a probe cycle and scans its supported chan-

nels by sending out probe requests to each channel one by one. During this probe cycle, it

waits for some time (i.e., probe delay) between sending any two probes. If any response is

received, it will process the probe responses. Otherwise, it waits for the next scan request.

In this procedure, many parameters (e.g., the order of the scanned channels, the

number of probes sent on each channel, and the probe delay) that would affect the packet

capture vary with device models. To infer these parameters, I conduct a probing behavior

study on five popular smartphone models on market, including iPhone 4S, iPhone 5 and

5S, Samsung Galaxy S5 and HTC One (M7). In the study, a sniffer is set up using a

laptop, its built-in WiFi network card and an external WiFi network card. I configure both

cards in monitor mode, with one listening on Channel 1 and the other on Channel 11.

The smartphones are set in asleep status, but the installed apps are still running, and WiFi

networks are turned on. The collection of the probe request frames lasts around one hour.

After the collection, the parameters are inferred based on the fields of probe frames, such

as timestamps, sequence numbers, and channels. For example, the probe delay time can

be derived by calculating the difference between the timestamps of two adjacent probes

in the same probe cycle. The number of the probes per channel can be calculated by

counting the number of probes observed on the same channel in the same probe cycle.

Table 2.1 lists the detailed results. As can be observed, all of the studied mobile

devices support channels of 802.11b/g/n. They send one or two probes on each channel.

The probe delay between two adjacent probes is around 20ms or 40ms. Hence, to ensure

the capture of the probes if any, we need to stay on any channel of b/g/n (i.e., Channel

1-11) for more than two seconds. To make things simpler, the sniffer is fixed with one

monitoring channel in the implementation.

23

Table 2.1: Study of Popular Smartphones’ Probing Behaviors.Device Channels #Channels #Probes Per Channel Probe Delay (ms) One Probe Cycle (ms) Observed Waiting Time

iPhone 4S b/g/n 11 2 20 440 2s-43 minsiPhone 5/5S a/b/g/n 20 1 40 800 2s-10 mins

Samsung Galaxy S5 a/b/g/n/ac 21 2 40 1620 2s-3 minsHTC One (M7) a/b/g/n/ac 21 2 40 1620 4s-30 mins

(a) iPhone 5 (b) Samsung Galaxy 5

Figure 2.7: Regular Used Smartphones’ Waiting Times in Worst Case Scenarios.

As presented in Table 2.1, the waiting time between two probe cycles varies with

different devices. Generally, the newer model is observed having less waiting time. I fur-

ther measure the waiting times of regularly used devices in worst case scenarios. Specif-

ically, I check how often two most popular models, iPhone 5 and Samsung Galaxy S5,

send probes with all installed apps being closed during night time (i.e., no user interac-

tion). Figure 2.7 shows the CDF plots of their observed waiting time intervals. As we can

see, for iPhone5, 95% of waiting times are less than 10 minutes and 85% of waiting times

are less than 5 minutes; for Samsung Galaxy S5, 95% of waiting times are less than 5

minutes. This result indicates that if an intruder stays longer than 5 minutes, the detection

probability is very high.

b) Study of signal strength from probe frames

Wireless signal strengths have been used to learn the locations of the mobile de-

vices. For example, the mobile device localizes itself by comparing the received signal

strengths of APs against the pre-calibrated wireless map [13, 37]. My task is different

24

from those localization problems in two aspects: one is that I aim to track the target de-

vice, rather than that the target device localizes itself; the other is that only one sniffer or

monitoring point is used, so the traditional device localization methods cannot be applied

in the Beagle system.

To answer the second question, I propose to use the received signal strengths of

probe frames to infer the target device’s coarse location. A measurement study is con-

ducted to explore how the RSSIs of the probes of the smartphones change in different

positions around a house. In the study, a sniffer is placed in the living room, and a few

important locations are explored with two smartphone models, iPhone 5 and Samsung

Galaxy S5. At each location, I put the smartphone in the pocket and collect about 100

probes when I face the sniffer and turn my back to the sniffer respectively. As described

before, smartphones would wait for seconds to minutes to start an active scanning, which

would make the measurement study time-consuming. It is noticed that when the smart-

phone is waked up from the sleep status, it starts an active scan for available networks

immediately. Hence, the smartphones are made to switch between awake and asleep con-

stantly to speed up the collection during the measurement.

Figure 2.8: Study of signal strength from probe message.

25

Figure 2.8 shows the CDF plot of the RSSIs of Samsung Galaxy S5. The line colors

denote different places: blue means the smartphone is far away from the sniffer (around 10

meters); magenta denotes it is outside the door (around 4 meters away from the sniffer);

and the black lines are measured when the smartphone is at the door but inside home

(around 4 meters away from the sniffer). The line types denote whether the human faces

or turns back to the sniffer. As can be observed, the RSSIs significantly decrease due to

the presence of human body. To infer the approximate distance of the target device, with

85% confidence, -75 is selected as a low threshold to indicate the device is far away, and

-62 is picked as a high threshold to indicate the device is very close to the sniffer. Similar

results are observed with iPhone 5.

c) The Proposed Algorithm

Figure 2.9: MAC-based Detection: Detecting intruders carrying WiFi devices.

Figure 2.9 demonstrates a general idea of how the MAC-based detection method

works. A sniffer with the capability of monitoring WiFi networks is set up in an important

spot of the house (e.g., the living room). The sniffer keeps capturing probe request frames.

If the frame is from some known MAC, or unknown MAC address but in presence with

26

known MAC address, the detected device comes from the home owners’ or their guests.

If the frame is from short-lived MAC address with very weak RSSI, the detected person

might be innocent. However, if an unknown MAC address with very strong RSSI is

observed and if no ”familiar” MAC address is accompanying it, it probably comes from

an intruder.

Algorithm 1: MAC-based DetectionInput : PR = htimestamp,MAC,RSSIi, RSSI Low,RSSI High

1 if PR.MAC not in Whitelist then/

*

user not at home or current time is at night

time or user asleep at daytime

*

/

2 if ! isUserAtHome() or isNight() or noMovement() then3 if PR.RSSI > RSSI High then4 alert();5 end6 else7 if PR.RSSI < RSSI Low then8 return;9 end

10 else11 if PR.MAC in Probelist && The time interval from last probe

from PR.MAC> 5 mins then12 alert();13 end14 else15 Save PR in Probelist;16 end17 end18 end19 end20 end

Accordingly, Algorithm 1 describes the implementation of the MAC-based detec-

tion approach. At the beginning, the MAC addresses of the house owners’ smartphones

are input, which is used to infer whether the house owners are at home (isUserAtHome()).

A night time period is set up to determine the activation of the Beagle system (isNight()).

During the daytime, the detection is activated only when there is no movement in the

27

house (noMovement(), implemented based the RSSI-variance-based method). A whitelist

(Whitelist) is used to store trusted devices, and a probe list (Probelist) is used to main-

tain the MAC addresses of probes in recent half an hour. Besides, the observed probe

request is represented as a triple PR = htimestamp,MAC,RSSIi. As Algorithm 1 de-

scribes, the Beagle system stays armed in the circumstances that a) the user is not home;

b) during night time; c) it is daytime but the user is asleep. In the armed status, if a

device that does not belong to the whitelist is detected with an RSSI exceeding the high

RSSI threshold (RSSI High), or it stays longer than 5 minutes with relatively high RSSI

(RSSI Low PR.RSSI RSSI High), the Beagle system would send alert to the

house owner.

2.2.4 RSSI-variance-based Detection

As Figure 2.10 demonstrates, when an object moves around the area covered by

wireless networks, the transmissions of both line-of-sight links and non-line-of-sight links

would be affected. Both would lead to the the variations of measured received signal

strength (RSS) on the links. Based on this fact, an RSSI-variance-based detection method

is proposed to detect physical intruders, to compensate for the limitation of the MAC-

based detection approach – an intruder cannot be detected if he or she does not carry any

WiFi enabled device.

Figure 2.10: RSSI-variance-based Detection: Detecting intruders without carryingWiFi devices.

28

Although plenty of research work propose intelligent algorithms to track and local-

ize a moving object based on the variance of RSSIs it makes [48,51], most of them deploy

a network of sensor nodes in the monitoring area and assume an ideal environment (i.e.,

no other object except the target one is in the area). It’s hard to meet the above require-

ments in home environment. Instead, access points in home WiFi networks broadcast

beacon frames at a high frequency (usually one beacon frame every 100ms) to announce

the presence of WiFi networks, which can be utilized to detect an intruder.

a) Effects of moving people on RSSI

As a first step, I study the impact of moving people on the RSSIs of beacon frames.

The experiments are set up as follows: a sniffer is placed in the living room, one AP (AP1)

is in the same room and the other AP (AP2) is in a different room (e.g., a bedroom). Both

APs and the sniffer are set working on the same channel. The sniffer is started and the

house is left empty for around 10 minutes. Then, I enter into the house, and walk around

the living room and the room with AP2 placed in to simulate an intruder’s behavior.

(a) AP 1 (b) AP 2

Figure 2.11: Effects of moving people on RSSI of Beacons (x axis is time in second,y axis is RSSI).

Figure 2.11 demonstrates the measured RSSIs of beacons from AP1 and AP2. It

can be observed that the RSSIs from both APs are noisy. Due to multi-path effects, two

29

sets of dominant RSSIs (around -50 and -60) are observed from AP2. In addition, it is

observed that with a moving person, the RSSIs of both APs change significantly around

the 10th minute.

b) The Proposed Method

Based on the previous preliminary results, I plan to propose an effective method to

detect an intruder.

2.2.5 Summary

In this work, I propose a low-cost home security system based on the existing WiFi

infrastructures to detect unexpected physical intruders. Specifically, I explore two main

methods - the MAC-based detection approach and the RSSI-variance-based detection ap-

proach to detect intruders with/without WiFi devices. Preliminary results demonstrate

both methods would be feasible and effective.

CHAPTER 3

Pattern Discovery and Missing Value Recovery in Data Analysis

3.1 Understanding the Dynamic Landscape of Smartphone App Us-

age

There has been many prior studies focusing on the understanding of smartphone

usage, which can be broadly divided into two categories based on the data collection

methodology. The UE (User Equipment) based collection requires mobile users’ consent

and participation. For example, [22] analyzed data from 77 participants in European

countries on a specific model of Nokia smartphone for 9 months; [24] studied traces from

43 mobile users with up to 5 months usage and broadened later in [25] to 255 users for up

to 28 weeks. The UE based approach is fundamentally limited in scale and the result can

be highly influenced by the participant samples — e.g., 16 out of the 33 Android users in

[25] being high school students may not represent the real distribution of user population.

On the other hand, the in-network measurement approach collects usage data within the

data communication networks. For example, [46] studied a one-week network trace in a

3G mobile network in a large metropolitan area; and [43] and [50] analyzed week-long

nation-wide traffic traces both taken from a major 3G cellular service provider in USA;

and [34] studied the usage behavior of mobile handheld devices connected through home

WiFi over residential DSL lines using 4 one-day traces for 20K subscribers spanning

11 months. Maintaining a large scale measurement and traffic processing system in any

operational network is tremendously challenging. Hence in-network measurement study

is often limited in duration, making it difficult to track the evolution of the smartphone

apps’ usage over time.

In this work, I analyze smartphone usage traces monitored in a major cellular

30

31

(3G/LTE) service provider in USA and in a regional residential high speed DSL ISP

network for over two years. The unparalleled scale of the measurement data provides

us the opportunity to understand the long term popularity changes of smartphone apps’

usage, the trending and seasonality of the usage patterns of individual apps, as well as

the capability to contrast how people are using their smartphones in the comfort of their

home versus while being mobile over the cellular network. Overall, my analysis provides

valuable insights for social scientists, network operators and content providers.

Due to the company policy, I cannot release the details of the analysis now.

3.2 Fast Mining of a Network of Coevolving Time Series

3.2.1 Introduction

Coevolving multiple time series are ubiquitous and naturally appear in a variety

of high-impact applications, ranging from environmental monitoring, computer network

traffic monitoring, motion capture, to physiological signaling in health care and many

more. In many scenarios, the multiple time series data is often accompanied by some con-

textual information in the form of networks. For example, in a smart building equipped

with various sensors, each sensor generates a time series by measuring a specific type

of room condition (e.g., temperature, humidity, etc.) over time. All these sensors or

time series are connected with each other by the underlying sensor network; in epilepsy

study, the various sensor measures collected from different brain regions of a patient are

essentially inter-connected by his or her brain graph; in scholarly article citation analy-

sis, the citations of different journals or conferences are correlated particularly for those

from similar domains. We refer to such multiple time series, together with its embedded

network as a network of coevolving time series.

Figure 3.1 presents an illustrative example of such time series. Figure 3.1(a) is a

simplified version of sensor deployment at an office floor. The weights on the labeled

32

(a) Sensor network

0 1000 2000 3000 4000 5000 600014

16

18

20

22

24

26

time

tem

pe

ratu

re

(b) Time series

Figure 3.1: An illustrative example of a network of time series. (a) A simplifiedsensor network. (b) Measured temperature time series, each of which correspondsto a node with the same color of the sensor network in (a). The multiple time seriesin (b) are inter-connected with each other by its embedded network in (a). (Bestviewed in color).

edges indicate the closeness and connectivity between two sensors. Figure 3.1(b) is two-

day temperature time series measured from (a)’s sensor network. The color is used to

differentiate different sensors and their corresponding time series. As can be seen, the

purple line and the cyan line are very close to each other, and the weight between their

sensors is relative high.

In order to unveil the underlying patterns of a network of coevolving time series, in

particular to fill in the missing values of one or more input time series, I adapt the concept

of collaborative filtering that multiple coevolving time series and/or the contextual net-

work are determined by a few latent factors, and propose DCMF, a dynamic contextual

matrix factorization algorithm to simultaneously find the common latent factors in both

the input time series and its embedded network.

In the last part of my proposal, I introduce a novel setting on mining a network of

multiple coevolving time series and propose a dynamic contextual matrix factorization

algorithm to find the missing values in a network of time series. The preliminary ex-

perimental result on a real sensor network dataset demonstrates the effectiveness of the

33

Table 3.1: Symbols and DefinitionsSymbol Definition and DescriptionA,... matrices (bold upper case)A

ij

the element at the ith row and jth column of matrix AA

j

the jth column of matrix AA(i, :) the ith row of matrix AA0 transpose of matrix Aa,b, ... column vectors (bold lower case)� element-wise multiplicationX observed time series (x1, ..,xT

)W indicator matrixS contextual matrixZ time series latent matrix (z1, ..., zT )B transition matrixU object latent matrixY contextual latent matrixn number of objects in X and ST length of time seriesl dimension of latent variables

proposed algorithm, which outperforms its competitors, especially when there are lots of

missing values.

3.2.2 Problem Definition

Table 3.1 lists the main symbols used throughout this work. Besides the standard

notations, a matrix Xn⇥T is used to denote observed time series with n dimensions and

T time steps, where Xit

denotes the data from ith object at time t. An indicator matrix

Wn⇥T is used to indicate data is observed or missing in X, i.e., Wit

= 0 when Xit

is

missing, otherwise Wit

= 1. A contextual matrix Sn⇥n is used to represent the embedded

network.

With the above notations, a Network of Time Series (NoT) can be formally defined

as follows:

Definition 3.2.1 A Network of Time Series (NoT). Given an n dimensional partially ob-

served time series with T time steps X, an indicator matrix W, an n⇥n contextual matrix

34

S representing the network behind X, and a one-to-one mapping function ⇣ , which maps

each dimension of the time series X to a node in S, a Network of Time Series is defined

as the quadruplet R = hX,W,S, ⇣i.

The problem of time series missing value recovery can be defined as follows:

Problem 3.2.1 NoT Missing Value Recovery.

Given: a network of time series R = hX,W,S, ⇣i;

Recover: its missing parts indicated by the indicator matrix W.

3.2.3 Dynamic Contextual Matrix Factorization

In this section, I present the solution for NoT Missing Value Recovery problem

(Problem 3.2.1). I start with the discussions of some baseline solutions and their limi-

tations, followed by our formulation to encode the temporal smoothness and the corre-

sponding algorithm.

3.2.3.1 Preliminaries and Challenges

I would like to point out that Problem 3.2.1 bears some similarity to the classic

collaborative filtering problem. To be specific, if the rows (time series) and columns

(time steps) of X are treated as users and items, respectively; the observed entries in X

as the ‘ratings’ between the corresponding user and item; and the contextual matrix S as

the ‘social network’ among different users, Problem 3.2.1 can be viewed as inferring the

missing or unknown ratings between the users (rows of X) and the items (columns of X).

Having this in mind, it seems that many collaborative filtering methods can be used to

solve Problem 3.2.1. For example, the matrix factorization based collaborative filtering

(Baseline Solution 3.2.1) [15, 30] can be used, if I only use the partially observed time

series matrix X. If I further incorporate the contextual matrix S, it is essentially a social

recommendation problem (Baseline Solution 3.2.2) [28, 32].

35

Baseline Solution 3.2.1 Collaborative Filtering (via Matrix Factorization)

Given: a rank size l, and a rating matrix Xn⇥m with missing values indicated by W,

where Xij

denotes user i’s rating on item j and Xij

= 0 if there is no rating,

Find: user latent factor Un⇥l and item latent factor Zl⇥m such that X ⇡ W � (UZ).

The rating predictions are estimated as the non-zero values of ˜Xn⇥m, where ˜X = (1 �

W)� (UZ).

Baseline Solution 3.2.2 Social Collaborative Filtering (via joint matrix factorization).

Given: a rank size l, and a rating matrix Xn⇥m with missing values indicated by W,

where Xij

denotes user i’s rating on item j and Xij

= 0 if there is no rating; a social

network matrix Sn⇥n, where Sij

indicates how much user i trust or is influenced by user

j,

Find: a user latent factor Un⇥l, an item latent factor Zl⇥m and a social latent factor

Vl⇥n such that X ⇡ W � (UZ) and S ⇡ UV. The rating predictions are estimated as

the non-zero values of ˜Xn⇥m, where ˜X = (1�W)� (UZ).

Baseline Solution 3.2.1 and 3.2.2 can be represented by the graphical models shown in

Figure 3.2(a) and (b). Both solutions are to find the similarities between users indicated

by the user latent factor U. Baseline Solution 3.2.1 only uses the rating matrix, while

Baseline Solution 3.2.2 fuses the rating matrix with the social matrix and finds the user

latent factor shared by both matrices. Such similarities between time series also exist in

many multiple time series applications. For example, in time series obtained from a sensor

network, nearby sensors may collect similar measurements; in water quality dataset, the

chlorine concentration levels of two connected pipes can be very close.

However, in both Baseline Solution 3.2.1 and 3.2.2, the temporal smoothness, a

unique characteristic of time series data, is ignored. For instance, the sensor data (e.g.,

temperature/light/humidity) or the chlorine concentration levels collected in two consec-

utive time steps might not change a lot. Ignoring such temporal smoothness is likely

36

to lead to sub-optimal or even infeasible solutions of Problem 3.2.1. For example, both

Baseline Solution 3.2.1 and 3.2.2 would fail if a certain number of columns of X are

missing entirely.

3.2.3.2 Proposed Formulation

I propose a novel algorithm, Dynamic Contextual Matrix Factorization (DCMF),

based on joint matrix factorization and linear dynamic systems, as illustrated in Fig-

ure 3.2(c).

Z

X

U

(a) Matrix Factorization

Z

X

U

S

V

(b) Joint Matrix Factorization

z1 z2 zt�1 zt zt+1

x1 x2 xt�1 xt xt+1

U

SV

(c) Dynamic Contextual Matrix Factorization

Figure 3.2: Comparison between the baseline solutions (a and b) and the proposedformulation (c).

To model the temporal smoothness, the time series X is re-written as column vec-

tors (x1,x2, ...,xT

), where each xt

represents n dimensional data at time t; and the cor-

responding hidden factors Z re-written as (z1, z2, ..., zT ). Then zt is defined to be linear

37

to zt�1 with a transition process B and multivariate transition noise !t

⇠ N (0,⇤). The

state for the latent variable at first time step z1 is defined by z0 with multivariate Gaussian

noise !0 ⇠ N (0, 0):

z1 = z0 + !0, zt

= Bzt�1 + !

t

, (3.1)

Similarly to the joint matrix factorization model, xt

is defined to be linear to zt

through the latent factor U with multivariate Gaussian noise ✏t

⇠ N (0,⌃), sj

to be linear

to vj

through the latent factor U with multivariate Gaussian noise ⌧j

⇠ N (0,⌅):

xt

= Wt

� (Uzt

+ ✏t

), (3.2)

sj

= Uvj

+ ⌧j

. (3.3)

Zero-mean spherical Gaussian priors are placed on V under the assumption that each

entry of S is scaled to [0, 1]:

p(V|�) =nY

j=1

N (vj

|0,�). (3.4)

In order to accommodate the proposed model in sparse time series datasets, the

multivariate Gaussian noises are simplified by assuming that the transition noises and

observation noises are independent and identically distributed (i.i.d.), and the covariances

of latent variable {vj

} are i.i.d., i.e. vj

⇠ N (0, �2V

), and

!t

⇠ N (0, �2Z

), ✏t

⇠ N (0, �2X

), ⌧j

⇠ N (0, �2S

),

where t = 1, ..., T and j = 1, ..., n.

With Eq. (3.1)-(3.4) and the parameters ✓ = {B, z0, 0, �2Z

, �2X

, �2S

, �2V

}, the goal

38

here is to maximize the joint distribution described as follows:

argmax

Z,U,V,✓

p(Z,X,V,S,U) = p(z1)TY

t=2

p(zt

|zt�1)

| {z }temporal smoothness

TY

t=1

p(xt

|zt

,U)

| {z }observed time series

⇥nY

j=1

p(vj

)

nY

j=1

p(sj

|vj

,U)

| {z }contextual network

, (3.5)

where the model parameters is omitted in the equation.

3.2.3.3 Proposed Algorithm

In Eq. (3.5), U and zt

jointly determine xt

, and U and vj

jointly determine S. Due

to such coupling, it is difficult to find its global optimal solution. Hence, we aim to find

its local optimal instead, by grouping {zt

,vj

} and updating U and {zt

,vj

} alternatively.

Regarding U as a parameter of the model, the solution of the proposed model contains

parameters ✓ = {U,B, z0, 0, �2Z

, �2X

, �2S

, �2V

} and the latent factors {zt}, {vj

}.

To be specific, the expectation-maximization (E-M) algorithm is adopted to find a

local optimal solution of the model as follows. The algorithm first randomly sets the initial

values of the parameters ✓old, then iteratively performs E-M steps until the convergence

criterion is satisfied. In the E step, it infers the posteriors of the latent variables {zt}, {vj

}

based on the old parameters ✓old. In the M step, it updates the new parameters ✓new that

maximize the expectation of the distribution likelihood.

E Step: updating the latent factors

Given U, {zt

,xt

} are independent with {vj

, sj

}. The posteriors of {zt

} and {vj

}

can be estimated independently w.r.t. U (steps 3-12 in Algorithm 2).

Eq. (3.1) and (3.2) are very similar to linear dynamic systems, except that only

observed entries in xt

contribute to Eq. (3.5). Let ot

denote the indices of the observed

entries of xt

, and x⇤t

denote the observed entries of xt

, which is a compressed version of

39

xt

, Ht

is used to represent the corresponding compressed version of U. Formally, ot

, x⇤t

and Ht

can be formulated as:

ot

= {i|Wit

> 0, i = 1, ..., n},

x⇤t

= xt

(ot

, :), Ht

= U(ot

, :) (3.6)

x⇤t

= Ht

zt

+ ✏t

Then, the forward-backward algorithm [16] or Kalman Filter and RTS smoother [41, 44]

is applied to find the posteriors of the latent variables {zt

}. Define

p(zt

|x1, ...,xt

) = N (zt

|µt

, t

),

p(zt

|{x1, ...,xT

) = N (zt

|ˆµt

, ˆ t

). (3.7)

By using the forwarding algorithm, the following equations can be derived:

µt

= Bµt�1 +K

t

(x⇤t

�Ht

Bµt�1)

t

= (I�Kt

Ht

)Pt�1

Pt�1 = B

t�1B0+ �2

Z

I

Kt

= Pt�1H

0t

(Ht

Pt�1H

0t

+ �2X

I)�1 (3.8)

with the initial status:

µ1 = z0 +K1(x⇤1 �H1z0)

1 = (I�K1H1) 0

K1 = 0H01(H1 0H

01 + �2

X

I)�1 (3.9)

40

Applying the backward algorithm, we have:

ˆµt

= µt

+ Jt

(

ˆµt+1 �Bµ

t

)

ˆ t

= t

+ Jt

(

ˆ t+1 �P

t

)J0t

,

Jt

= t

B0(P

t

)

�1 (3.10)

E[zt

] =

ˆµt

E[zt

z0t�1] =

ˆ t

J0t�1 + ˆµ

t

ˆµ0t�1

E[zt

z0t

] =

ˆ t

+

ˆµt

ˆµ0t

(3.11)

Then, Bayes’ theorem is applied on Eq. (3.3) and (3.4) to obtain the posteriors

p(vj

|sj

) = N (vj

|⌫j

,⌥):

⌫j

= M�1U0sj

, ⌥ = �2S

M�1

M = U0U+ ��2V

�2S

I

E[vj

] = ⌫j

, E[vj

v0j

] = ⌥+ ⌫j

⌫ 0j

. (3.12)

M Step: updating the model parameters

In the M step, the old parameters are updated to maximize the expectation of the

log likelihood (step 13 in Algorithm 2):

✓new = argmax

✓

Q(✓, ✓old)

Q(✓, ✓old) = EZ,V|✓old [ln p(Z,X,V,S|✓)], (3.13)

where p(Z,X,V,S|✓) is defined in Eq. (3.5).

To obtain ✓new, the probabilities in Eq. (3.5) are substituted with the defined dis-

tributions, and then the derivatives of Q(✓, ✓old) are calculated w.r.t. each parameter, and

41

finally a new parameter estimation is obtained by setting each derivative to zero. The

parameters are updated as follows:

znew0 = E[z1]

new

0 = E[z1z01]� E[z1]E[z01]

Bnew

=

TX

t=2

E[zt

z0t�1]

! TX

t=2

E[zt�1z

0t�1]

!�1

(�2V

)

new

=

1

nl

nX

j=1

tr

�E[v

j

v0j

]

�

(�2Z

)

new

=

1

(T � 1)ltr

TX

t=2

E[zt

z0t

]

�Bnew

TX

t=2

E[zt�1z

0t

]�TX

t=2

E[zt

z0t�1](B

new

)

0

+Bnew

TX

t=2

E[zt�1z

0t�1](B

new

)

0

!(3.14)

For U, each row U(i, :) is updated individually:

U(i, :)new = A1A�12

A1 = ��2S

nX

j=1

Sij

E[v0j

] + (1� �)��2X

TX

t=1

Wij

Xit

E[z0t

]

A2 = ��2S

nX

j=1

E[vj

v0j

] + (1� �)��2X

TX

t=1

Wit

E[zt

z0t

], (3.15)

where � is to trade off the contributions of the contextual matrix and time series. With

42

Unew, �2S

and �2X

are then updated as follows:

(�2S

)

new

=

1

n2

"nX

j=1

�s0j

sj

� 2s0j

(UnewE[vj

])

�

+tr

Unew

(

nX

j=1

E[vj

v0j

])(Unew

)

0

!#

(�2X

)

new

=

1

PT

t=1

Pn

i=1 Wit

TX

t=1

tr (Hnew

t

E[zt

z0t

](Hnew

t

)

0)

+

TX

t=1

((x⇤t

)

0x⇤t

� 2(x⇤t

)

0(Hnew

t

E[zt

]))

!, (3.16)

whereP

T

t=1

Pn

i=1 Wit

is the number of the observed entries in X and Hnew

t

is derived

from Unew based on Eq. (3.6).

The Overall Algorithm

Putting everything together, the overall algorithm (DCMF) to solve Eq. (3.5) is

summarized in Algorithm 2. Given a network of time series R, the dimension of latent

factors l, and the weight of contextual information �, the algorithm aims to approximate

the values of the latent factors U,Z,V. It first randomly initializes the model parameters

✓ (step 1); and then iteratively infers the expectations of Z, V (step 3-11), and updates

the model parameters ✓ including U (step 13) until converge. The missing values can be

inferred from the reconstructed time series ˆX (step 16).

3.2.4 Preliminary Result

3.2.4.1 Experimental Setup

A. Baseline Methods. The proposed method is compared with the following state-of-the-

art algorithms: Probabilistic Matrix Factorization (PMF) [35], Social Recommendation

(SoRec) [32], SmoothPMF, SmoothSoRec, and DynaMMo [31]. I also compare DMF,

which is a simplified version of DCMF when � = 0.

B. Motes Dataset. Motes Dataset is used to conduct the experiments. The Motes dataset

43

Algorithm 2: Dynamic Contextual Matrix Factorization (DCMF)Input : a network of time series R = hX,W,S, ⇣i,

dimension of latent factors l,weight of contextual information �

Output: U,Z,V1 Initialize ✓ = {U, B, z0, 0, �

2Z

, �2X

, �2S

, �2V

};2 repeat3 for t=1:T do4 Construct H

t

based on Eq. (3.6);5 Estimate µ

t

, t

using Eq. (3.8),(3.9)6 end7 for t=T:1 do8 Estimate ˆµ

t

, ˆ t

, E[zt

z0t�1],E[ztz0t] using Eq. (3.10), (3.11)

9 end10 for j=1:n do11 Estimate E[v

j

],E[vj

v0j

] using Eq. (3.12)12 end13 Update ✓ by Eq. (3.14) - (3.16)14 until converge;15 Set Z = (

ˆµ1, ..., ˆµT

), V = (⌫1, ...,⌫T

)

16 Reconstructed ˆX = UZ

consists of temperature measurement from the sensors deployed at the Intel Berkeley

Research Lab [5] over a month. The dataset also contains the sensors’ locations (x-y

coordinates) and averaged connectivity probabilities over time (i.e. the probability of a

message from a sensor successfully reaching another).

Let dij

and cij

be the two-dimensional Euclidean distance and the connectivity prob-

ability between Sensor i and Sensor j, respectively, each entry of the contextual matrix S

is defined as follows:

Sij

= ↵ · (1� dij

/max

i,j

(dij

)) + (1� ↵) · cij

,

where ↵ controls the contributions of the two parts, and is set to be 0.5 in the experiments.

C. Evaluation Metrics. To evaluate the effectiveness, the root mean square error (RMSE)

44

is used, which is defined as follows:

RMSE =

vuutP

i,t

(1�Wit

)(Xit

� ˆXit

)

2

Pi,t

(1�Wit

)

,

where Xit

is the observed value and ˆXit

is the corresponding reconstructed value, and W

is the indicator matrix. Wit

= 1 indicates that Xit

is an observed value (or is selected in

the training set), and Wit

= 0 indicates that Xit

is a missing value (or is selected in the

test set).

3.2.4.2 Result

To evaluate all the algorithms, I incrementally generate training and test sets of the

Motes dataset with an increasing amount of test data (1%, 10%, ..., 99.9%) within time

series. For example, to increase the size of test set from 1% to 10%, I randomly move 9%

of the observed data from the training set to the test set. In this way, the subsequent test

set always contains the test data of the previous one.

Figure 3.3 shows the results on the Motes dataset. As can be seen, the proposed

DCMF algorithm consistently outperforms the other methods, especially when the per-

centage of the missing values is large.

3.2.5 Summary

In this section, a novel setting on mining a network of multiple coevolving time

series is introduced and a Dynamic Contextual Matrix Factorization (DCMF) algorithm is

proposed to infer the missing values from the input time series. The proposed algorithm is

comprehensive, simultaneously capturing (a) the correlation among different time series;

(b) the contextual information encoded by the contextual network; and (c) the temporal

smoothness along the time dimension.

45

10 70 95 99.9 99.950

5

10

15

20

25

missing value%

RM

SE

DCMFDMFdynaMMoSmoothSoRecSmoothPMFSoRecPMF

Figure 3.3: Evaluation on missing value recovery in Motes Dataset (n=54,T=14400).Lower is better. DCMF outperforms all of the competitors.

CHAPTER 4

Proposed Work and Timeline

4.1 Outline of Proposed Work

My proposal focus on network monitoring and data analysis in wireless networks.

In the first part, I explore security monitoring in 802.11 based networks. I first identify

the challenges of security monitoring for network forensics, especially in the metropolitan

area and propose a distributed security monitoring system for wireless network forensics

- SMoWF (Chapter 2.1).

Based on the ubiquity of WiFi networks and mobile devices, I then propose a phys-

ical intruder detection system- Beagle (Chapter 2.2). It is low-cost and is easy to de-

ploy comparing to current expensive and complex home security systems. In Beagle,

the proposed MAC-based detection algorithm works for the intruders carrying with WiFi

devices. I aim to design and implement the RSSI-variance-based method to detect the

intruders without carrying WiFi devices or with their WiFi networks turned off. I plan

to study RSSI distributions in different time windows using histograms, and characterize

abnormal distributions to infer the presence of an intruder. In addition, I will leverage

neighbors’ APs in the algorithm to extend the sensing capability.

In the second part, I look into the analysis of the big network data, resulting from the

fast advances of wireless network technologies. I first focus on the study of smartphone

apps’ usage patterns, and analyze long-term anonymized traffic data for app usages. The

analysis provides valuable insights on the usage patterns and trend changes of a large

variety of smartphone apps (Chapter 3.1).

Then, I propose a Dynamic Contextual Matrix Factorization algorithm (DCMF) to

model a network of coevolving time series, which appear in many high-impact applica-

46

47

tions, especially in wireless sensor networks (Chapter 3.2). There are two tunable pa-

rameters in the proposed algorithm. Naturally, a next step is to automatically learn these

two parameters. Furthermore, except for missing value recovery, prediction and anomaly

detection are two important tasks in the applications of wireless sensor networks. I aim

to adapt the DCMF algorithm to accomplish these two tasks. In addition, it is observed

that lag correlations frequently exist in coevolving time series, which makes it necessary

to extend the DCMF algorithm to capture lag effects. Finally, it is very common that

network topology changes very often in wireless sensor networks. I would like to extend

the DCMF algorithm to make it work for the time series data from dynamic networks.

4.2 Timeline for Completion

The work plan to complete the dissertation is as follows:

1. November 2014 - December 2014, Design and Implementation of the RSSI-variance-

based detection method for the Beagle system.

2. January 2015 - April 2015,

(a) Analyze the efficiency of the DCMF algorithm and evaluate the DCMF algo-

rithm in terms of missing value recovery on multiple real datasets.

(b) Adapt the DCMF algorithm for prediction and complete the experimental

evaluations.

(c) Automatically learn the dimension of latent factors.

(d) Extend the DCMF algorithm to capture lag effects in a network of time series.

(e) Extend the DCMF algorithm for dynamic networks.

3. April 2015 - May 2015, Write the dissertation and prepare for defense.

4. Finally, Defend my thesis in May 2015.

CHAPTER 5

Conclusion

My proposal is dedicated to address the challenges in network measurement and man-

agement. In the first part of this proposal, I investigate the impact of wireless network

(specifically wireless local area network) on security monitoring. I first identify the secu-

rity challenge resulted from the easy-to-access WiFi networks and propose a distributed

security monitoring system for wireless network forensics. Then, I explore how WiFi

networks can be used to improve home security. Specifically, I aim to design and de-

velop a low-cost home security system to detect unexpected physical intruders based on

existing WiFi infrastructures. In the second part of this proposal, I focus on analyzing the

big data that wireless networks produce. Specifically, I investigate effective and efficient

approaches in analyzing network data generated by smartphone apps and wireless sensor

networks.

48

REFERENCES

[1] Cisco visual networking index: Forecast and methodology, 20132018.http://www.cisco.com/c/en/us/solutions/collateral/

service-provider/ip-ngn-ip-next-generation-network/

white_paper_c11-481360.pdf.

[2] Euclid analytics. http://euclidanalytics.com/.

[3] Haversine formula.http://en.wikipedia.org/wiki/haversine_formula.

[4] Kismet. https://www.kismetwireless.net/.

[5] Motes dataset. http://db.csail.mit.edu/labdata/labdata.html.

[6] Number of landline subscribers.http://www.itu.int/en/ITU-D/Statistics/Documents/

statistics/2014/Fixed_tel_2000-2013.xls.

[7] Presence orb. http://www.presenceorb.com/.

[8] This recycling bin is following you. http://qz.com/112873/this-recycling-bin-is-following-you/.

[9] tshark.https://www.wireshark.org/docs/man-pages/tshark.html.

[10] Turnstyle solutions. https://www.getturnstyle.com/.

[11] Wireless sensor network.http://en.wikipedia.org/wiki/Wireless_sensor_network.

[12] Paramvir Bahl, Jitendra Padhye, Lenin Ravindranath, Manpreet Singh, AlecWolman, and Brian Zill. Dair: A framework for managing enterprise wirelessnetworks using desktop infrastructure. In Proceedings of the Annual ACMWorkshop on Hot Topics in Networks (HotNets), 2005.

[13] Paramvir Bahl and Venkata N Padmanabhan. Radar: An in-building rf-based userlocation and tracking system. In INFOCOM 2000. Nineteenth Annual JointConference of the IEEE Computer and Communications Societies. Proceedings.IEEE, volume 2, pages 775–784. Ieee, 2000.

49

http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.pdf



http://euclidanalytics.com/

http://en.wikipedia.org/wiki/haversine_formula

https://www.kismetwireless.net/

http://db.csail.mit.edu/labdata/labdata.html

http://www.itu.int/en/ITU-D/Statistics/Documents/statistics/2014/Fixed_tel_2000-2013.xls

http://www.itu.int/en/ITU-D/Statistics/Documents/statistics/2014/Fixed_tel_2000-2013.xls

http://www.presenceorb.com/

http://qz.com/112873/this-recycling-bin-is-following-you/

http://qz.com/112873/this-recycling-bin-is-following-you/

https://www.wireshark.org/docs/man-pages/tshark.html

https://www.getturnstyle.com/

http://en.wikipedia.org/wiki/Wireless_sensor_network

50

[14] Anand Balachandran, Geoffrey M. Voelker, Paramvir Bahl, and P. Venkat Rangan.Characterizing user behavior and network performance in a public wireless lan.SIGMETRICS Perform. Eval. Rev., 30(1):195–205, June 2002.

[15] Robert Bell, Yehuda Koren, and Chris Volinsky. Modeling relationships at multiplescales to improve accuracy of large recommender systems. In KDD, pages 95–104.ACM, 2007.

[16] Christopher M Bishop et al. Pattern recognition and machine learning, volume 1.springer New York, 2006.

[17] Yongjie Cai and Ping Ji. A measurement study for understanding wireless forensicmonitoring. In ICDFI, 2012.

[18] Xi Chen, Andrea Edelstein, Yunpeng Li, Mark Coates, Michael Rabbat, andAidong Men. Sequential monte carlo for simultaneous passive device-free trackingand sensor localization using received signal strength measurements. InInformation Processing in Sensor Networks (IPSN), 2011 10th InternationalConference on, pages 342–353. IEEE, 2011.

[19] Yu-Chung Cheng, John Bellardo, Peter Benko, Alex C Snoeren, Geoffrey MVoelker, and Stefan Savage. Jigsaw: solving the puzzle of enterprise 802.11analysis, volume 36. ACM, 2006.

[20] Yu-Chung Cheng, Yatin Chawathe, Anthony LaMarca, and John Krumm.Accuracy characterization for metropolitan-scale wi-fi localization. In Proceedingsof the 3rd international conference on Mobile systems, applications, and services,pages 233–245. ACM, 2005.

[21] Krishna Chintalapudi, Anand Padmanabha Iyer, and Venkata N Padmanabhan.Indoor localization without the pain. In Proceedings of the sixteenth annualinternational conference on Mobile computing and networking, pages 173–184.ACM, 2010.

[22] Trinh Minh Tri Do, Jan Blom, and Daniel Gatica-Perez. Smartphone usage in thewild: a large-scale analysis of applications and context. In Proceedings of the 13thinternational conference on multimodal interfaces, ICMI ’11, pages 353–360, NewYork, NY, USA, 2011. ACM.

[23] David Eckhardt and Peter Steenkiste. Measurement and analysis of the errorcharacteristics of an in-building wireless network. In ACM SIGCOMM Computercommunication review, volume 26, pages 243–254. ACM, 1996.

[24] Hossein Falaki, Dimitrios Lymberopoulos, Ratul Mahajan, Srikanth Kandula, andDeborah Estrin. A first look at traffic on smartphones. In Proceedings of the 10thACM SIGCOMM conference on Internet measurement, IMC ’10, pages 281–287,New York, NY, USA, 2010. ACM.

51

[25] Hossein Falaki, Ratul Mahajan, Srikanth Kandula, Dimitrios Lymberopoulos,Ramesh Govindan, and Deborah Estrin. Diversity in smartphone usage. InProceedings of the 8th international conference on Mobile systems, applications,and services, MobiSys ’10, pages 179–194, New York, NY, USA, 2010. ACM.

[26] Camden C Ho, Krishna N Ramachandran, Kevin C Almeroth, and Elizabeth MBelding-Royer. A scalable framework for wireless network monitoring. InProceedings of the 2nd ACM international workshop on Wireless mobileapplications and services on WLAN hotspots, pages 93–101. ACM, 2004.

[27] Jun Hou, Chengdong Wu, Zhongjia Yuan, Jiyuan Tan, Qiaoqiao Wang, and YunZhou. Research of intelligent home security surveillance system based on zigbee.In Intelligent Information Technology Application Workshops, 2008. IITAW’08.International Symposium on, pages 554–557. IEEE, 2008.

[28] Meng Jiang, Peng Cui, Rui Liu, Qiang Yang, Fei Wang, Wenwu Zhu, and ShiqiangYang. Social contextual recommendation. In CIKM, pages 45–54, New York, NY,USA, 2012. ACM.

[29] Yoon-Gu Kim, Han-Kil Kim, Suk-Gyu Lee, and Ki-Dong Lee. Ubiquitous homesecurity robot based on sensor network. In Proceedings of the IEEE/WIC/ACMInternational Conference on Intelligent Agent Technology, IAT ’06, pages700–704, Washington, DC, USA, 2006. IEEE Computer Society.

[30] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniquesfor recommender systems. Computer, 42(8):30–37, 2009.

[31] Lei Li, James McCann, Nancy S Pollard, and Christos Faloutsos. Dynammo:Mining and summarization of coevolving sequences with missing values. In KDD,pages 507–516. ACM, 2009.

[32] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King. Sorec: socialrecommendation using probabilistic matrix factorization. In CIKM, pages931–940. ACM, 2008.

[33] Ratul Mahajan, Maya Rodrig, David Wetherall, and John Zahorjan. Analyzing themac-level behavior of wireless networks in the wild. In ACM SIGCOMMComputer Communication Review, volume 36, pages 75–86. ACM, 2006.

[34] Gregor Maier, Fabian Schneider, and Anja Feldmann. A first look at mobilehand-held device traffic. In Proceedings of the 11th international conference onPassive and active measurement, PAM’10, pages 161–170, Berlin, Heidelberg,2010. Springer-Verlag.

[35] Andriy Mnih and Ruslan Salakhutdinov. Probabilistic matrix factorization. InAdvances in neural information processing systems, pages 1257–1264, 2007.

52

[36] Neal Patwari and Joey Wilson. Rf sensor networks for device-free localization:Measurements, models, and algorithms. Proceedings of the IEEE,98(11):1961–1973, 2010.

[37] Nissanka Bodhi Priyantha. The cricket indoor location system. PhD thesis,Massachusetts Institute of Technology, 2005.

[38] Feng Qian, Zhaoguang Wang, Alexandre Gerber, Zhuoqing Mao, Subhabrata Sen,and Oliver Spatscheck. Profiling resource usage for mobile applications: across-layer approach. In Proceedings of the 9th international conference on Mobilesystems, applications, and services, MobiSys ’11, pages 321–334, New York, NY,USA, 2011. ACM.

[39] Mika Raento, Antti Oulasvirta, and Nathan Eagle. Smartphones: An EmergingTool for Social Scientists. Sociological Methods Research, 37(3):426–454,February 2009.

[40] Theodore S Rappaport et al. Wireless communications: principles and practice,volume 2. prentice hall PTR New Jersey, 1996.

[41] Herbert E Rauch, CT Striebel, and F Tung. Maximum likelihood estimates oflinear dynamic systems. AIAA journal, 3(8):1445–1450, 1965.

[42] David Schwab and Rick Bunt. Characterising the use of a campus wirelessnetwork. In INFOCOM 2004. Twenty-third AnnualJoint Conference of the IEEEComputer and Communications Societies, volume 2, pages 862–870. IEEE, 2004.

[43] M Zubair Shafiq, Lusheng Ji, Alex X Liu, and Jia Wang. Characterizing andmodeling internet traffic dynamics of cellular devices. In Proceedings of the ACMSIGMETRICS joint international conference on Measurement and modeling ofcomputer systems, pages 305–316. ACM, 2011.

[44] John Z Sun, Kush R Varshney, and Karthik Subbian. Dynamic matrix factorization:A state space approach. In ICASSP, pages 1897–1900. IEEE, 2012.

[45] Diane Tang and Mary Baker. Analysis of a local-area wireless network. InProceedings of the 6th annual international conference on Mobile computing andnetworking, pages 1–10. ACM, 2000.

[46] Ionut Trestian, Supranamaya Ranjan, Aleksandar Kuzmanovic, and Antonio Nucci.Measuring serendipity: connecting people, locations and interests in a mobile 3gnetwork. In Proceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference, IMC ’09, pages 267–279, New York, NY, USA, 2009.ACM.

[47] Chieh-Chih Wang, Charles Thorpe, Sebastian Thrun, Martial Hebert, and HughDurrant-Whyte. Simultaneous localization, mapping and moving object tracking.The International Journal of Robotics Research, 26(9):889–916, 2007.

53

[48] Joey Wilson and Neal Patwari. Radio tomographic imaging with wirelessnetworks. Mobile Computing, IEEE Transactions on, 9(5):621–632, 2010.

[49] Joey Wilson and Neal Patwari. See-through walls: Motion tracking usingvariance-based radio tomography networks. Mobile Computing, IEEE Transactionson, 10(5):612–621, 2011.

[50] Qiang Xu, Jeffrey Erman, Alexandre Gerber, Zhuoqing Mao, Jeffrey Pang, andShobha Venkataraman. Identifying diverse usage behaviors of smartphone apps. InProceedings of the 2011 ACM SIGCOMM conference on Internet measurementconference, IMC ’11, pages 329–344, New York, NY, USA, 2011. ACM.

[51] Yang Zhao, Neal Patwari, Jeff M Phillips, and Suresh Venkatasubramanian. Radiotomographic imaging and tracking of stationary and moving people via kerneldistance. In Proceedings of the 12th international conference on Informationprocessing in sensor networks, pages 229–240. ACM, 2013.

NETWORK MONITORING AND DATA ANALYSIS IN ......wireless network forensics. Then, I explore how WiFi...

Documents

Transcript of NETWORK MONITORING AND DATA ANALYSIS IN ......wireless network forensics. Then, I explore how WiFi...