NETWORK MONITORING AND DATA ANALYSIS IN ......wireless network forensics. Then, I explore how WiFi...
Transcript of NETWORK MONITORING AND DATA ANALYSIS IN ......wireless network forensics. Then, I explore how WiFi...
NETWORK MONITORING AND DATA ANALYSIS INWIRELESS NETWORKS
By
Yongjie Cai
A Dissertation Proposal Submitted to the Graduate
Faculty in Computer Science
in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY,
The Graduate Center, City University of New York
2014
Doctoral Committee:
Prof. Ping Ji, City University of New YorkProf. Spiros Bakiras, City University of New YorkProf. Changhe Yuan, City University of New YorkProf. Hanghang Tong, Arizona State University
c� Copyright 2014
by
Yongjie Cai
All Rights Reserved
ii
CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Network Measurement in Security Monitoring of WiFi Networks . . . . . 31.1.1 Security Monitoring for Wireless Network Forensics . . . . . . . 31.1.2 Wireless Monitoring for Home Security . . . . . . . . . . . . . . 4
1.2 Data Analysis: Pattern Discovery and Missing Value Recovery . . . . . . 51.2.1 Usage Pattern Discovery in Smartphone Apps . . . . . . . . . . . 51.2.2 Missing Value Recovery in Wireless Sensor Network . . . . . . . 6
1.3 Structure of This Proposal . . . . . . . . . . . . . . . . . . . . . . . . . 7
2. Security Monitoring in WiFi Networks . . . . . . . . . . . . . . . . . . . . . . 8
2.1 SMoWF: Security Monitoring for Wireless Network Forensics . . . . . . 82.1.1 Overview of SMoWF . . . . . . . . . . . . . . . . . . . . . . . . 92.1.2 Traffic Capture and Preprocess . . . . . . . . . . . . . . . . . . . 112.1.3 Evidence Preservation and Post-investigation . . . . . . . . . . . 122.1.4 Experiments and System Prototype . . . . . . . . . . . . . . . . 132.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Beagle: Physical Intruder Detection using WiFi networks . . . . . . . . . 182.2.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . 192.2.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2.3 MAC-based Detection . . . . . . . . . . . . . . . . . . . . . . . 212.2.4 RSSI-variance-based Detection . . . . . . . . . . . . . . . . . . 272.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3. Pattern Discovery and Missing Value Recovery in Data Analysis . . . . . . . . 30
3.1 Understanding the Dynamic Landscape of Smartphone App Usage . . . . 30
3.2 Fast Mining of a Network of Coevolving Time Series . . . . . . . . . . . 313.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 33
iii
3.2.3 Dynamic Contextual Matrix Factorization . . . . . . . . . . . . . 343.2.4 Preliminary Result . . . . . . . . . . . . . . . . . . . . . . . . . 423.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4. Proposed Work and Timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Outline of Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Timeline for Completion . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
iv
LIST OF TABLES
2.1 Study of Popular Smartphones’ Probing Behaviors. . . . . . . . . . . . . . 23
3.1 Symbols and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
v
LIST OF FIGURES
1.1 Task Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 A typical wireless monitoring system . . . . . . . . . . . . . . . . . . . . . 9
2.2 Testbed and Walking Path. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Distribution of Observed 802.11 Packet Types. . . . . . . . . . . . . . . . . 14
2.4 Average Error Distances for Device Localization. . . . . . . . . . . . . . . 15
2.5 System Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Active Scanning Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Regular Used Smartphones’ Waiting Times in Worst Case Scenarios. . . . . 23
2.8 Study of signal strength from probe message. . . . . . . . . . . . . . . . . . 24
2.9 MAC-based Detection: Detecting intruders carrying WiFi devices. . . . . . 25
2.10 RSSI-variance-based Detection: Detecting intruders without carrying WiFidevices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.11 Effects of moving people on RSSI of Beacons (x axis is time in second, yaxis is RSSI). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 An illustrative example of a network of time series. (a) A simplified sensornetwork. (b) Measured temperature time series, each of which correspondsto a node with the same color of the sensor network in (a). The multiple timeseries in (b) are inter-connected with each other by its embedded network in(a). (Best viewed in color). . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Comparison between the baseline solutions (a and b) and the proposed for-mulation (c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Evaluation on missing value recovery in Motes Dataset (n=54,T=14400).Lower is better. DCMF outperforms all of the competitors. . . . . . . . . . 45
vi
ABSTRACT
Various wireless network technologies have been created to meet the ever-increasing de-
mand for wireless access to the Internet, such as wireless local area network, cellular
network and sensor network and many more. The communication devices have trans-
formed from large computational servers to small wireless hand-held devices, ranging
from laptops, tablets, smartphones to small sensors. The advances of these wireless net-
works (e.g., faster network speed) and their intensive usages result in an enormous growth
of network data in terms of volume, diversity, and complexity. All of these changes have
raised complicated network measurement and management issues.
In this proposal, I first investigate the impact of wireless local area network in home
and network security monitoring. Then I propose effective and efficient approaches in
analyzing network data, particularly those generated by smartphone apps and sensor net-
works.
vii
CHAPTER 1
Introduction
Various wireless network technologies have been created to meet the ever-increasing de-
mand for wireless access to the Internet, such as wireless local area network, cellular net-
work and sensor network and many more. The communication devices have transformed
from large computational servers to small wireless hand-held devices, ranging from lap-
tops, tablets, smartphones to small sensors. It is reported that wireless traffic accounted
for the majority of IP traffic at 44 percent in the last year. It is expected that traffic from
wireless and mobile devices will exceed traffic from wired devices by 2018, and by 2018,
wired devices will account for 39 percent of IP traffic, while WiFi and mobile devices
will account for 61 percent of IP traffic [1].
With easy deployment and low costs, WiFi networks provide Internet access at
homes, university and corporate campuses, and public places such as libraries, parks, and
coffee shops. While people enjoy the convenience of connecting electronic devices, such
as laptops, tablets, smartphones, video-game consoles and digital cameras to the Internet
via wireless access points, the wireless channels are prone to different types of attacks and
subject to performance problems such as congestions and high packet losses all the time.
Therefore, monitoring network activities in Wireless Networks to facilitate security and
network performance management is an important area of research. In my work, I will
explore the infrastructure and protocol design of network security monitoring for WiFi
networks.
Another type of fast evolving wireless networks is Cellular Network. With a rapid
pace, traditional landline telephones are being replaced by Voice over Internet Protocol
(VoIP) phones or cellphones. With the amazing smartphone technologies, the younger
generations have almost dumped landline phones (with either PSTN or VoIP service)
1
2
completely. Many statistics have shown that a significantly less number of households
own landline phones, and smartphones and other wireless enabled smart devices have
claimed a dominate role in remote voice communication [6]. It is also an irreversible
trend that the smartphones are replacing desktop computers or even laptops. We have seen
students doing their homework on smartphones, commercial stores swiping customers’
credit cards on smartphones and doctors writing prescriptions on smartphones. There-
fore, understanding how people use their smartphones and the applications is valuable for
social scientists to study human behaviors, and provides insight for app developers, data
service carriers and content providers from both technical and commercial perspectives.
In addition, a third type of wireless networks - Sensor Network has become an
important component in data communications as well. Research papers and practical
projects have been proposed and conducted in various types of application scenarios for
sensor networks. It ranges from networks of sensors with single and simple modalities
(e.g., temperature, air pressure, acoustic) to complicated networks of sensors that provide
data of mixed modalities and continuous readings. For example, sensor networks are
deployed to monitor the concentration of dangerous gases in the cities, detect forest fires,
monitor water quality in industries, measure temperature, humanity, and light both in
industries and residential houses, etc. [11]. These networks generate a large amount of
complex data in the format of coevolving multiple time series, which is very challenging
to manage. In this thesis, I will investigate the data characteristics of such complicated
sensor networks.
As Figure 1.1 shows, my thesis work focuses on the network measurement and
data analysis perspectives of three different types of wireless networks, which are tightly
related to people’s life.
3
People’s Life
WiFi Networks
Security Monitoring
Wireless Network ForensicsChapter 2.1
Home SecurityChapter 2.2
Cellular Networks(Smartphone Apps)
Usage Behavior StudyChapter 3.1
Wireless Sensor Networks
Time Series ModelingChapter 3.2
Figure 1.1: Task Breakdown
1.1 Network Measurement in Security Monitoring of WiFi Networks
The 802.11 based wireless network (known as WiFi) works as a ubiquitous com-
munication infrastructure because of its capabilities of low cost, easy deployment, high
network bandwidth and its inherent convenience for mobile devices. Many places such as
university campuses, corporate and government offices, public parks, or even entire urban
cities are blanketed with Internet access through numerous access points (APs).
In the first part of this proposal, I investigate the impact of wireless local area net-
work on security monitoring. I first identify the challenges in security monitoring resulted
from prevalent WiFi networks and propose a distributed security monitoring system for
wireless network forensics. Then, I explore how WiFi networks can be used to improve
home security.
1.1.1 Security Monitoring for Wireless Network Forensics
As wireless networks and mobile devices are claiming dominant roles in modern
communication technology, the manifestations of digital crimes have started integrating
wireless elements as well, which makes digital forensic investigation unprecedentedly
difficult. For example, a hacker may drive on the street, randomly pick an open WiFi
network, conveniently connect to the access point, upload or download malicious files
4
through the access point, then close the session and drive away. The whole process may
only take minutes to accomplish, and when the victim machine notices the attack, the best
point of interest that it can trace back is very likely only the benign access point, through
which the true attacker conducted the malicious activity. It is almost always certain that
the hacker will be cut loose.
In this work, a distributed Security Monitoring system for Wireless network Foren-
sics (SMoWF) is proposed to monitor Wireless LAN activities. Abstracts of network
traces are captured and selectively recorded at each monitoring point. Distributed mon-
itoring points collaborate to reconstruct the crime scene based on stored logs, and the
SMoWF system aims to answer an important investigation question: which device ap-
peared at where during what time.
1.1.2 Wireless Monitoring for Home Security
Home security is a big concern to everyone. Current home security systems rely on
many sensors [27, 29], such as door and window sensors, motion sensors, heat sensors,
etc. It is complex to install tens of such sensors. The cost to purchase and maintain these
sensors would be very high. Hence, many home owners are not willing to equip their
homes with security systems.
On the other hand, smartphones with WiFi turned on actively send probe request
frames with MAC addresses to search for available WiFi networks. MAC addresses are
already used to profile and track actual and potential shoppers and customers for stores
by commercial companies [2,7,10]. In addition, the signal propagations of wireless links
would be affected by people moving around, which motivates lots of research work on
detecting and tracking moving objects using wireless signals [18, 36, 47–49, 51].
Based on these two facts, I propose to design and develop a low-cost home secu-
rity system called Beagle to detect unexpected physical intruders, prevent property loss
and keep homes safe. Unlike traditional home security systems relying on placing sets of
5
sensors inside the house, Beagle is based on existing WiFi infrastructures with ubiquitous
WiFi signals penetrating the walls and thus extending our senses. With all the functional-
ities including monitoring, detection, and alarming built into one box, it is easy for users
to deploy and use the Beagle system.
1.2 Data Analysis: Pattern Discovery and Missing Value Recovery
Along with the fast advances of various wireless network technologies, the network
data have increased dramatically in terms of volume, diversity, and complexity. In the
second part of this proposal, I investigate data analysis approaches that are effective and
efficient in examining network data generated by smartphone apps and sensor networks.
1.2.1 Usage Pattern Discovery in Smartphone Apps
The use of smartphones has gone through a booming growth over the past few years
and become an integral part of people’s daily life. Through a plethora of smartphone apps,
people are indulging in entertainment, staying connected in social life, acquiring news
and information, conducting business activities, and managing nearly every aspects of
personal life — health, travel, education, finance, shopping, banking, dating, etc. Under-
standing how people use their smartphones and apps, the prevalence of various kinds of
smartphone apps, and their usage patterns and trends, is a great way for social scientists to
learn about the modern society [39]. It can also provide valuable insights for application
developers, service carriers and content providers to deliver a better user experience [38].
In this study, I analyze two-year long anonymized traffic data of smartphone us-
ages from both a major US-based cellular service provider and a large regional residential
DSL service provider. The analysis provides valuable insights on the usage patterns and
trend changes of a large variety of smartphone apps that people use nowadays. I compare
and contrast the usage patterns when smartphone users are mobile (accessing via cellular
network) and when they are at home (accessing through home WiFi). I also discover in-
6
teresting phenomena such as a version upgrade of an app resulting in a disruptive pattern
in traffic volume but not in its distinct user number. I further develop an automated alert-
ing system of such disruptive patterns for cellular service providers, allowing a timely
attention and response to the potential performance problems that they introduce.
1.2.2 Missing Value Recovery in Wireless Sensor Network
Wireless sensor networks have become an essential part of computer networks and
have been applied in many areas, ranging from environmental monitoring, industrial mon-
itoring, home security to physiological signaling in health care applications and many
more. These applications usually deploy a number of sensor nodes to collaboratively ac-
complish their tasks. The sensor nodes are prone to failures, and the topology of a sensor
network might change frequently. All of these characteristics of wireless sensor networks
make the analysis management of sensor data very challenging. In this work, I focus on
modeling data generated from wireless sensor networks, specifically, data in the format
of coevolving multiple time series with missing values.
In many scenarios, the multiple time series data is often accompanied by some con-
textual information in the form of networks. For example, in a smart building equipped
with various sensors, each sensor generates a time series by measuring a specific type
of room condition (e.g., temperature, humidity, etc.) over time. All these sensors or
time series are connected with each other by the underlying sensor network; in epilepsy
study, the various sensor measures collected from different brain regions of a patient are
essentially inter-connected by his or her brain graph; in scholarly article citation analy-
sis, the citations of different journals or conferences are correlated particularly for those
from similar domains. Such multiple time series, together with its embedded network are
referred as a network of coevolving time series.
In order to unveil the underlying patterns of a network of coevolving time series, in
particular to fill in the missing values of one or more input time series, I adapt the concept
7
of collaborative filtering that multiple coevolving time series and/or the contextual net-
work are determined by a few latent factors, and propose DCMF, a dynamic contextual
matrix factorization algorithm to simultaneously find the common latent factors in both
the input time series and its embedded network.
1.3 Structure of This Proposal
The rest of this proposal is organized as follows: I presents the studies of security
monitoring in WiFi Networks in Chapter 2. In Chapter 3, I study usage behaviors of
smartphone apps, and propose an effective and efficient algorithm to model coevolving
multiple time series generated by sensor networks. Then, I describes the proposed work
motivated by the preliminary results of Chapter 2 and Chapter 3, and presents the time-
line to complete the work in Chapter 4. Finally, I make conclusions of this proposal in
Chapter 5.
CHAPTER 2
Security Monitoring in WiFi Networks
2.1 SMoWF: Security Monitoring for Wireless Network Forensics
Cybercrime is an exploding security challenge in the current digital age, and has
been largely concerned by the public over the past several decades. With the escalating
deployment of WiFi networks, the accelerated usage of mobile devices, and the dynamic
physical and protocol characteristics of wireless communication, wireless links have be-
come an increasingly popular channel for cyber criminals to camouflage their true iden-
tities. For example, a malicious hacker may walk on the street, randomly pick an open
Access Point (AP) such as one from a coffee shop, conveniently connect to the AP, upload
and/or download malicious files or send commands through the AP to Zombie machines
compromised for a distributed attack, then close the communication session and take off
freely. The whole process may only take minutes to accomplish. No one would even
notice what has happened until the victim device(s) detects an intrusion. When an at-
tack has been identified, the investigators will easily run into the situation where the best
point-of-interest that can be traced back is the benign access point, through which the true
attacker conducted the malicious activity. Since most WiFi users do not regularly keep
security logs and monitor the activities of their wireless network, it is almost certain that
the hacker is off the hook.
An intuitive way is to ask users to keep SNMP logs in APs and record partial traffic
statistics such as application workloads, and user sessions by some traditional measure-
ment methods applied on the wired part of WiFi network traffic [14,23,42,45]. However,
only limited knowledge can be obtained in the wireless part (PHY/MAC layers under IP
layer) through such wired measurement efforts. Researchers also have proposed several
8
9
wireless monitoring infrastructure systems, primarily for improving wireless channel and
protocol performance. Figure 2.1 demonstrates a framework of a typical wireless mon-
itoring system, which consists of three main components: wireless network monitoring
points, a data repository and a centric processing engine. Each monitoring point gathers
network information by passively capturing traffic in the air, and transmits the gathered
raw data to the repository. The centric processing engine conducts network analysis and
reports events of interest to network operators. DAIR [12], Jigsaw [19] and Wit [33]
are three such systems built on this infrastructure. These two monitoring methods might
work in the organizations with network specialists, such as corporates and universities.
However, it is impractical to expect users to do similar network administrations at home.
!
Monitoring!Point! Monitoring!Point!
Repository!Centric!
Processing!Engine!
!
User!Interaction!
…!
Figure 2.1: A typical wireless monitoring system
In this work, I propose to design a Security Monitoring system for Wireless Net-
work Forensics (SMoWF), which aims to keep trace digests of WiFi activities and to an-
swer the important investigation question: which device appeared at where during what
time.
2.1.1 Overview of SMoWF
The emerging and increasing growth of WiFi wireless networking technology makes
it possible to connect to the Internet from anywhere at anytime. For example, in a wire-
less network measurement study [17], we conducted experiments around a three-block
metropolitan neighborhood of the upper-west side of Manhattan, and observed over 3000
access points in the neighborhood. The densely deployed WiFi networks are undoubtedly
10
making our life much easier and more enjoyable, but they also provide more opportunities
for malicious users to conduct criminal activities through mobile devices. We notice that
among our observed access points, about 30 percent provide unencrypted WiFi services.
In other words, these open networks can be easily compromised. In this work, I propose
a security monitoring infrastructure for wireless network forensics (SMoWF), which is
to build an intelligent monitoring system that can uncover malicious devices, track their
activities in Wireless LAN of metropolitan areas, and preserve digital evidence to facili-
tate future cyber crime investigation. The SMoWF system should be able to answer the
following questions:
• Whether or not a particular device was involved in a given malicious network ac-
tivity?
• Can this device be uniquely identified by the logs?
• Where was a particular device physically located during a given event?
Similar to Figure 2.1, the SMoWF system consists of a set of monitoring points that are
responsible to capture wireless network traffic. These monitoring points are distributed
through a wireless network and may be moved around to cover Wireless LANs as much
as possible. After the collection of raw traffic data, SMoWF parses raw data into human-
readable texts, eliminates irrelevant traffic types and extracts useful information for device
identification and localization. It also removes the payload (i.e., data part) of traffic pack-
ets to protect users’ privacy. SMoWF uses a central repository to store processed data as
digital evidence. Finally, it includes a post-investigation engine that helps investigators to
figure out what was going on when a criminal activity occurred. The post-investigation
engine retrieves relevant data from the evidence repository and is able to answer the afore-
mentioned questions.
11
2.1.2 Traffic Capture and Preprocess
Comparing to those monitoring systems deployed in buildings or universities [26]
[19], there are several challenges of wireless traffic monitoring in a metropolitan area:
1) the number of access points that are observable (i.e., accessible to malicious users) is
very large; 2) the locations and distributions of access points are unknown; 3) we do not
have permissions to administrate these access points. Therefore, the traditional ways of
obtaining traffic are not practical. We cannot configure all of these access points to log
their real-time traffic, nor can we deploy thousands of static stand-alone monitoring nodes
to cover the whole area.
The SMoWF system delegates the traffic capturing tasks to wireless monitoring
points, such as laptops, tablets or even smartphones, being either stationary or mobile.
These monitoring points passively capture nearby wireless network traffic, and periodi-
cally upload the encrypted or hashed traffic logs to a central repository. Particularly, in
our experiments, we use Kismet [4] installed on a MacBook Pro to gather raw wireless
network traffic. Kismet is a 802.11 wireless network sniffer working with any wireless
card supporting monitoring mode, and detects networks by passively collecting network
packets. It can provide GPS coordinates where packets are observed when integrated with
a GPS device. Kismet generates two important log files - pcapdump and gpsxml. All the
packet information above the MAC layer and the Per-Packet Information (PPI) header
that includes the transmitting channel, received signal and noise strength, are logged in
pcapdump files. GPS information such as coordinates and speed are recorded into gpsxml
files. Kismet gathers raw packets into pcapdump files and records monitoring locations
into gpsxml files. Then, tshark [9] is used to parse and extract selected fields of raw
packets into human-readable texts.
12
2.1.3 Evidence Preservation and Post-investigation
For evidence preservation, it is not efficient enough to reduce data storage size by
only extracting packet headers. The network traffic size can be huge compared to limited
storage capabilities. For instance, in Section 2.1.4, 362,305 packets (i.e., about 311MB
trace data) were collected by randomly walking around a three-block neighborhood for 12
trips lasting around four hours. These data came from only one single monitoring point.
If tens or hundreds of monitoring points participate, it can easily get to 10-100 GB traces
in one day. For instance, Jigsaw [19] collected 96GB 802.11 raw traces in one day using
192 radio monitors.
I make statistic analysis on the packet types among the collected trace. As the result
in Section 2.1.4 presents, half of the packets are beacon frames sent from access points.
The beacon frames are used to claim the existence of WiFi networks and are not related to
the associated clients. Therefore, the beacon packets can be filtered out. Other extracted
information is stored into a database in order to support efficient queries and conduct post
forensics investigation.
The post-investigation component aims to answer the questions described in Sec-
tion 2.1.1, which can be summarized as device identification and device localization prob-
lems. As a preliminary work for the SMoWF system, MAC addresses are used as the
unique identifiers of mobile devices. For device localization, I study and evaluate two
localization algorithms: the weighted centroid algorithm [20] and the log-distance path
loss modeling method [40]. The weighted centroid algorithm simply estimates the loca-
tion of a target device as the weighted sum of all locations where it is observed, described
in Equation 2.1. p is the estimated location of the target device, pi
is the ith location co-
ordinate where the device is observed, and the weight wi
is proportional to si
the signal
13
strength received from the target device at ith location.
p =
X
i
wi
pi
, wi
/ si
,X
i
wi
= 1. (2.1)
The log-distance path loss modeling method describes that the average received
signal strength decreases logarithmically with distance, shown in Equation 2.2.
si
= s0 � 10� log di
+X�
, (2.2)
where si
is the received signal strength at Position i. di
is the physical distance from
the target device with coordinates (x, y) to the monitoring point with coordinates (xi
, yi
).
The path loss exponent � indicates the loss rate of the received signal strength. s0 is the
received signal strength at a distance of one meter. To compensate for the random shadow-
ing effects in signal propagation, X�
is added as a zero-mean Gaussian distributed random
variable with standard deviation �. Theoretically, four monitoring points or monitoring
locations are required to determine four parameters (s0, �, x, y) of the target device so as
to know the location coordinate (x, y). However, in practice, the target device is observed
at more than four locations, which results in a set of over-determined equations. There are
several solutions to this problem. For example, Krishna [21] proposed to find solutions to
minimize the least mean absolute error of the equations. In the SMoWF system, to sim-
plify the implementation, the trust-region-reflective optimization approach implemented
in Matlab is used to minimize the least square error, which is defined as follows:
J =
X
i
(si
� s0 + 10� log di
)
2 (2.3)
2.1.4 Experiments and System Prototype
Trace Storage To better understand the wireless traffic, I conduct several war-
14
Figure 2.2: Testbed and Walking Path.
walking experiments in a three-block metropolitan area of the upper-west side in NYC
shown in Figure 2.2, which is around 260m*260m. I use a MacBook Pro laptop with
internal airport wireless card and equipped with a BU353 GPS receiver as a moving mon-
itoring point. Kismet, installed on the MacBook Pro, is configured to hop on channels
to cover the entire spectrum and log all received wireless packets. I walked around the
neighborhood for 12 trips along the path of A-H or H-A in a week of April of 2011. We
collected 362,305 packets around 311MB traces.
Figure 2.3: Distribution of Observed 802.11 Packet Types.
The distribution of packet types of these traces is shown in Figure 2.3. As can be
observed, half of the packets are the beacon frames. The beacon frames are sent from
access points and do not provide information about any mobile device. They make little
15
Figure 2.4: Average Error Distances for Device Localization.
contribution to forensics investigation. Therefore, the beacon packets can be left out or
only preserved periodically. On the other hand, when a user is uploading or downloading
an image or a document or a video, there are bunches of data packets in the session.
Several of these packets in the same session are enough for the identification and tracing
back processes. Flow or session identifying techniques (e.g. sequence number) can be
used to identify the start and the end of a session, then only selected packets are preserved
in the repository as digital evidence.
Device Localization To evaluate the two localization algorithms described in Sec-
tion 2.1.3, I conduct the following experiments. Two smartphones are used as target
devices and I choose five test locations where they are able to connect to nearby WiFi
networks. At the same time, I collect their communication traces using one monitoring
point. To get the ground truth of the test locations, I run the GPS receiver for 2 minutes
statically at these locations and average the coordinate readings as the ground truth. For
each test location, I use the Haversine formula [3] to calculate the error distances between
the ground truth and estimated coordinates of the two devices through two algorithms.
As Figure 2.1.4 shows, the average error distances for two devices are around 31 and
40 meters using the weighted centroid method, 44 and 87 meters using the log-distance
path loss modeling approach. Hence, the SMoWF system applies the simple but effective
16
weighted centroid algorithm.
System Prototype As described before, a MacBook Pro laptop with built-in airport
wireless card and an external BU353 GPS receiver is used as a moving monitoring point.
It runs Kismet, which is configured to hop on channels to cover the entire spectrum.
After trace collection, raw data are parsed and selected packet fields are extracted
into our database. These fields contain frame date and time, MAC addresses including
source address, destination address, BSSID, transmitter address, and receiver address,
data length, channel frequency, received signal strength, type and subtype of 802.11, IP
source and destination addresses, source and destination port of TCP and UDP. The pay-
loads of these packets are dropped. Notice that a packet does not always include all the
fields. For example, 802.11 Acknowledgement and Clear-To-Send packets only contain
receiver MAC address and no other MAC address. One packet only has source or desti-
nation port either from TCP or UDP. Neither TCP nor UDP information can be obtained
from encryption packets. Furthermore, the location information of the monitoring points
are extracted into the database, which contains date, time, source MAC address, signal,
noise, latitude, longitude, altitude, fix, speed, heading. Indices are created on date fields
to speed up queries.
A simple web user interface is developed to help investigators to trace back their
interested device. As shown in Figure 2.5, an investigator can enter a date and time
period of an interesting event, and/or a MAC address, IP, BSSID. The SMoWF system
then pulls out the records that are related to the input information from the database,
shown in Figure 2.5(a). Meanwhile, it estimates the locations of the tracked device with
five-minute window using the weighted centroid algorithm, shown as Figure 2.5(b). It
also generates a KML file that visualize the geo locations of the device in Google Earth,
shown in Figure 2.5(c). In this way, the investigators can easily track the locations of their
interested device.
17
(a)
(b) (c)
Figure 2.5: System Prototype
2.1.5 Summary
In this work, a wireless forensic monitoring system (SMoWF) is outlined, which
aims to establish a forensic database based on wireless trace digests, and to answer the
following investigation questions: 1. Was a particular device involved in a given malicious
network activity? 2. Can this device be uniquely identified by the logs? 3. Where a
particular device was physically located during a given event. I conducted research and
experiments for the following tasks: 1. Design network trace logging method that records
the abstract of useful fields of network packets. Here only abstracts of packets are kept for
privacy protection purpose. 2. Design a query/search system that allows users to conduct
forensic analysis activities based on monitored traces; 3. Study localization algorithms
that can provide the location information of a given device.
18
2.2 Beagle: Physical Intruder Detection using WiFi networks
Home security is a big concern to everyone. Current home security systems rely on
placing various types of sensors to monitor different information about a house [27, 29].
For example, door and window sensors are installed to detect the open or close status
of doors and windows; motion sensors are placed in critical locations to detect human
movements; other monitoring sensors such as fire and smoke sensors, water sensors, tem-
perature sensors, etc. are deployed to ensure the house being kept in a good condition.
However, the installations and maintenance of such sets of sensors are very complex and
time-consuming. In addition, the cost of these home security systems is very high. Many
house owners will not buy and use these systems.
On the other hand, the use of smartphones has become an integral part of people’s
daily life. Smartphones with WiFi turned on actively send probe packets to search for
available WiFi networks. The probe packets contain MAC addresses of smartphones,
which can be an unique identification of a user. Some marketing companies and location
analytics firms such as Euclid Analytics [2] and Turnstyle Solutions [10] use MAC ad-
dresses to profile and track actual and potential shoppers and customers for stores. 100
recycling bins were installed in the city of London before the 2012 Olympics to monitor
the phones of passers-by using the same technique [8]. Similarly, Presence Orb [7] ap-
plies it to detect if buses or bars are full. Furthermore, when a person moves around the
area of wireless networks, the signal propagations of the wireless links would be affected,
which leads to the the variation of measured received signal strength (RSS).
In this work, I utilize two aforementioned facts, and design a low-cost home security
system called Beagle to tackle one of the most important tasks in home security - physical
intruder detection, which aims to detect unexpected physical intruders, prevent property
loss and keep homes safe. Unlike traditional home security systems relying on placing a
bunch of sensors inside the house, Beagle is based on existing WiFi infrastructures with
19
ubiquitous WiFi signals penetrating the walls and thus extending our senses. With all
the functionalities (e.g., monitoring, detection, alarming) built into one box, it is easy for
users to deploy and use Beagle.
2.2.1 Design Considerations
In this section, I discuss some typical scenarios in physical intruder detections for
home security, and then describe how we can deal with these scenarios with the design of
Beagle. The detailed design is described later.
Let us consider the six following typical scenarios in physical intruder detections:
1. The owner is in the house;
2. The owner has a guest in the house;
3. An unknown person passes by the house within a short time;
4. An unknown person is in the neighbor’s house;
5. An unknown person is around the house for a relatively long time.
6. An unknown person is in the house;
Intuitively, the first three cases are normal (i.e., no intruders), and happen a lot in our
daily life. For example, in the third case, neighbors or simply strangers may pass by our
house many times, but the time they stay is very short. In addition, we won’t care about
whether there are thieves in the neighbor’s house, described as the fourth case. Hence, the
Beagle system should not alarm the user for the first four scenarios. On the other hand,
when some unknown person is lingering around a house (i.e., Scenario 5), the owner of
the house probably will want to know that. When an intruder or a stranger is in the house
(i.e. Scenario 6), the owner definitely would like to be alarmed.
With the ubiquity of 802.11 based wireless networks, localization using wireless
signals is becoming a popular research topic. Many methods are proposed to achieve
high accuracy in positioning static or/and moving subjects. However, these methods only
focus on answering questions such as whether there is a subject and where it is. They
20
do not consider whether this subject is expected or not. Hence, these methods cannot be
directly applied in our scenarios.
The Beagle system is based on the residential WiFi networks to detect unexpected
physical intruders and stay armed for home security. Generally, Beagle automatically
learns and maintains a whitelist that contains users’ identifications (i.e., MAC addresses
of users’ smartphones), which is used to differentiate an authorized or known person from
a suspicious or unknown person. It further infers whether an unknown person is expected
from the presence of the house owners. Also, Beagle sets up RSSI thresholds to infer
whether the person (i.e. the target device) is in the area of interests, such as users’ house
or neighbors’ houses. Finally, Beagle combines the effects of moving people on RSSI to
detect the intruder entering into the house.
2.2.2 System Overview
Based on the aforementioned design considerations, the Beagle system should be
able to a) differentiate between non-intruders (e.g., friends, people pass by) and the actual
intruders, b) detect the actual physical intruders; c) send alerts to the user instantly in
b)’s condition. To accomplish these tasks, the Beagle system is designed with three main
components: Packet Capture and Information Extraction, Intruder Detection, and Alert
System.
In Packet Capture and Information Extraction, Beagle monitors and collects WiFi
frames around a house by placing a WiFi monitoring point at an important location in the
house (the places that intruders will enter into, e.g., the living room, the hallway). Mean-
while, it extracts and analyzes certain significant fields such as device identifications (i.e.,
MAC addresses) and received signal strengths for intruder detection. In Intruder Detec-
tion component, Beagle checks and models some extracted fields, and triggers the alert
system when any intruder is detected. In Alert System, house owners will be informed
with instant message (e.g., gtalk in current implementation) upon the detection of any
21
intruder.
The Intruder Detection part is the key component of the Beagle system. As men-
tioned before, two facts are utilized to design and implement this component. Firstly,
by monitoring the probe packets sent from nearby smartphones, any suspicious MAC
address coming from a potential intruder can be detected, which derives a MAC-based
Detection approach. Secondly, it can be detected the intruder’s activities in the house
through the resulted abnormal RSS variances in WiFi networks, which derives the design
of an RSSI-variance-based Detection approach.
2.2.3 MAC-based Detection
The MAC-based detection method aims to detect intruders carrying WiFi devices.
As we described, a WiFi-enabled device actively broadcasts probe request to search avail-
able networks. The probe request packet contains the MAC addresses of the device, which
serves as an identification of the owner of the device. By looking at its received signal
strength indicator (RSSI), it is possible to infer the approximate distance of the target
device. To design the MAC-based detection method, two measurement-based studies are
conducted to answer the following critical questions:
• How can we capture the probe packets if any?
• How can we tell the target person is inside or outside?
a) Study of smartphones probing behaviors
Figure 2.6: Active Scanning Procedure
22
Figure 2.6 demonstrates a simplified version of active scanning procedure through
probe request frames in 802.11 protocols. Generally, upon the receipt of scanning request
from operating system, a mobile device starts a probe cycle and scans its supported chan-
nels by sending out probe requests to each channel one by one. During this probe cycle, it
waits for some time (i.e., probe delay) between sending any two probes. If any response is
received, it will process the probe responses. Otherwise, it waits for the next scan request.
In this procedure, many parameters (e.g., the order of the scanned channels, the
number of probes sent on each channel, and the probe delay) that would affect the packet
capture vary with device models. To infer these parameters, I conduct a probing behavior
study on five popular smartphone models on market, including iPhone 4S, iPhone 5 and
5S, Samsung Galaxy S5 and HTC One (M7). In the study, a sniffer is set up using a
laptop, its built-in WiFi network card and an external WiFi network card. I configure both
cards in monitor mode, with one listening on Channel 1 and the other on Channel 11.
The smartphones are set in asleep status, but the installed apps are still running, and WiFi
networks are turned on. The collection of the probe request frames lasts around one hour.
After the collection, the parameters are inferred based on the fields of probe frames, such
as timestamps, sequence numbers, and channels. For example, the probe delay time can
be derived by calculating the difference between the timestamps of two adjacent probes
in the same probe cycle. The number of the probes per channel can be calculated by
counting the number of probes observed on the same channel in the same probe cycle.
Table 2.1 lists the detailed results. As can be observed, all of the studied mobile
devices support channels of 802.11b/g/n. They send one or two probes on each channel.
The probe delay between two adjacent probes is around 20ms or 40ms. Hence, to ensure
the capture of the probes if any, we need to stay on any channel of b/g/n (i.e., Channel
1-11) for more than two seconds. To make things simpler, the sniffer is fixed with one
monitoring channel in the implementation.
23
Table 2.1: Study of Popular Smartphones’ Probing Behaviors.Device Channels #Channels #Probes Per Channel Probe Delay (ms) One Probe Cycle (ms) Observed Waiting Time
iPhone 4S b/g/n 11 2 20 440 2s-43 minsiPhone 5/5S a/b/g/n 20 1 40 800 2s-10 mins
Samsung Galaxy S5 a/b/g/n/ac 21 2 40 1620 2s-3 minsHTC One (M7) a/b/g/n/ac 21 2 40 1620 4s-30 mins
(a) iPhone 5 (b) Samsung Galaxy 5
Figure 2.7: Regular Used Smartphones’ Waiting Times in Worst Case Scenarios.
As presented in Table 2.1, the waiting time between two probe cycles varies with
different devices. Generally, the newer model is observed having less waiting time. I fur-
ther measure the waiting times of regularly used devices in worst case scenarios. Specif-
ically, I check how often two most popular models, iPhone 5 and Samsung Galaxy S5,
send probes with all installed apps being closed during night time (i.e., no user interac-
tion). Figure 2.7 shows the CDF plots of their observed waiting time intervals. As we can
see, for iPhone5, 95% of waiting times are less than 10 minutes and 85% of waiting times
are less than 5 minutes; for Samsung Galaxy S5, 95% of waiting times are less than 5
minutes. This result indicates that if an intruder stays longer than 5 minutes, the detection
probability is very high.
b) Study of signal strength from probe frames
Wireless signal strengths have been used to learn the locations of the mobile de-
vices. For example, the mobile device localizes itself by comparing the received signal
strengths of APs against the pre-calibrated wireless map [13, 37]. My task is different
24
from those localization problems in two aspects: one is that I aim to track the target de-
vice, rather than that the target device localizes itself; the other is that only one sniffer or
monitoring point is used, so the traditional device localization methods cannot be applied
in the Beagle system.
To answer the second question, I propose to use the received signal strengths of
probe frames to infer the target device’s coarse location. A measurement study is con-
ducted to explore how the RSSIs of the probes of the smartphones change in different
positions around a house. In the study, a sniffer is placed in the living room, and a few
important locations are explored with two smartphone models, iPhone 5 and Samsung
Galaxy S5. At each location, I put the smartphone in the pocket and collect about 100
probes when I face the sniffer and turn my back to the sniffer respectively. As described
before, smartphones would wait for seconds to minutes to start an active scanning, which
would make the measurement study time-consuming. It is noticed that when the smart-
phone is waked up from the sleep status, it starts an active scan for available networks
immediately. Hence, the smartphones are made to switch between awake and asleep con-
stantly to speed up the collection during the measurement.
Figure 2.8: Study of signal strength from probe message.
25
Figure 2.8 shows the CDF plot of the RSSIs of Samsung Galaxy S5. The line colors
denote different places: blue means the smartphone is far away from the sniffer (around 10
meters); magenta denotes it is outside the door (around 4 meters away from the sniffer);
and the black lines are measured when the smartphone is at the door but inside home
(around 4 meters away from the sniffer). The line types denote whether the human faces
or turns back to the sniffer. As can be observed, the RSSIs significantly decrease due to
the presence of human body. To infer the approximate distance of the target device, with
85% confidence, -75 is selected as a low threshold to indicate the device is far away, and
-62 is picked as a high threshold to indicate the device is very close to the sniffer. Similar
results are observed with iPhone 5.
c) The Proposed Algorithm
Figure 2.9: MAC-based Detection: Detecting intruders carrying WiFi devices.
Figure 2.9 demonstrates a general idea of how the MAC-based detection method
works. A sniffer with the capability of monitoring WiFi networks is set up in an important
spot of the house (e.g., the living room). The sniffer keeps capturing probe request frames.
If the frame is from some known MAC, or unknown MAC address but in presence with
26
known MAC address, the detected device comes from the home owners’ or their guests.
If the frame is from short-lived MAC address with very weak RSSI, the detected person
might be innocent. However, if an unknown MAC address with very strong RSSI is
observed and if no ”familiar” MAC address is accompanying it, it probably comes from
an intruder.
Algorithm 1: MAC-based DetectionInput : PR = htimestamp,MAC,RSSIi, RSSI Low,RSSI High
1 if PR.MAC not in Whitelist then/
*
user not at home or current time is at night
time or user asleep at daytime
*
/
2 if ! isUserAtHome() or isNight() or noMovement() then3 if PR.RSSI > RSSI High then4 alert();5 end6 else7 if PR.RSSI < RSSI Low then8 return;9 end
10 else11 if PR.MAC in Probelist && The time interval from last probe
from PR.MAC> 5 mins then12 alert();13 end14 else15 Save PR in Probelist;16 end17 end18 end19 end20 end
Accordingly, Algorithm 1 describes the implementation of the MAC-based detec-
tion approach. At the beginning, the MAC addresses of the house owners’ smartphones
are input, which is used to infer whether the house owners are at home (isUserAtHome()).
A night time period is set up to determine the activation of the Beagle system (isNight()).
During the daytime, the detection is activated only when there is no movement in the
27
house (noMovement(), implemented based the RSSI-variance-based method). A whitelist
(Whitelist) is used to store trusted devices, and a probe list (Probelist) is used to main-
tain the MAC addresses of probes in recent half an hour. Besides, the observed probe
request is represented as a triple PR = htimestamp,MAC,RSSIi. As Algorithm 1 de-
scribes, the Beagle system stays armed in the circumstances that a) the user is not home;
b) during night time; c) it is daytime but the user is asleep. In the armed status, if a
device that does not belong to the whitelist is detected with an RSSI exceeding the high
RSSI threshold (RSSI High), or it stays longer than 5 minutes with relatively high RSSI
(RSSI Low PR.RSSI RSSI High), the Beagle system would send alert to the
house owner.
2.2.4 RSSI-variance-based Detection
As Figure 2.10 demonstrates, when an object moves around the area covered by
wireless networks, the transmissions of both line-of-sight links and non-line-of-sight links
would be affected. Both would lead to the the variations of measured received signal
strength (RSS) on the links. Based on this fact, an RSSI-variance-based detection method
is proposed to detect physical intruders, to compensate for the limitation of the MAC-
based detection approach – an intruder cannot be detected if he or she does not carry any
WiFi enabled device.
Figure 2.10: RSSI-variance-based Detection: Detecting intruders without carryingWiFi devices.
28
Although plenty of research work propose intelligent algorithms to track and local-
ize a moving object based on the variance of RSSIs it makes [48,51], most of them deploy
a network of sensor nodes in the monitoring area and assume an ideal environment (i.e.,
no other object except the target one is in the area). It’s hard to meet the above require-
ments in home environment. Instead, access points in home WiFi networks broadcast
beacon frames at a high frequency (usually one beacon frame every 100ms) to announce
the presence of WiFi networks, which can be utilized to detect an intruder.
a) Effects of moving people on RSSI
As a first step, I study the impact of moving people on the RSSIs of beacon frames.
The experiments are set up as follows: a sniffer is placed in the living room, one AP (AP1)
is in the same room and the other AP (AP2) is in a different room (e.g., a bedroom). Both
APs and the sniffer are set working on the same channel. The sniffer is started and the
house is left empty for around 10 minutes. Then, I enter into the house, and walk around
the living room and the room with AP2 placed in to simulate an intruder’s behavior.
(a) AP 1 (b) AP 2
Figure 2.11: Effects of moving people on RSSI of Beacons (x axis is time in second,y axis is RSSI).
Figure 2.11 demonstrates the measured RSSIs of beacons from AP1 and AP2. It
can be observed that the RSSIs from both APs are noisy. Due to multi-path effects, two
29
sets of dominant RSSIs (around -50 and -60) are observed from AP2. In addition, it is
observed that with a moving person, the RSSIs of both APs change significantly around
the 10th minute.
b) The Proposed Method
Based on the previous preliminary results, I plan to propose an effective method to
detect an intruder.
2.2.5 Summary
In this work, I propose a low-cost home security system based on the existing WiFi
infrastructures to detect unexpected physical intruders. Specifically, I explore two main
methods - the MAC-based detection approach and the RSSI-variance-based detection ap-
proach to detect intruders with/without WiFi devices. Preliminary results demonstrate
both methods would be feasible and effective.
CHAPTER 3
Pattern Discovery and Missing Value Recovery in Data Analysis
3.1 Understanding the Dynamic Landscape of Smartphone App Us-
age
There has been many prior studies focusing on the understanding of smartphone
usage, which can be broadly divided into two categories based on the data collection
methodology. The UE (User Equipment) based collection requires mobile users’ consent
and participation. For example, [22] analyzed data from 77 participants in European
countries on a specific model of Nokia smartphone for 9 months; [24] studied traces from
43 mobile users with up to 5 months usage and broadened later in [25] to 255 users for up
to 28 weeks. The UE based approach is fundamentally limited in scale and the result can
be highly influenced by the participant samples — e.g., 16 out of the 33 Android users in
[25] being high school students may not represent the real distribution of user population.
On the other hand, the in-network measurement approach collects usage data within the
data communication networks. For example, [46] studied a one-week network trace in a
3G mobile network in a large metropolitan area; and [43] and [50] analyzed week-long
nation-wide traffic traces both taken from a major 3G cellular service provider in USA;
and [34] studied the usage behavior of mobile handheld devices connected through home
WiFi over residential DSL lines using 4 one-day traces for 20K subscribers spanning
11 months. Maintaining a large scale measurement and traffic processing system in any
operational network is tremendously challenging. Hence in-network measurement study
is often limited in duration, making it difficult to track the evolution of the smartphone
apps’ usage over time.
In this work, I analyze smartphone usage traces monitored in a major cellular
30
31
(3G/LTE) service provider in USA and in a regional residential high speed DSL ISP
network for over two years. The unparalleled scale of the measurement data provides
us the opportunity to understand the long term popularity changes of smartphone apps’
usage, the trending and seasonality of the usage patterns of individual apps, as well as
the capability to contrast how people are using their smartphones in the comfort of their
home versus while being mobile over the cellular network. Overall, my analysis provides
valuable insights for social scientists, network operators and content providers.
Due to the company policy, I cannot release the details of the analysis now.
3.2 Fast Mining of a Network of Coevolving Time Series
3.2.1 Introduction
Coevolving multiple time series are ubiquitous and naturally appear in a variety
of high-impact applications, ranging from environmental monitoring, computer network
traffic monitoring, motion capture, to physiological signaling in health care and many
more. In many scenarios, the multiple time series data is often accompanied by some con-
textual information in the form of networks. For example, in a smart building equipped
with various sensors, each sensor generates a time series by measuring a specific type
of room condition (e.g., temperature, humidity, etc.) over time. All these sensors or
time series are connected with each other by the underlying sensor network; in epilepsy
study, the various sensor measures collected from different brain regions of a patient are
essentially inter-connected by his or her brain graph; in scholarly article citation analy-
sis, the citations of different journals or conferences are correlated particularly for those
from similar domains. We refer to such multiple time series, together with its embedded
network as a network of coevolving time series.
Figure 3.1 presents an illustrative example of such time series. Figure 3.1(a) is a
simplified version of sensor deployment at an office floor. The weights on the labeled
32
(a) Sensor network
0 1000 2000 3000 4000 5000 600014
16
18
20
22
24
26
time
tem
pe
ratu
re
(b) Time series
Figure 3.1: An illustrative example of a network of time series. (a) A simplifiedsensor network. (b) Measured temperature time series, each of which correspondsto a node with the same color of the sensor network in (a). The multiple time seriesin (b) are inter-connected with each other by its embedded network in (a). (Bestviewed in color).
edges indicate the closeness and connectivity between two sensors. Figure 3.1(b) is two-
day temperature time series measured from (a)’s sensor network. The color is used to
differentiate different sensors and their corresponding time series. As can be seen, the
purple line and the cyan line are very close to each other, and the weight between their
sensors is relative high.
In order to unveil the underlying patterns of a network of coevolving time series, in
particular to fill in the missing values of one or more input time series, I adapt the concept
of collaborative filtering that multiple coevolving time series and/or the contextual net-
work are determined by a few latent factors, and propose DCMF, a dynamic contextual
matrix factorization algorithm to simultaneously find the common latent factors in both
the input time series and its embedded network.
In the last part of my proposal, I introduce a novel setting on mining a network of
multiple coevolving time series and propose a dynamic contextual matrix factorization
algorithm to find the missing values in a network of time series. The preliminary ex-
perimental result on a real sensor network dataset demonstrates the effectiveness of the
33
Table 3.1: Symbols and DefinitionsSymbol Definition and DescriptionA,... matrices (bold upper case)A
ij
the element at the ith row and jth column of matrix AA
j
the jth column of matrix AA(i, :) the ith row of matrix AA0 transpose of matrix Aa,b, ... column vectors (bold lower case)� element-wise multiplicationX observed time series (x1, ..,xT
)W indicator matrixS contextual matrixZ time series latent matrix (z1, ..., zT )B transition matrixU object latent matrixY contextual latent matrixn number of objects in X and ST length of time seriesl dimension of latent variables
proposed algorithm, which outperforms its competitors, especially when there are lots of
missing values.
3.2.2 Problem Definition
Table 3.1 lists the main symbols used throughout this work. Besides the standard
notations, a matrix Xn⇥T is used to denote observed time series with n dimensions and
T time steps, where Xit
denotes the data from ith object at time t. An indicator matrix
Wn⇥T is used to indicate data is observed or missing in X, i.e., Wit
= 0 when Xit
is
missing, otherwise Wit
= 1. A contextual matrix Sn⇥n is used to represent the embedded
network.
With the above notations, a Network of Time Series (NoT) can be formally defined
as follows:
Definition 3.2.1 A Network of Time Series (NoT). Given an n dimensional partially ob-
served time series with T time steps X, an indicator matrix W, an n⇥n contextual matrix
34
S representing the network behind X, and a one-to-one mapping function ⇣ , which maps
each dimension of the time series X to a node in S, a Network of Time Series is defined
as the quadruplet R = hX,W,S, ⇣i.
The problem of time series missing value recovery can be defined as follows:
Problem 3.2.1 NoT Missing Value Recovery.
Given: a network of time series R = hX,W,S, ⇣i;
Recover: its missing parts indicated by the indicator matrix W.
3.2.3 Dynamic Contextual Matrix Factorization
In this section, I present the solution for NoT Missing Value Recovery problem
(Problem 3.2.1). I start with the discussions of some baseline solutions and their limi-
tations, followed by our formulation to encode the temporal smoothness and the corre-
sponding algorithm.
3.2.3.1 Preliminaries and Challenges
I would like to point out that Problem 3.2.1 bears some similarity to the classic
collaborative filtering problem. To be specific, if the rows (time series) and columns
(time steps) of X are treated as users and items, respectively; the observed entries in X
as the ‘ratings’ between the corresponding user and item; and the contextual matrix S as
the ‘social network’ among different users, Problem 3.2.1 can be viewed as inferring the
missing or unknown ratings between the users (rows of X) and the items (columns of X).
Having this in mind, it seems that many collaborative filtering methods can be used to
solve Problem 3.2.1. For example, the matrix factorization based collaborative filtering
(Baseline Solution 3.2.1) [15, 30] can be used, if I only use the partially observed time
series matrix X. If I further incorporate the contextual matrix S, it is essentially a social
recommendation problem (Baseline Solution 3.2.2) [28, 32].
35
Baseline Solution 3.2.1 Collaborative Filtering (via Matrix Factorization)
Given: a rank size l, and a rating matrix Xn⇥m with missing values indicated by W,
where Xij
denotes user i’s rating on item j and Xij
= 0 if there is no rating,
Find: user latent factor Un⇥l and item latent factor Zl⇥m such that X ⇡ W � (UZ).
The rating predictions are estimated as the non-zero values of ˜Xn⇥m, where ˜X = (1 �
W)� (UZ).
Baseline Solution 3.2.2 Social Collaborative Filtering (via joint matrix factorization).
Given: a rank size l, and a rating matrix Xn⇥m with missing values indicated by W,
where Xij
denotes user i’s rating on item j and Xij
= 0 if there is no rating; a social
network matrix Sn⇥n, where Sij
indicates how much user i trust or is influenced by user
j,
Find: a user latent factor Un⇥l, an item latent factor Zl⇥m and a social latent factor
Vl⇥n such that X ⇡ W � (UZ) and S ⇡ UV. The rating predictions are estimated as
the non-zero values of ˜Xn⇥m, where ˜X = (1�W)� (UZ).
Baseline Solution 3.2.1 and 3.2.2 can be represented by the graphical models shown in
Figure 3.2(a) and (b). Both solutions are to find the similarities between users indicated
by the user latent factor U. Baseline Solution 3.2.1 only uses the rating matrix, while
Baseline Solution 3.2.2 fuses the rating matrix with the social matrix and finds the user
latent factor shared by both matrices. Such similarities between time series also exist in
many multiple time series applications. For example, in time series obtained from a sensor
network, nearby sensors may collect similar measurements; in water quality dataset, the
chlorine concentration levels of two connected pipes can be very close.
However, in both Baseline Solution 3.2.1 and 3.2.2, the temporal smoothness, a
unique characteristic of time series data, is ignored. For instance, the sensor data (e.g.,
temperature/light/humidity) or the chlorine concentration levels collected in two consec-
utive time steps might not change a lot. Ignoring such temporal smoothness is likely
36
to lead to sub-optimal or even infeasible solutions of Problem 3.2.1. For example, both
Baseline Solution 3.2.1 and 3.2.2 would fail if a certain number of columns of X are
missing entirely.
3.2.3.2 Proposed Formulation
I propose a novel algorithm, Dynamic Contextual Matrix Factorization (DCMF),
based on joint matrix factorization and linear dynamic systems, as illustrated in Fig-
ure 3.2(c).
Z
X
U
(a) Matrix Factorization
Z
X
U
S
V
(b) Joint Matrix Factorization
z1 z2 zt�1 zt zt+1
x1 x2 xt�1 xt xt+1
U
SV
(c) Dynamic Contextual Matrix Factorization
Figure 3.2: Comparison between the baseline solutions (a and b) and the proposedformulation (c).
To model the temporal smoothness, the time series X is re-written as column vec-
tors (x1,x2, ...,xT
), where each xt
represents n dimensional data at time t; and the cor-
responding hidden factors Z re-written as (z1, z2, ..., zT ). Then zt is defined to be linear
37
to zt�1 with a transition process B and multivariate transition noise !t
⇠ N (0,⇤). The
state for the latent variable at first time step z1 is defined by z0 with multivariate Gaussian
noise !0 ⇠ N (0, 0):
z1 = z0 + !0, zt
= Bzt�1 + !
t
, (3.1)
Similarly to the joint matrix factorization model, xt
is defined to be linear to zt
through the latent factor U with multivariate Gaussian noise ✏t
⇠ N (0,⌃), sj
to be linear
to vj
through the latent factor U with multivariate Gaussian noise ⌧j
⇠ N (0,⌅):
xt
= Wt
� (Uzt
+ ✏t
), (3.2)
sj
= Uvj
+ ⌧j
. (3.3)
Zero-mean spherical Gaussian priors are placed on V under the assumption that each
entry of S is scaled to [0, 1]:
p(V|�) =nY
j=1
N (vj
|0,�). (3.4)
In order to accommodate the proposed model in sparse time series datasets, the
multivariate Gaussian noises are simplified by assuming that the transition noises and
observation noises are independent and identically distributed (i.i.d.), and the covariances
of latent variable {vj
} are i.i.d., i.e. vj
⇠ N (0, �2V
), and
!t
⇠ N (0, �2Z
), ✏t
⇠ N (0, �2X
), ⌧j
⇠ N (0, �2S
),
where t = 1, ..., T and j = 1, ..., n.
With Eq. (3.1)-(3.4) and the parameters ✓ = {B, z0, 0, �2Z
, �2X
, �2S
, �2V
}, the goal
38
here is to maximize the joint distribution described as follows:
argmax
Z,U,V,✓
p(Z,X,V,S,U) = p(z1)TY
t=2
p(zt
|zt�1)
| {z }temporal smoothness
TY
t=1
p(xt
|zt
,U)
| {z }observed time series
⇥nY
j=1
p(vj
)
nY
j=1
p(sj
|vj
,U)
| {z }contextual network
, (3.5)
where the model parameters is omitted in the equation.
3.2.3.3 Proposed Algorithm
In Eq. (3.5), U and zt
jointly determine xt
, and U and vj
jointly determine S. Due
to such coupling, it is difficult to find its global optimal solution. Hence, we aim to find
its local optimal instead, by grouping {zt
,vj
} and updating U and {zt
,vj
} alternatively.
Regarding U as a parameter of the model, the solution of the proposed model contains
parameters ✓ = {U,B, z0, 0, �2Z
, �2X
, �2S
, �2V
} and the latent factors {zt}, {vj
}.
To be specific, the expectation-maximization (E-M) algorithm is adopted to find a
local optimal solution of the model as follows. The algorithm first randomly sets the initial
values of the parameters ✓old, then iteratively performs E-M steps until the convergence
criterion is satisfied. In the E step, it infers the posteriors of the latent variables {zt}, {vj
}
based on the old parameters ✓old. In the M step, it updates the new parameters ✓new that
maximize the expectation of the distribution likelihood.
E Step: updating the latent factors
Given U, {zt
,xt
} are independent with {vj
, sj
}. The posteriors of {zt
} and {vj
}
can be estimated independently w.r.t. U (steps 3-12 in Algorithm 2).
Eq. (3.1) and (3.2) are very similar to linear dynamic systems, except that only
observed entries in xt
contribute to Eq. (3.5). Let ot
denote the indices of the observed
entries of xt
, and x⇤t
denote the observed entries of xt
, which is a compressed version of
39
xt
, Ht
is used to represent the corresponding compressed version of U. Formally, ot
, x⇤t
and Ht
can be formulated as:
ot
= {i|Wit
> 0, i = 1, ..., n},
x⇤t
= xt
(ot
, :), Ht
= U(ot
, :) (3.6)
x⇤t
= Ht
zt
+ ✏t
Then, the forward-backward algorithm [16] or Kalman Filter and RTS smoother [41, 44]
is applied to find the posteriors of the latent variables {zt
}. Define
p(zt
|x1, ...,xt
) = N (zt
|µt
, t
),
p(zt
|{x1, ...,xT
) = N (zt
|ˆµt
, ˆ t
). (3.7)
By using the forwarding algorithm, the following equations can be derived:
µt
= Bµt�1 +K
t
(x⇤t
�Ht
Bµt�1)
t
= (I�Kt
Ht
)Pt�1
Pt�1 = B
t�1B0+ �2
Z
I
Kt
= Pt�1H
0t
(Ht
Pt�1H
0t
+ �2X
I)�1 (3.8)
with the initial status:
µ1 = z0 +K1(x⇤1 �H1z0)
1 = (I�K1H1) 0
K1 = 0H01(H1 0H
01 + �2
X
I)�1 (3.9)
40
Applying the backward algorithm, we have:
ˆµt
= µt
+ Jt
(
ˆµt+1 �Bµ
t
)
ˆ t
= t
+ Jt
(
ˆ t+1 �P
t
)J0t
,
Jt
= t
B0(P
t
)
�1 (3.10)
E[zt
] =
ˆµt
E[zt
z0t�1] =
ˆ t
J0t�1 + ˆµ
t
ˆµ0t�1
E[zt
z0t
] =
ˆ t
+
ˆµt
ˆµ0t
(3.11)
Then, Bayes’ theorem is applied on Eq. (3.3) and (3.4) to obtain the posteriors
p(vj
|sj
) = N (vj
|⌫j
,⌥):
⌫j
= M�1U0sj
, ⌥ = �2S
M�1
M = U0U+ ��2V
�2S
I
E[vj
] = ⌫j
, E[vj
v0j
] = ⌥+ ⌫j
⌫ 0j
. (3.12)
M Step: updating the model parameters
In the M step, the old parameters are updated to maximize the expectation of the
log likelihood (step 13 in Algorithm 2):
✓new = argmax
✓
Q(✓, ✓old)
Q(✓, ✓old) = EZ,V|✓old [ln p(Z,X,V,S|✓)], (3.13)
where p(Z,X,V,S|✓) is defined in Eq. (3.5).
To obtain ✓new, the probabilities in Eq. (3.5) are substituted with the defined dis-
tributions, and then the derivatives of Q(✓, ✓old) are calculated w.r.t. each parameter, and
41
finally a new parameter estimation is obtained by setting each derivative to zero. The
parameters are updated as follows:
znew0 = E[z1]
new
0 = E[z1z01]� E[z1]E[z01]
Bnew
=
TX
t=2
E[zt
z0t�1]
! TX
t=2
E[zt�1z
0t�1]
!�1
(�2V
)
new
=
1
nl
nX
j=1
tr
�E[v
j
v0j
]
�
(�2Z
)
new
=
1
(T � 1)ltr
TX
t=2
E[zt
z0t
]
�Bnew
TX
t=2
E[zt�1z
0t
]�TX
t=2
E[zt
z0t�1](B
new
)
0
+Bnew
TX
t=2
E[zt�1z
0t�1](B
new
)
0
!(3.14)
For U, each row U(i, :) is updated individually:
U(i, :)new = A1A�12
A1 = ���2S
nX
j=1
Sij
E[v0j
] + (1� �)��2X
TX
t=1
Wij
Xit
E[z0t
]
A2 = ���2S
nX
j=1
E[vj
v0j
] + (1� �)��2X
TX
t=1
Wit
E[zt
z0t
], (3.15)
where � is to trade off the contributions of the contextual matrix and time series. With
42
Unew, �2S
and �2X
are then updated as follows:
(�2S
)
new
=
1
n2
"nX
j=1
�s0j
sj
� 2s0j
(UnewE[vj
])
�
+tr
Unew
(
nX
j=1
E[vj
v0j
])(Unew
)
0
!#
(�2X
)
new
=
1
PT
t=1
Pn
i=1 Wit
TX
t=1
tr (Hnew
t
E[zt
z0t
](Hnew
t
)
0)
+
TX
t=1
((x⇤t
)
0x⇤t
� 2(x⇤t
)
0(Hnew
t
E[zt
]))
!, (3.16)
whereP
T
t=1
Pn
i=1 Wit
is the number of the observed entries in X and Hnew
t
is derived
from Unew based on Eq. (3.6).
The Overall Algorithm
Putting everything together, the overall algorithm (DCMF) to solve Eq. (3.5) is
summarized in Algorithm 2. Given a network of time series R, the dimension of latent
factors l, and the weight of contextual information �, the algorithm aims to approximate
the values of the latent factors U,Z,V. It first randomly initializes the model parameters
✓ (step 1); and then iteratively infers the expectations of Z, V (step 3-11), and updates
the model parameters ✓ including U (step 13) until converge. The missing values can be
inferred from the reconstructed time series ˆX (step 16).
3.2.4 Preliminary Result
3.2.4.1 Experimental Setup
A. Baseline Methods. The proposed method is compared with the following state-of-the-
art algorithms: Probabilistic Matrix Factorization (PMF) [35], Social Recommendation
(SoRec) [32], SmoothPMF, SmoothSoRec, and DynaMMo [31]. I also compare DMF,
which is a simplified version of DCMF when � = 0.
B. Motes Dataset. Motes Dataset is used to conduct the experiments. The Motes dataset
43
Algorithm 2: Dynamic Contextual Matrix Factorization (DCMF)Input : a network of time series R = hX,W,S, ⇣i,
dimension of latent factors l,weight of contextual information �
Output: U,Z,V1 Initialize ✓ = {U, B, z0, 0, �
2Z
, �2X
, �2S
, �2V
};2 repeat3 for t=1:T do4 Construct H
t
based on Eq. (3.6);5 Estimate µ
t
, t
using Eq. (3.8),(3.9)6 end7 for t=T:1 do8 Estimate ˆµ
t
, ˆ t
, E[zt
z0t�1],E[ztz0t] using Eq. (3.10), (3.11)
9 end10 for j=1:n do11 Estimate E[v
j
],E[vj
v0j
] using Eq. (3.12)12 end13 Update ✓ by Eq. (3.14) - (3.16)14 until converge;15 Set Z = (
ˆµ1, ..., ˆµT
), V = (⌫1, ...,⌫T
)
16 Reconstructed ˆX = UZ
consists of temperature measurement from the sensors deployed at the Intel Berkeley
Research Lab [5] over a month. The dataset also contains the sensors’ locations (x-y
coordinates) and averaged connectivity probabilities over time (i.e. the probability of a
message from a sensor successfully reaching another).
Let dij
and cij
be the two-dimensional Euclidean distance and the connectivity prob-
ability between Sensor i and Sensor j, respectively, each entry of the contextual matrix S
is defined as follows:
Sij
= ↵ · (1� dij
/max
i,j
(dij
)) + (1� ↵) · cij
,
where ↵ controls the contributions of the two parts, and is set to be 0.5 in the experiments.
C. Evaluation Metrics. To evaluate the effectiveness, the root mean square error (RMSE)
44
is used, which is defined as follows:
RMSE =
vuutP
i,t
(1�Wit
)(Xit
� ˆXit
)
2
Pi,t
(1�Wit
)
,
where Xit
is the observed value and ˆXit
is the corresponding reconstructed value, and W
is the indicator matrix. Wit
= 1 indicates that Xit
is an observed value (or is selected in
the training set), and Wit
= 0 indicates that Xit
is a missing value (or is selected in the
test set).
3.2.4.2 Result
To evaluate all the algorithms, I incrementally generate training and test sets of the
Motes dataset with an increasing amount of test data (1%, 10%, ..., 99.9%) within time
series. For example, to increase the size of test set from 1% to 10%, I randomly move 9%
of the observed data from the training set to the test set. In this way, the subsequent test
set always contains the test data of the previous one.
Figure 3.3 shows the results on the Motes dataset. As can be seen, the proposed
DCMF algorithm consistently outperforms the other methods, especially when the per-
centage of the missing values is large.
3.2.5 Summary
In this section, a novel setting on mining a network of multiple coevolving time
series is introduced and a Dynamic Contextual Matrix Factorization (DCMF) algorithm is
proposed to infer the missing values from the input time series. The proposed algorithm is
comprehensive, simultaneously capturing (a) the correlation among different time series;
(b) the contextual information encoded by the contextual network; and (c) the temporal
smoothness along the time dimension.
45
10 70 95 99.9 99.950
5
10
15
20
25
missing value%
RM
SE
DCMFDMFdynaMMoSmoothSoRecSmoothPMFSoRecPMF
Figure 3.3: Evaluation on missing value recovery in Motes Dataset (n=54,T=14400).Lower is better. DCMF outperforms all of the competitors.
CHAPTER 4
Proposed Work and Timeline
4.1 Outline of Proposed Work
My proposal focus on network monitoring and data analysis in wireless networks.
In the first part, I explore security monitoring in 802.11 based networks. I first identify
the challenges of security monitoring for network forensics, especially in the metropolitan
area and propose a distributed security monitoring system for wireless network forensics
- SMoWF (Chapter 2.1).
Based on the ubiquity of WiFi networks and mobile devices, I then propose a phys-
ical intruder detection system- Beagle (Chapter 2.2). It is low-cost and is easy to de-
ploy comparing to current expensive and complex home security systems. In Beagle,
the proposed MAC-based detection algorithm works for the intruders carrying with WiFi
devices. I aim to design and implement the RSSI-variance-based method to detect the
intruders without carrying WiFi devices or with their WiFi networks turned off. I plan
to study RSSI distributions in different time windows using histograms, and characterize
abnormal distributions to infer the presence of an intruder. In addition, I will leverage
neighbors’ APs in the algorithm to extend the sensing capability.
In the second part, I look into the analysis of the big network data, resulting from the
fast advances of wireless network technologies. I first focus on the study of smartphone
apps’ usage patterns, and analyze long-term anonymized traffic data for app usages. The
analysis provides valuable insights on the usage patterns and trend changes of a large
variety of smartphone apps (Chapter 3.1).
Then, I propose a Dynamic Contextual Matrix Factorization algorithm (DCMF) to
model a network of coevolving time series, which appear in many high-impact applica-
46
47
tions, especially in wireless sensor networks (Chapter 3.2). There are two tunable pa-
rameters in the proposed algorithm. Naturally, a next step is to automatically learn these
two parameters. Furthermore, except for missing value recovery, prediction and anomaly
detection are two important tasks in the applications of wireless sensor networks. I aim
to adapt the DCMF algorithm to accomplish these two tasks. In addition, it is observed
that lag correlations frequently exist in coevolving time series, which makes it necessary
to extend the DCMF algorithm to capture lag effects. Finally, it is very common that
network topology changes very often in wireless sensor networks. I would like to extend
the DCMF algorithm to make it work for the time series data from dynamic networks.
4.2 Timeline for Completion
The work plan to complete the dissertation is as follows:
1. November 2014 - December 2014, Design and Implementation of the RSSI-variance-
based detection method for the Beagle system.
2. January 2015 - April 2015,
(a) Analyze the efficiency of the DCMF algorithm and evaluate the DCMF algo-
rithm in terms of missing value recovery on multiple real datasets.
(b) Adapt the DCMF algorithm for prediction and complete the experimental
evaluations.
(c) Automatically learn the dimension of latent factors.
(d) Extend the DCMF algorithm to capture lag effects in a network of time series.
(e) Extend the DCMF algorithm for dynamic networks.
3. April 2015 - May 2015, Write the dissertation and prepare for defense.
4. Finally, Defend my thesis in May 2015.
CHAPTER 5
Conclusion
My proposal is dedicated to address the challenges in network measurement and man-
agement. In the first part of this proposal, I investigate the impact of wireless network
(specifically wireless local area network) on security monitoring. I first identify the secu-
rity challenge resulted from the easy-to-access WiFi networks and propose a distributed
security monitoring system for wireless network forensics. Then, I explore how WiFi
networks can be used to improve home security. Specifically, I aim to design and de-
velop a low-cost home security system to detect unexpected physical intruders based on
existing WiFi infrastructures. In the second part of this proposal, I focus on analyzing the
big data that wireless networks produce. Specifically, I investigate effective and efficient
approaches in analyzing network data generated by smartphone apps and wireless sensor
networks.
48
REFERENCES
[1] Cisco visual networking index: Forecast and methodology, 20132018.http://www.cisco.com/c/en/us/solutions/collateral/
service-provider/ip-ngn-ip-next-generation-network/
white_paper_c11-481360.pdf.
[2] Euclid analytics. http://euclidanalytics.com/.
[3] Haversine formula.http://en.wikipedia.org/wiki/haversine_formula.
[4] Kismet. https://www.kismetwireless.net/.
[5] Motes dataset. http://db.csail.mit.edu/labdata/labdata.html.
[6] Number of landline subscribers.http://www.itu.int/en/ITU-D/Statistics/Documents/
statistics/2014/Fixed_tel_2000-2013.xls.
[7] Presence orb. http://www.presenceorb.com/.
[8] This recycling bin is following you. http://qz.com/112873/this-recycling-bin-is-following-you/.
[9] tshark.https://www.wireshark.org/docs/man-pages/tshark.html.
[10] Turnstyle solutions. https://www.getturnstyle.com/.
[11] Wireless sensor network.http://en.wikipedia.org/wiki/Wireless_sensor_network.
[12] Paramvir Bahl, Jitendra Padhye, Lenin Ravindranath, Manpreet Singh, AlecWolman, and Brian Zill. Dair: A framework for managing enterprise wirelessnetworks using desktop infrastructure. In Proceedings of the Annual ACMWorkshop on Hot Topics in Networks (HotNets), 2005.
[13] Paramvir Bahl and Venkata N Padmanabhan. Radar: An in-building rf-based userlocation and tracking system. In INFOCOM 2000. Nineteenth Annual JointConference of the IEEE Computer and Communications Societies. Proceedings.IEEE, volume 2, pages 775–784. Ieee, 2000.
49
50
[14] Anand Balachandran, Geoffrey M. Voelker, Paramvir Bahl, and P. Venkat Rangan.Characterizing user behavior and network performance in a public wireless lan.SIGMETRICS Perform. Eval. Rev., 30(1):195–205, June 2002.
[15] Robert Bell, Yehuda Koren, and Chris Volinsky. Modeling relationships at multiplescales to improve accuracy of large recommender systems. In KDD, pages 95–104.ACM, 2007.
[16] Christopher M Bishop et al. Pattern recognition and machine learning, volume 1.springer New York, 2006.
[17] Yongjie Cai and Ping Ji. A measurement study for understanding wireless forensicmonitoring. In ICDFI, 2012.
[18] Xi Chen, Andrea Edelstein, Yunpeng Li, Mark Coates, Michael Rabbat, andAidong Men. Sequential monte carlo for simultaneous passive device-free trackingand sensor localization using received signal strength measurements. InInformation Processing in Sensor Networks (IPSN), 2011 10th InternationalConference on, pages 342–353. IEEE, 2011.
[19] Yu-Chung Cheng, John Bellardo, Peter Benko, Alex C Snoeren, Geoffrey MVoelker, and Stefan Savage. Jigsaw: solving the puzzle of enterprise 802.11analysis, volume 36. ACM, 2006.
[20] Yu-Chung Cheng, Yatin Chawathe, Anthony LaMarca, and John Krumm.Accuracy characterization for metropolitan-scale wi-fi localization. In Proceedingsof the 3rd international conference on Mobile systems, applications, and services,pages 233–245. ACM, 2005.
[21] Krishna Chintalapudi, Anand Padmanabha Iyer, and Venkata N Padmanabhan.Indoor localization without the pain. In Proceedings of the sixteenth annualinternational conference on Mobile computing and networking, pages 173–184.ACM, 2010.
[22] Trinh Minh Tri Do, Jan Blom, and Daniel Gatica-Perez. Smartphone usage in thewild: a large-scale analysis of applications and context. In Proceedings of the 13thinternational conference on multimodal interfaces, ICMI ’11, pages 353–360, NewYork, NY, USA, 2011. ACM.
[23] David Eckhardt and Peter Steenkiste. Measurement and analysis of the errorcharacteristics of an in-building wireless network. In ACM SIGCOMM Computercommunication review, volume 26, pages 243–254. ACM, 1996.
[24] Hossein Falaki, Dimitrios Lymberopoulos, Ratul Mahajan, Srikanth Kandula, andDeborah Estrin. A first look at traffic on smartphones. In Proceedings of the 10thACM SIGCOMM conference on Internet measurement, IMC ’10, pages 281–287,New York, NY, USA, 2010. ACM.
51
[25] Hossein Falaki, Ratul Mahajan, Srikanth Kandula, Dimitrios Lymberopoulos,Ramesh Govindan, and Deborah Estrin. Diversity in smartphone usage. InProceedings of the 8th international conference on Mobile systems, applications,and services, MobiSys ’10, pages 179–194, New York, NY, USA, 2010. ACM.
[26] Camden C Ho, Krishna N Ramachandran, Kevin C Almeroth, and Elizabeth MBelding-Royer. A scalable framework for wireless network monitoring. InProceedings of the 2nd ACM international workshop on Wireless mobileapplications and services on WLAN hotspots, pages 93–101. ACM, 2004.
[27] Jun Hou, Chengdong Wu, Zhongjia Yuan, Jiyuan Tan, Qiaoqiao Wang, and YunZhou. Research of intelligent home security surveillance system based on zigbee.In Intelligent Information Technology Application Workshops, 2008. IITAW’08.International Symposium on, pages 554–557. IEEE, 2008.
[28] Meng Jiang, Peng Cui, Rui Liu, Qiang Yang, Fei Wang, Wenwu Zhu, and ShiqiangYang. Social contextual recommendation. In CIKM, pages 45–54, New York, NY,USA, 2012. ACM.
[29] Yoon-Gu Kim, Han-Kil Kim, Suk-Gyu Lee, and Ki-Dong Lee. Ubiquitous homesecurity robot based on sensor network. In Proceedings of the IEEE/WIC/ACMInternational Conference on Intelligent Agent Technology, IAT ’06, pages700–704, Washington, DC, USA, 2006. IEEE Computer Society.
[30] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniquesfor recommender systems. Computer, 42(8):30–37, 2009.
[31] Lei Li, James McCann, Nancy S Pollard, and Christos Faloutsos. Dynammo:Mining and summarization of coevolving sequences with missing values. In KDD,pages 507–516. ACM, 2009.
[32] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King. Sorec: socialrecommendation using probabilistic matrix factorization. In CIKM, pages931–940. ACM, 2008.
[33] Ratul Mahajan, Maya Rodrig, David Wetherall, and John Zahorjan. Analyzing themac-level behavior of wireless networks in the wild. In ACM SIGCOMMComputer Communication Review, volume 36, pages 75–86. ACM, 2006.
[34] Gregor Maier, Fabian Schneider, and Anja Feldmann. A first look at mobilehand-held device traffic. In Proceedings of the 11th international conference onPassive and active measurement, PAM’10, pages 161–170, Berlin, Heidelberg,2010. Springer-Verlag.
[35] Andriy Mnih and Ruslan Salakhutdinov. Probabilistic matrix factorization. InAdvances in neural information processing systems, pages 1257–1264, 2007.
52
[36] Neal Patwari and Joey Wilson. Rf sensor networks for device-free localization:Measurements, models, and algorithms. Proceedings of the IEEE,98(11):1961–1973, 2010.
[37] Nissanka Bodhi Priyantha. The cricket indoor location system. PhD thesis,Massachusetts Institute of Technology, 2005.
[38] Feng Qian, Zhaoguang Wang, Alexandre Gerber, Zhuoqing Mao, Subhabrata Sen,and Oliver Spatscheck. Profiling resource usage for mobile applications: across-layer approach. In Proceedings of the 9th international conference on Mobilesystems, applications, and services, MobiSys ’11, pages 321–334, New York, NY,USA, 2011. ACM.
[39] Mika Raento, Antti Oulasvirta, and Nathan Eagle. Smartphones: An EmergingTool for Social Scientists. Sociological Methods Research, 37(3):426–454,February 2009.
[40] Theodore S Rappaport et al. Wireless communications: principles and practice,volume 2. prentice hall PTR New Jersey, 1996.
[41] Herbert E Rauch, CT Striebel, and F Tung. Maximum likelihood estimates oflinear dynamic systems. AIAA journal, 3(8):1445–1450, 1965.
[42] David Schwab and Rick Bunt. Characterising the use of a campus wirelessnetwork. In INFOCOM 2004. Twenty-third AnnualJoint Conference of the IEEEComputer and Communications Societies, volume 2, pages 862–870. IEEE, 2004.
[43] M Zubair Shafiq, Lusheng Ji, Alex X Liu, and Jia Wang. Characterizing andmodeling internet traffic dynamics of cellular devices. In Proceedings of the ACMSIGMETRICS joint international conference on Measurement and modeling ofcomputer systems, pages 305–316. ACM, 2011.
[44] John Z Sun, Kush R Varshney, and Karthik Subbian. Dynamic matrix factorization:A state space approach. In ICASSP, pages 1897–1900. IEEE, 2012.
[45] Diane Tang and Mary Baker. Analysis of a local-area wireless network. InProceedings of the 6th annual international conference on Mobile computing andnetworking, pages 1–10. ACM, 2000.
[46] Ionut Trestian, Supranamaya Ranjan, Aleksandar Kuzmanovic, and Antonio Nucci.Measuring serendipity: connecting people, locations and interests in a mobile 3gnetwork. In Proceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference, IMC ’09, pages 267–279, New York, NY, USA, 2009.ACM.
[47] Chieh-Chih Wang, Charles Thorpe, Sebastian Thrun, Martial Hebert, and HughDurrant-Whyte. Simultaneous localization, mapping and moving object tracking.The International Journal of Robotics Research, 26(9):889–916, 2007.
53
[48] Joey Wilson and Neal Patwari. Radio tomographic imaging with wirelessnetworks. Mobile Computing, IEEE Transactions on, 9(5):621–632, 2010.
[49] Joey Wilson and Neal Patwari. See-through walls: Motion tracking usingvariance-based radio tomography networks. Mobile Computing, IEEE Transactionson, 10(5):612–621, 2011.
[50] Qiang Xu, Jeffrey Erman, Alexandre Gerber, Zhuoqing Mao, Jeffrey Pang, andShobha Venkataraman. Identifying diverse usage behaviors of smartphone apps. InProceedings of the 2011 ACM SIGCOMM conference on Internet measurementconference, IMC ’11, pages 329–344, New York, NY, USA, 2011. ACM.
[51] Yang Zhao, Neal Patwari, Jeff M Phillips, and Suresh Venkatasubramanian. Radiotomographic imaging and tracking of stationary and moving people via kerneldistance. In Proceedings of the 12th international conference on Informationprocessing in sensor networks, pages 229–240. ACM, 2013.