• BICRITERIA OPTIMIZATION OF ENERGY-EFFICIENT PLACEMENT AND ROUTING IN HETEROGENEOUS WIRELESS SENSOR NETWORKS
• TIME SERIES DATA MINING
Mustafa Gokce Baydogan
Singapore Management University, 5/10/2012
Mustafa Gokce Baydogan
PhD Candidate
School of Computing, Informatics and Decision Systems Engineering
Arizona State University (ASU) Tempe, AZ, USA
BICRITERIA OPTIMIZATION OF ENERGY-EFFICIENT PLACEMENT AND ROUTING IN HETEROGENEOUS WIRELESS SENSOR NETWORKS
Mustafa Gökçe Baydoğan
School of Computing, Informatics and Decision Systems Engineering
Arizona State University (ASU) Tempe, AZ, USA
Nur Evin Özdemirel, PhD
Department of Industrial Engineering
Middle East Technical University (METU) Ankara, Turkey
MOTIVATION
SOCIOECONOMIC
– Environmental monitoring
– Air, soil or water monitoring
– Habitat monitoring
– Seismic detection
– Military surveillance
– Battlefield monitoring
– Sniper localization
– Nuclear, biological or chemical attack detection
– Disaster area monitoring
RESEARCH
DESIGN ISSUES IN WSNs
Deployment
random vs deterministic; one-time vs iterative
Mobility
mobile vs immobile
Heterogeneity
homogeneous vs heterogeneous
Communication modality
radio vs light vs sound
Infrastructure
infrastructure vs ad hoc
Network Topology
single-hop vs star vs tree vs mesh
Römer and Mattern, 2004, The Design Space of Wireless Sensor Networks, IEEE Wireless Communications, 11:6, 54-61
PROBLEM CHARACTERISTICS
There are some events (targets) to be sensed in the monitoring area
Locate sensors (battery powered) at possible locations so that events are sensed (detected) with a given probability
Determine the rate of data flow between the sensors and the sink node (base station)
[Figure: monitoring area with sensors and sink; the two competing objectives are TOTAL SENSOR COST and NETWORK LIFETIME]
PROBLEM DEFINITION
OBJECTIVES
– Minimize total cost of sensors deployed
– Maximize lifetime of the network
DECISIONS
– Location of heterogeneous sensors
– Data routing
CONSTRAINTS
– Connectivity
– Node (sensor) and channel (link) capacity
– Coverage
– Battery power
LITERATURE
PROBLEM DEFINITION
CONNECTIVITY
A sensor of type k located at location i can communicate with a sensor of type k' located at location j if

dist_ij ≤ min(cr_k, cr_k')

where cr_k is the communication range of a type-k sensor and dist_ij is the distance between locations i and j.
[Figure: (a) two sensors within each other's communication range; (b) two sensors that cannot communicate]
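As a concrete illustration, the connectivity condition can be checked directly; the coordinates and ranges below are hypothetical values, not ones from the test problems:

```python
import math

def can_communicate(loc_i, loc_j, cr_i, cr_j):
    """Two sensors can communicate if the Euclidean distance between
    their locations is at most the smaller of the two communication ranges."""
    dist_ij = math.dist(loc_i, loc_j)
    return dist_ij <= min(cr_i, cr_j)

# Hypothetical example: two sensors with 3 m and 5 m communication ranges
print(can_communicate((0, 0), (0, 2.5), 3.0, 3.0))  # 2.5 m <= 3 m -> True
print(can_communicate((0, 0), (4, 0), 3.0, 5.0))    # 4 m > min(3, 5) -> False
```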
PROBLEM DEFINITION
COVERAGE
Coverage is expressed through the detection probability of a target at point p by a sensor of type k located at location i. The strength of the sensor signal decreases as distance increases†:

pr_ipk = e^(−β_k · dist_ip)   if dist_ip ≤ sr_k
pr_ipk = 0                    otherwise

Given the location decisions x_ik, the detection probability of a target at point p is

pr_p = 1 − ∏_{(i,k) ∈ B_p} (1 − pr_ipk)^(x_ik)

† Zou and Chakrabarty, 2005, A Distributed Coverage- and Connectivity-Centric Technique for Selecting Active Nodes in Wireless Sensor Networks, IEEE Transactions on Computers
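A minimal sketch of these two formulas; the β and sr values echo the type-1 sensor parameters from the test problems, but the deployment itself is made up:

```python
import math

def detection_prob(dist_ip, beta_k, sr_k):
    """pr_ipk: detection probability of a target at distance dist_ip by a
    type-k sensor; decays exponentially with distance and is zero beyond
    the sensing range sr_k."""
    return math.exp(-beta_k * dist_ip) if dist_ip <= sr_k else 0.0

def point_detection_prob(sensors, point):
    """pr_p = 1 - prod(1 - pr_ipk) over the located sensors that can
    sense point p (i.e., x_ik = 1 for every sensor in the list)."""
    miss = 1.0
    for loc, beta_k, sr_k in sensors:
        miss *= 1.0 - detection_prob(math.dist(loc, point), beta_k, sr_k)
    return 1.0 - miss

# Hypothetical deployment: two type-1 sensors (beta = 0.15, sr = 2 m)
sensors = [((0.0, 0.0), 0.15, 2.0), ((1.0, 0.0), 0.15, 2.0)]
print(point_detection_prob(sensors, (0.5, 0.0)))
```

Two nearby sensors jointly give a higher detection probability than either one alone, which is exactly what the product form captures.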
PROBLEM DEFINITION
ENERGY CONSUMPTION MODEL †
Sources of energy consumption in a sensor:
– Generating data: eg_k = γ
– Receiving data: er_ij = β
– Transmitting data: et_ij = δ + λ · dist_ij^m
where δ is a distance-independent constant term, λ is the coefficient of the distance-dependent term, dist_ij is the distance between the two locations, and m is the path loss index.
† J. Tang, B. Hao, and A. Sen, 2006, Relay node placement in large scale wireless sensor networks, Computer Communications, 29:4, 490-501
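The transmission term can be evaluated directly; δ, λ and m below are placeholder values, since the slides only give the per-type energy constants:

```python
def transmit_energy(dist_ij, delta, lam, m):
    """Energy per unit of data transmitted from i to j:
    et_ij = delta + lam * dist_ij ** m, where m is the path loss index."""
    return delta + lam * dist_ij ** m

# Hypothetical coefficients: delta = 1e-5, lam = 1e-6, path loss index m = 2
print(transmit_energy(3.0, 1e-5, 1e-6, 2))  # 1e-5 + 1e-6 * 9 = 1.9e-05
```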
PROBLEM FORMULATION
[The mathematical model was shown as images; its components are annotated below]
– Objective 1: total cost of sensors located
– Objective 2: lifetime of the network
– One sensor can be located at each location
– Data flow balance at a sensor
– All data is routed to the sink node
– Sensor capacity
– Channel (link) capacity
– Coverage
– Battery power
– Decision variables: sensor location and data flow
THE BICRITERIA PROBLEM
DOMINATION
With sensor cost z1 to be minimized and network lifetime z2 to be maximized, (z1', z2') dominates (z1'', z2'') if

z1' ≤ z1'' and z2' > z2'',  or  z1' < z1'' and z2' ≥ z2''
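In code, the domination test reads (cost z1 minimized, lifetime z2 maximized):

```python
def dominates(a, b):
    """(z1', z2') dominates (z1'', z2'') for min-cost/max-lifetime:
    no worse in both objectives and strictly better in at least one."""
    z1a, z2a = a
    z1b, z2b = b
    return (z1a <= z1b and z2a > z2b) or (z1a < z1b and z2a >= z2b)

print(dominates((10, 18), (12, 16)))  # cheaper and longer-lived -> True
print(dominates((10, 14), (12, 16)))  # cheaper but shorter-lived -> False
```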
[Figure: candidate solutions plotted as Network Lifetime vs. Sensor Cost]
FINDING PARETO OPTIMAL SOLUTIONS
A BICRITERIA PROBLEM
– Solve for z1 to find a lower bound on cost
– Solve for z2 s.t. cost ≤ z1 to find a lower bound on lifetime
– Solve for z2 to find an upper bound on lifetime
– Solve for z1 s.t. lifetime ≥ z2 to find an upper bound on cost
– For all integral cost values z1 between the bounds, solve for z2
[Figure: the bounds and the enumerated solutions on the Network Lifetime vs. Sensor Cost plane]
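The enumeration step can be sketched as follows; the single-objective solver is abstracted away as a callback (assumed to wrap the exact model), and the toy lookup table stands in for it here:

```python
def epsilon_constraint(solve_max_lifetime, cost_lb, cost_ub):
    """For every integral cost budget z1 in [cost_lb, cost_ub], solve
    'maximize lifetime s.t. total sensor cost <= z1', then filter the
    resulting (cost, lifetime) points down to the nondominated set."""
    points = []
    for z1 in range(cost_lb, cost_ub + 1):
        z2 = solve_max_lifetime(z1)
        if z2 is not None:
            points.append((z1, z2))
    # Keep only Pareto-optimal points (cost minimized, lifetime maximized)
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] > p[1] or
                       q[0] < p[0] and q[1] >= p[1] for q in points)]

# Toy stand-in solver: lifetime grows with the cost budget, with plateaus
toy = {10: 8, 11: 8, 12: 10, 13: 10, 14: 14}
pareto = epsilon_constraint(lambda z1: toy.get(z1), 10, 14)
print(pareto)  # [(10, 8), (12, 10), (14, 14)]
```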
GENETIC ALGORITHM
Why evolutionary algorithms?
• Classical search and optimization methods
– find a single solution in every iteration
– need repetitive use of a single-objective optimization method
– rely on assumptions like linearity and continuity
• Evolutionary Algorithms
– use a population of solutions in every generation
– make no such assumptions
– find and maintain multiple good solutions
• Emphasize all nondominated solutions in a population equally
• Preserve a diverse set of multiple nondominated solutions
⇒ Near-optimal, uniformly distributed, well-extended set of solutions for MO problems
Nondominated sorting approach (Goldberg, 1989)
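A minimal sketch of nondominated sorting (two objectives, both minimized for simplicity; the GA itself ranks on cost, lifetime and overall constraint violation):

```python
def nondominated_sort(points):
    """Goldberg-style nondominated sorting for two minimized objectives:
    repeatedly peel off the current nondominated front and assign
    increasing ranks (front 0 is the best)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = {i for i in remaining
                 if not any(points[j][0] <= points[i][0] and
                            points[j][1] <= points[i][1] and
                            points[j] != points[i]
                            for j in remaining)}
        fronts.append(sorted(front))
        remaining -= front
    return fronts

# (1, 5) and (3, 2) are mutually nondominated, so they share front 0
print(nondominated_sort([(1, 5), (3, 2), (4, 4), (2, 6)]))  # [[0, 1], [2, 3]]
```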
GENETIC ALGORITHM
– Convergence to the Pareto optimal front
– A diverse set of solutions along the Pareto optimal front
GENETIC ALGORITHM
REPRESENTATION
A chromosome stores, for each possible location, the type of the sensor located there, e.g. [0 1 2 … 1 3 … 0 3 0] over locations 1, …, n (0 means no sensor).
Disadvantages
– Flow allocation is not stored
– Lifetime cannot be determined directly
– Finding feasible solutions after the mutation and crossover operators is very hard
Advantages
– The problem reduces to an LP with given sensor locations
– By solving the LP, the maximum lifetime and constraint violations can be determined
FITNESS
Based on nondominated sorting idea
considering three objectives
– Total sensor cost
– Network lifetime
– Overall constraint violation
• Connectivity
• Coverage
• Capacity violations (channel and sensor)
GENETIC ALGORITHM
INITIAL POPULATION GENERATION
– Two-phase approach
– Sensor location
– Location according to target coordinates
– Relay location
– Location according to sensor coordinates
MUTATION
– Repair and improve
– Repair coverage constraints
– Improve cost and lifetime objectives
– Repair connectivity constraints
– Improve cost objective
TEST PROBLEMS
Small problems
– Problems with 24 possible locations
– Problems with 40 possible locations
[Figure: PS24 (25 possible grid locations) and PS40 (41 possible grid locations) on the 10×10 monitoring area; BS marks the base station]
50 targets are dispersed across the monitoring area
Each target has a random coverage threshold uniformly distributed between 0.7 and 1
The rate of data generated for each target is a random integer between 1 Kbps and 3 Kbps
COMPUTATIONAL RESULTS
PERFORMANCE MEASURES
Proximity Indicator (PI): for each solution found, find the Pareto optimal solution with the closest normalized Tchebychev distance.
Reverse Proximity Indicator (RPI): for each Pareto optimal solution, find the solution with the closest normalized Tchebychev distance.
Hypervolume Indicator (HI): find the ratio of the area bounded by the nadir point that cannot be covered.
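A sketch of PI and RPI under one plausible reading (averaging the closest-point distances); the objective ranges used for normalization and the two point sets are hypothetical:

```python
def tchebychev(a, b, ranges):
    """Normalized Tchebychev distance: the largest coordinate-wise
    difference after scaling each objective by its range."""
    return max(abs(x - y) / r for x, y, r in zip(a, b, ranges))

def proximity(found, pareto, ranges):
    """PI: average, over the solutions found, of the distance to the
    closest Pareto-optimal solution (RPI swaps the two sets)."""
    return sum(min(tchebychev(f, p, ranges) for p in pareto)
               for f in found) / len(found)

pareto = [(10.0, 18.0), (12.0, 20.0)]   # hypothetical Pareto front
found = [(10.0, 17.0), (13.0, 20.0)]    # hypothetical GA solutions
ranges = (10.0, 10.0)                   # hypothetical objective ranges
pi = proximity(found, pareto, ranges)   # GA solutions -> Pareto front
rpi = proximity(pareto, found, ranges)  # Pareto front -> GA solutions
print(pi, rpi)
```

Small PI means every solution found is near some Pareto-optimal point; small RPI means every Pareto-optimal point is near some solution found, i.e. the front is well covered.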
COMPUTATIONAL RESULTS
Smaller Problems

Problem  Constraint  # of feasible  GA performance measures    Number of solutions                CPU time (s)
size     tightness   problems       RPI     PI      HI         ε-constraint  GA     GA=Exact    ε-constraint  GA
PS24     LC          30/30          0.0317  0.0220  0.0558     10.20         9.27   3.70        1118          110
PS24     TC          29/30          0.0761  0.0574  0.1734     7.47          6.57   1.93        70            110
PS40     LC(1)       10/10          0.0464  0.0489  0.1164     13.60         12.80  1.30        100088        798
PS40     LC(2)       20/20          -       -       -          13.00         14.20  -           85983         821
PS40     TC          30/30          0.0744  0.0780  0.1957     11.60         11.10  1.13        12865         797

(1) Results across 10 instances that are solved exactly by the ε-constraint approach.
(2) Results across 20 instances that are solved approximately by the ε-constraint approach.
COMPUTATIONAL RESULTS
[Figures: computational results shown graphically]
TEST PROBLEMS
We also introduce larger test problems
[Figure: larger test problems with 99 possible locations and with 111 possible locations on a 20×15 monitoring area]
COMPUTATIONAL RESULTS
Larger Problems
CONCLUSION, COMMENTS AND FURTHER RESEARCH
– The GA provides reasonable solution quality with better solution times
⇒ By representing the area with more grid points (a finer approximation of the continuous space), we can obtain better solutions than the exact approach, even with better solution times
Future research
– Modification of the ε-constraint approach
– Use of sensitivity analysis results (promising)
– Incorporating the decision maker's preferences
– Different objectives such as minimization of total delay, total hop count or average path length
– Special network requirements such as K-coverage or K-connectivity
THANK YOU...
QUESTIONS AND COMMENTS?
Supplemental material
OUTLINE
– MOTIVATION
– PROBLEM DEFINITION
– GENETIC ALGORITHM
– COMPUTATIONAL RESULTS
– CONCLUSION
TEST PROBLEMS
Small problems
– Problems with 24 possible locations
– Problems with 40 possible locations
EXACT SOLUTION
[Figure: PS24 (25 possible grid locations) and PS40 (41 possible grid locations); BS marks the base station]
50 targets are dispersed across the monitoring area
Each target has a random coverage threshold uniformly distributed between 0.7 and 1
The rate of data generated for each target is a random integer between 1 Kbps and 3 Kbps
Parameter  Sensor type (k=1)   Sensor type (k=2)      Relay (k=3)
scap_k     100 Kbps            200 Kbps               150 Kbps
sr_k       2 m                 3 m                    0 m
cr_k       3 m                 5 m                    3 m
c_k        1                   2                      1
β_k        0.15                0.1                    -
e_k        10^-5 EnergyUnits   1.5×10^-5 EnergyUnits  10^-5 EnergyUnits

Parameter  Sensor type (k=1)   Sensor type (k=2)      Relay (k=3)
scap_k     40 Kbps             80 Kbps                40 Kbps
sr_k       2 m                 3 m                    0 m
cr_k       3 m                 5 m                    3 m
c_k        1                   2                      1
β_k        0.15                0.1                    -
e_k        10^-5 EnergyUnits   1.5×10^-5 EnergyUnits  10^-5 EnergyUnits
GENETIC ALGORITHM
Why evolutionary algorithms?
• Classical search and optimization methods
– find a single solution in every iteration
– need repetitive use of a single-objective optimization method
– rely on assumptions like linearity and continuity
• Evolutionary Algorithms
– use a population of solutions in every generation
– make no such assumptions
– eliminate the need for parameters (like weight, ε or target vectors)
– find and maintain multiple good solutions
• Emphasize all nondominated solutions in a population equally
• Preserve a diverse set of multiple nondominated solutions
⇒ Near-optimal, uniformly distributed, well-extended set of solutions for MO problems
COMPUTATIONAL RESULTS
[Figures: computational results shown graphically]
CONCLUSION
– RPI and HI worsen as the problem size increases from 24 to 40 when capacity
constraints are loose
– TC instances are harder to solve for the GA compared to LC instances, whereas
they are easier for the ε-constraint approach.
– Performance measures for tight capacity are about twice as large as those for
loose capacity.
– The problem size has less effect on the performance measures when the capacity
constraints are tight.
– When the capacity constraints are loose, the GA solves problems of size 24 in one
tenth of the ε-constraint CPU times. For problems of size 40, GA CPU time is about
100 times shorter than ε-constraint time.
– For the tight capacity case, GA CPU times are slightly longer than ε-constraint
times with 24 possible locations, but they are 15 times shorter with 40 possible
locations.
– For problems with 99 and 111 possible locations, the GA converges to a solution in
about 160 minutes.
Time series data mining
Mustafa Gokce Baydogan, George Runger
5/10/2012
Data mining and Operations Research
What is Data Mining?
– Extracting meaningful, previously unknown patterns or knowledge from large databases
– The knowledge discovery process:
Define Objective (business/scientific objective, data mining objective)
→ Prepare Data (data cleaning, data selection, attribute selection)
→ Mine Knowledge (visualization, classification, association rule discovery, clustering)
→ Interpret Results (predictive models, structural insights)
Interdisciplinary Field
Data mining lies at the intersection of Statistics, Databases, Optimization and Machine Learning.
Time series data mining
What is a time series?
– A (numeric) time series is a sequence of observations of a numeric property over time, e.g. -1.25, -1.00, 0.01, 0.01, 0.05, …, 5.45, 0.00, …
Time series data mining
Motivations
– Time series are everywhere (ECG, heartbeat, stock prices)
– Much of the information (data) produced in a variety of areas is time series
– About 50% of all newspaper graphics are time series
Images from E. Keogh. A decade of progress in indexing and mining large time series databases. In VLDB, page 1268, 2006.
Time series data mining
Motivations
– Other types of data can be converted to time series.
– Everything is about the representation.
– Example: recognizing words. A word can be represented by two time series created by moving over and under the word; an example is the word "Alexandria" from the dataset of word profiles for George Washington's manuscripts.
Images from E. Keogh. A quick tour of the datasets for VLDB 2008. In VLDB, 2008.
Time series data mining
Examples
– Recognizing trees from leaf images
– Understanding what is related to the difficulty of a certain task (pupil dilation, EEG, emotions)
Images from E. Keogh. A decade of progress in indexing and mining large time series databases. In VLDB, page 1268, 2006.
Time Series Data Mining Tasks
– Clustering
– Classification
– Query by Content
– Rule Discovery (e.g., a rule with support s = 0.5 and confidence c = 0.3)
– Motif Discovery
– Anomaly Detection
– Visualization
Time series classification
– A supervised learning problem aimed at labeling temporally structured univariate (or multivariate) sequences of fixed (or variable) length.
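As a point of reference, a common baseline for this problem is one-nearest-neighbor classification over whole series with Euclidean distance; this is a generic baseline, not the method proposed in this talk, and the tiny labeled set below is made up:

```python
def one_nn_label(query, train):
    """1-nearest-neighbor time series classification with Euclidean
    distance: return the label of the closest training series."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(train, key=lambda item: dist(query, item[0]))[1]

# Hypothetical training set of length-4 series with shape labels
train = [([0.0, 1.0, 2.0, 1.0], "rise-fall"),
         ([2.0, 1.0, 0.0, 1.0], "fall-rise")]
print(one_nn_label([0.1, 0.9, 2.1, 1.2], train))  # rise-fall
```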
Datasets
– Datasets are from different domains: word recognition, medicine, energy, biology, face recognition, image and video classification, robotics, gesture recognition, astronomy
A Bag-of-Features Framework to classify time series (TSBF)
– Bag of features: a common method used for image classification
– also referred to as bag of words in document analysis and bag of frames in audio and speech recognition
– Accurate even with simple shape-based features
– TSBF provides a framework for time series classification; alternative algorithms for the following tasks may provide better solutions:
– Local feature extraction
– Codebook generation
– Classification
The details and the code of TSBF and the datasets are provided in http://www.mustafabaydogan.com/research/time-series-classification.html
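A toy sketch of the bag-of-features pipeline using simple shape features (window mean and start-to-end slope); the two-word codebook and the series are made up, and the actual TSBF algorithm differs in its feature set, codebook generation and classifier (see the link above):

```python
def subsequence_features(series, win):
    """Extract simple shape features (mean, slope) from sliding windows,
    a minimal stand-in for richer local feature extraction."""
    feats = []
    for s in range(len(series) - win + 1):
        seg = series[s:s + win]
        feats.append((sum(seg) / win, (seg[-1] - seg[0]) / (win - 1)))
    return feats

def histogram(feats, centers):
    """Assign each local feature to its nearest codeword and count the
    assignments; the histogram is the series-level representation that a
    standard classifier would then consume."""
    counts = [0] * len(centers)
    for f in feats:
        nearest = min(range(len(centers)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(f, centers[c])))
        counts[nearest] += 1
    return counts

# Hypothetical 2-word codebook: 'flat' vs 'rising' local behavior
centers = [(0.0, 0.0), (1.0, 1.0)]
series = [0.0, 0.0, 0.1, 1.0, 2.0, 3.0]
print(histogram(subsequence_features(series, 3), centers))  # [2, 2]
```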
Supervised Time Series Pattern Discovery through Local Importance (TS-PD)
– TS-PD aims at finding patterns for interpretability
– TS-PD identifies regions of interest
– Provides a visualization tool for understanding underlying relations
– A fast approach to detect the local information related to the classification
The details and the code of TS-PD and the datasets will be provided soon in http://www.mustafabaydogan.com/research/time-series-pattern-discovery.html
TS-PD Example
– Extending TS-PD to multivariate time series classification
– Gesture recognition task [12]
– Acceleration of the hand on the x, y and z axes
– Classify gestures (8 different types of gestures)
Using DM as a tool
– Decision makers are interested in knowledge that permits them to do their jobs better by taking specific actions in response to the newly discovered knowledge.
– Usually a data mining algorithm is executed first, and then profitable actions are determined based on the results from the data mining.
– Example: market basket analysis; association rule mining to decide the location of items in the supermarket.
Using DM as a tool
Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Put diapers and beer on the same shelf???
Implication means co-occurrence, not causality!
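Support and confidence for these five transactions can be computed directly, using the usual definitions and the {Diaper} → {Beer} rule as the example:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Co-occurrence strength of the rule lhs -> rhs (not causality)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Beer"}))       # 3/5 = 0.6
print(confidence({"Diaper"}, {"Beer"}))  # 3/4 = 0.75
```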
Using DM as a tool
– Root cause analysis in networks
– Supply chain networks
– Identify corrupt nodes and their relations
– Why are my deliveries late?
Using DM as a tool
– Transaction data, with several factors affecting the network:

id  Stage 1  Stage 2  Stage 3  Stage 4  Weather status at stage 1  …  Road status between stages 1 and 2  …  Transportation vehicle between stages 1 and 2  Weight  Delayed?
1   S2       P1       D2       C2       Sunny                      …  good                                …  Plane                                          30 lbs  Yes
2   S5       P3       D4       C1       Rainy                      …  bad                                 …  Truck                                          40 lbs  No
3   .        .        .        .        .                          .  .                                   .  .                                              .       .
…   …        …        …        …        …                          …  …                                   …  …                                              …       …
N   .        .        .        .        .                          .  .                                   .  .                                              .       .
Using DM as a tool
– DM is required since:
– The data is high dimensional
– There may be missing values in the data
– Not all indicators are numerical
– Identify interactions between the network nodes to find out the causes of delay: what decisions are causing the delay?
– Take actions:
– Modification of the optimization algorithm
– Introduce constraints based on the learning (data mining result)
– Simulation to generate more data, with further analysis of the simulated data
Future directions
– Reinforcement learning
– The decision-maker recognizes her state within the environment and reacts by initiating an action.
– Consequently she obtains a reward signal and enters another state.
– Example application: dynamic pricing
Future directions
– The mechanism that generates reward signals and introduces new states is referred to as the dynamics of the environment.
– The agent is unfamiliar with the dynamics of the environment and therefore initially cannot correctly predict the outcome of its actions.
– As the agent interacts with the environment and observes the actual consequences of its decisions, it can gradually adapt its behavior accordingly.
Future directions
– Dynamic programming is widely used to solve this problem; however, the environment can be highly unpredictable.
– Modeling the environment efficiently is important† because it provides knowledge about the domain that produced the data.
– Revisiting the dynamic pricing problem: in a game-theoretic setting all players are assumed to be rational, but is that true? Predicting an opponent's proposed price in advance reduces uncertainty in the environment.
– Another example: if we know that a certain pattern observed in the stock price led to high profit under certain conditions in the past, this may be important in taking actions.
† L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multi-agent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, pp. 156-172, Mar. 2008.
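The agent-environment loop described above can be sketched as tabular Q-learning; the two-state dynamic-pricing environment below is purely illustrative, not a model from the talk:

```python
import random

def q_learning(env_step, states, actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning: act (epsilon-greedy), observe reward and next
    state from the unknown environment dynamics, and update the estimate
    toward reward + gamma * max_b Q(next_state, b)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(20):  # bounded episode length
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s2 = env_step(s, a, rng)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                                  - Q[(s, a)])
            s = s2
    return Q

# Toy dynamic-pricing stand-in: demand is higher in the "low" state and a
# lower price sells more often (purely illustrative dynamics)
def env_step(state, price, rng):
    p_sale = {"low": 0.9, "high": 0.5}[state] - 0.05 * price
    sold = rng.random() < p_sale
    reward = price if sold else 0.0
    return reward, ("high" if sold else "low")

Q = q_learning(env_step, ["low", "high"], [2, 5, 8])
best = {s: max([2, 5, 8], key=lambda a: Q[(s, a)]) for s in ["low", "high"]}
print(best)
```

After enough interaction the learned Q-values encode a pricing policy per state, without the agent ever being given the sale probabilities.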
Thanks for your patience!
Questions and comments?
Supplemental material
DM and OR
– Using OR for DM: optimization algorithms used for DM
– Data visualization
– Attribute selection
– Classification
– Unsupervised learning
– Using DM as a tool for decision making: data mining can complement traditional OR methods in many areas
– Discover patterns for actionable knowledge
– Example applications: supply chain management (e.g., finding corrupt nodes)
Using OR for DM
Data Visualization
– Visualizing the data is important in any data mining project
– Generally difficult because the data is high-dimensional, i.e., hundreds or thousands of attributes (variables)
– How can we best visualize such data in 2 or 3 dimensions?
– Traditional techniques include multidimensional scaling, which uses nonlinear optimization
Optimization Formulation
– Combinatorial optimization formulation by Abbiw-Jackson, Golden, Raghavan, and Wasil (2004)
– Map a set M of m points from R^r to R^q, q = 2, 3
– Approximate the q-dimensional space by a lattice N

min Σ_{i∈M} Σ_{j∈M} Σ_{k∈N} Σ_{l∈N} F[d_original(i,j), d_new(k,l)] · x_ik · x_jl
s.t. Σ_{k∈N} x_ik = 1, ∀i ∈ M
x_ik ∈ {0,1}

where d_original(i,j) is a distance measure in R^r, d_new(k,l) is a distance measure in R^q, and F is a function such as least squares or the Sammon map.
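For a fixed assignment x, the objective is straightforward to evaluate; the sketch below uses least squares for F, with hypothetical points and lattice:

```python
import math

def mds_objective(points, lattice, assign):
    """Least-squares instance of the combinatorial MDS objective:
    sum over point pairs (i, j) of (d_original(i,j) - d_new(k,l))^2,
    where k = assign[i] and l = assign[j] are the lattice cells with
    x_ik = x_jl = 1."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            d_orig = math.dist(points[i], points[j])
            d_new = math.dist(lattice[assign[i]], lattice[assign[j]])
            total += (d_orig - d_new) ** 2
    return total

# Hypothetical example: three 4-d points mapped onto a 2-d unit lattice
points = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 3, 4, 0)]
lattice = [(0, 0), (1, 0), (0, 1), (1, 1)]
good = mds_objective(points, lattice, [0, 1, 2])
bad = mds_objective(points, lattice, [0, 0, 0])  # all points on one cell
print(good < bad)  # the spread-out assignment fits the distances better
```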
Optimization Formulation
– This is a Quadratic Assignment Problem (QAP)
– Not possible to solve exactly for large-scale problems
– A local search procedure was proposed
Using OR for DM
Data Clustering
– Identify natural clusters or groupings of data instances
– There are many possible sets of clusters
– What makes a set of clusters good?
– Minimize the distance within clusters
– Maximize the distance between clusters
Optimization Formulation
k-medoid clustering
– Select k points to be the cluster centers
– Assign the other points to the clusters so that the within-cluster distance (sum of interpoint distances) is minimized
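A greedy PAM-style sketch of this idea: swap in any non-medoid point that lowers the total within-cluster distance, and repeat until no swap helps (a local search sketch, not an exact method):

```python
def kmedoid_cost(dist, medoids, points):
    """Total within-cluster distance when each point is assigned to its
    nearest medoid -- the quantity k-medoid clustering minimizes."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def kmedoids(points, k, dist):
    """Greedy k-medoid selection: start from the first k points and swap
    in any non-medoid that lowers the total cost, until convergence."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                trial = medoids[:i] + [p] + medoids[i + 1:]
                if kmedoid_cost(dist, trial, points) < kmedoid_cost(dist, medoids, points):
                    medoids, improved = trial, True
    return medoids

# 1-d toy data with two obvious clusters; each cluster's middle point wins
d1 = lambda a, b: abs(a - b)
print(sorted(kmedoids([1, 2, 3, 10, 11, 12], 2, d1)))  # [2, 11]
```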