Talis Vertriest
Urban Data Mining applied to sound sensor networks
Academic year 2015-2016
Faculty of Engineering and Architecture
Department of Information Technology
Chair: Prof. dr. ir. Daniël De Zutter
Master's dissertation submitted in order to obtain the academic degree of Master of Science in Industrial Engineering and Operations Research
Supervisor: Prof. dr. ir. Dick Botteldooren
Counsellor: Prof. dr. ir. Bert De Coensel
"The author gives permission to make this master dissertation available for
consultation and to copy parts of this master dissertation for personal use. In all
cases of other use, the copyright terms have to be respected, in particular with regard
to the obligation to state explicitly the source when quoting results from this master
dissertation."
(June 1, 2016)
Preface
While my name is on the front cover of this thesis, I am by no means its sole
contributor. There are a number of people behind this project who deserve to be both
acknowledged and thanked: committed supervisors, generous friends and a warm and
supportive family.
I would like to thank my thesis committee members, Professor Dick Botteldooren and
Professor Bert De Coensel for their guidance and unrelenting support through this
process. Both have routinely shared their passion and knowledge, which are to the
great benefit of this thesis.
I thank all my friends, and in particular my friend and postdoctoral scientist in
neurosciences Dr. Ken Veys, for his valuable contribution and advice. This research
looks very different because of his expertise, technical help, and the additional
computing power of his MacBook Pro.
I would like to express my deepest appreciation for my parents who did everything in
their power and beyond to fire fight my worries, concerns and anxieties, and have
worked to instill great confidence in both myself and my work.
Most importantly of all, I feel blessed that I was able to accomplish this thesis, and I
express deep gratitude towards Mother Nature and all the people that contributed to
who I am today. In the same vein, I would like to extend great thanks to the
University of Ghent, to all professors and to everybody involved in the education of
our society.
Abstract
Title: URBAN DATA MINING APPLIED TO SOUND SENSOR NETWORKS
Name: TALIS VERTRIEST
Supervisor: PROF. DR. IR. DICK BOTTELDOOREN
Counsellor: PROF. DR. IR. BERT DE COENSEL
Degree: MASTER OF SCIENCE
Major Field: INDUSTRIAL ENGINEERING AND OPERATIONS
RESEARCH
Department: INFORMATION TECHNOLOGY
Faculty: ENGINEERING AND ARCHITECTURE
University: UNIVERSITY OF GHENT
Academic year: 2015-2016
Almost every activity or event produces sound patterns, making sound a valuable
source of information in the analysis of environments. As one of our senses, sound
directly contributes to the human perception of places. From sound information alone,
people are able to distinguish danger from safe situations and unusual events from
normal activity. This thesis develops a program that attempts to detect those (point)
anomalies that people would define as abnormal, as well as contextual anomalies,
which are less obvious to human perception.
Raw audio signals are not suitable as a direct input to a classifier. As a consequence,
the data is transformed into a representation that lends itself to successful learning, a
process known as feature extraction. This thesis focuses on unsupervised learning
because of its broad applicability. More specifically, it applies data exploration rather
than field knowledge for feature extraction. Spectrograms are treated as series of
meaningless numbers instead of as representations of audio. Gaussian Mixture Models
describe the data per minute, and their parameters define the features.
Whether or not supplemented with time features, those Gaussian features are the key
ingredient of the feature vectors. For classification, the feature vectors are clustered
again, using different techniques and dimensionalities depending on the type of
anomaly that is searched for (point, contextual or conceptual). Conceptual anomalies
are beyond the scope of this thesis, but for point anomalies as well as contextual
anomalies, Gaussian Mixture Models outperform intelligent K-means and form the
basic clustering technique for this research.
In parallel, a better-known, rather classical method is applied, based on spectral
features. Both techniques are compared on their computational intensity and
results, revealing the qualities of the newly designed technique based on data
exploration for feature extraction and unsupervised learning for classification.
Key words: Sound sensors - Anomaly detection - Data exploration - Unsupervised
Learning - Gaussian Mixture Models
Extended Abstract
Urban Data Mining applied to sound sensor networks
Talis Vertriest
Supervisor(s): Dick Botteldooren, Bert De Coensel
Abstract This thesis develops a system that detects unusual
situations in a timely manner, inspired by the corresponding
ability that humans exhibit quite effortlessly in their everyday
life. It uses audio information from sound sensors as its sole
input. The technique used for feature extraction is data
exploration. More specifically, a Gaussian Mixture Model
describes each one-minute timeframe without overlap, thereby
capturing temporal as well as spectral relations in one single
model. The parameters of these GMMs form the key ingredients
of the feature vectors. These can be augmented with additional
time features, depending on the type of anomaly that is sought.
Three types of anomalies are distinguished: unusual events,
unusual minutes and contextual anomalies, with the focus on the
first type. Classification is based on unsupervised machine
learning: a GMM classifier clusters the feature vectors, which,
once suspected to be anomalous, are subject to human
supervision for labelling. The characteristics of the labelled
samples are in turn learned by the system, reducing human
supervision over time and converging to a zero total error rate.
A linear program defines the optimal mutual error ratio.
Keywords Sound sensors, Data Exploration, Gaussian
Mixture Model, Anomaly Detection, Unsupervised Machine
Learning, Error Ratio
I. INTRODUCTION
Today, more than half of the world's population lives in
urban areas, highlighting the need to improve urban
environments. Among the human senses, hearing is second only
to vision, and it offers additional advantages in terms of
complexity and versatility, making it an excellent modality for
understanding urban and conceptual settings.
The main focus of audio classification systems is Speech
Recognition (SR), because of the many evident application
fields and its well-defined area of content. Speech
Recognition has traditionally relied on field knowledge for
feature extraction, a popular choice being the Mel Frequency
Cepstral Coefficients (MFCC).
In the past years, with the rapid growth of technology, the
awareness of the many useful applications of audio
classification grew and Environmental Sound Recognition
(ESR) gained attention. However, the hand-crafted features
that are successful for speech recognition perform poorly in
noisy environments, and the need for new techniques grew.
The advantage of sound event recognition is that only certain
events are sought, and these are usually provided as labelled
samples. The features to be extracted can thus be studied in
relation to their content. This is called semi-supervised
learning.
T. Vertriest is with the Information Technology Department, Ghent
University (UGent), Gent, Belgium. E-mail: [email protected] .
Environmental anomaly detection is a much broader field,
and its scope goes beyond that of the recognition-based
research discussed above. Speech recognition and sound event
recognition only search for certain real-time events and reject
all other unclassified data as useless, while in anomaly
detection the whole dataset is of interest, whether it includes a
certain event or not. The reason for this is the input data.
Urban soundscapes, in contrast to speech or specific audio
events, capture an almost unlimited variety of sounds with a
very high level of noise. Furthermore, many sources interfere
simultaneously. The audio input can thus no longer be
compared with a taxonomy of possible contents; in other
words, no labelled data is available anymore. Every signal is
part of the system and helps define whether new incoming
data are normal or abnormal, accumulating to Big Data.
II. APPROACHES FOR ANOMALY DETECTION IN BIG SOUND
DATA
A. Anomaly Types
Three types of anomalies are distinguished, based on what a
human supervisor would identify as anomalous. Unusual events
are single unusual events, occupying only a certain frequency
range in a certain time frame, e.g. a gunshot or a thunderstorm.
Unusual timespans (one minute) contain a strange
combination of possibly normal events, e.g. someone playing
music during a heavy storm. Contextual anomalies describe an
unusual moment rather than unusual content. Children playing
on the street, for example, occur frequently, but not in the
middle of the night. The focus of this research is on unusual
events, although the search engines for the other anomaly
types are also set up.
B. Methodology
With the focus of audio classification systems on
recognition, where a form of labelled data is available, the
traditional path for feature extraction is through field
knowledge, in the form of band filters and spectral and
temporal features. For classification, the provision of labelled
data allows supervised machine learning, or semi-supervised
machine learning when labelled data is replaced by a
well-informed supervisor.
For this research, no labelled data is available and, because
the audio data is irreversibly transformed into spectrograms by
a Discrete Fourier Transform (DFT), direct and accurate
supervision is inconceivable. Feature extraction is therefore
based on data exploration, and the classification process
happens through unsupervised machine learning. In parallel,
a classical approach based on spectral features is run for
comparison and to obtain better insight for decision making.
Figure 1: Classification Methodology
C. Related work
Existing research depends either on the extraction of
classically known low-level features, or labelled data is
available so that features can be derived by data exploration.
The major part of research even combines field knowledge for
feature extraction with supervised learning for classification. It
lies in human nature to apply knowledge rather than to dive
into the unknown, and furthermore, this approach performs
quite accurately for the fields that are most attractive and have
thus received most attention: speech and music. Ntalampiras et
al. [1], for example, rely on MFCCs and have labelled data
available. Their evident conclusion is that the approach only
works accurately for samples involving speech. Radhakrishnan
et al. [2] also rely on MFCCs for feature extraction and have
samples of 'normal' audio in their search for anomalies. This
research already belongs to the semi-supervised category
because it is unknown what is being looked for. Data
exploration for feature extraction, as well as unsupervised
classification, is still in its infancy. A combination of both
seems nonexistent, which is why this thesis is unique: it
combines data exploration with unsupervised classification.
The reason is very simple: standard low-level features have
been proven to be ineffective for environmental audio signals,
there is little to no additional information about what types of
features could be significant, there is no labelled data available
and there is no possibility to create labels through human
supervision. Interesting work for feature extraction is the idea
of scattering by Salamon and Bello, applied in two of their
papers [3] [4]. Although they start from the Mel spectrum, it is
an interesting starting point for discovering features. Cai, Lu et
al. [5] also scatter the feature vectors and apply statistical
parameters. For this thesis, instead of scattering low-level
features, the spectrograms could be scattered and described by
a more complex statistical model instead of basic parameters.
Classification inspiration comes from the same papers [4] [5]
[6] for their use of K-means, but especially the use of GMMs
is attractive for both feature extraction and classification,
inspired by Ntalampiras et al. [1].
III. PROPOSED MODEL: GMM
A. Feature Selection
By scattering different spectrograms in one single graph,
relations between data points in frequency as well as in time
become visible. A Gaussian Mixture Model of five components
describes 480 subsequent spectrograms (one minute), capturing
spectral and temporal features in one model. It exploits the
known fact that data points close in time or frequency do not
differ much from each other.
Figure 2 & 3: GMM per 480 spectrograms
Five parameters describe each of the five Gaussian
components: two are allocated to the mean and three to the
covariance. Each minute is thus described by a feature vector
of 25 values, a data reduction by a factor of roughly 600.
B. Classification
The classification technique depends on the type of
anomaly. First, unusual events are in fact unusual Gaussian
components. When plotting the mean values of all Gaussian
components, their histogram suggests a GMM classifier.
Different dimensionalities have been tested: 2D, clustering only
the mean values; 5D, clustering the mean values and
covariances; and 5D standardized, whereby means and
covariances are rescaled. The non-rescaled 5D method performs
best. The feature vectors are thus clustered by a 5D GMM
classifier, and Gaussian components with a low probability
density under the built model are flagged as anomalous.
Figure 4: Histogram of mean values of all Gaussian components
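The low-density alerting rule can be sketched in a simplified 2D form; the mixture parameters, candidate points and threshold below are invented for illustration (the thesis uses a 5D model fitted to real data):

```python
import math

def gauss2d_pdf(x, mean, cov):
    """Density of a 2D Gaussian with full covariance cov = [[a, b], [b, c]]."""
    a, b = cov[0]
    _, c = cov[1]
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Mahalanobis form via the explicit 2x2 inverse
    m = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * m) / (2 * math.pi * math.sqrt(det))

def mixture_pdf(x, weights, means, covs):
    return sum(w * gauss2d_pdf(x, mu, cv)
               for w, mu, cv in zip(weights, means, covs))

# Hypothetical 2-component model of "normal" Gaussian-component features
weights = [0.7, 0.3]
means = [(0.0, 0.0), (4.0, 4.0)]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.2], [0.2, 1.0]]]

threshold = 1e-4  # hypothetical alerting threshold
candidates = [(0.1, -0.2), (4.2, 3.9), (9.0, -6.0)]
alerts = [p for p in candidates
          if mixture_pdf(p, weights, means, covs) < threshold]
print(alerts)  # only the far-away point is flagged as anomalous
```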
Unusual minutes are detected by a statistical approach. Each
minute is described by a feature vector of five values: the
cluster labels of its constituent Gaussian components. Two
different measures can be calculated per feature vector, the
joint probability and the joint correlation, depending on an
assumption that is not scientifically proven. Joint probability
assumes the Gaussian components, or underlying audio events,
to be independent of each other, while joint correlation assumes
them to be dependent. If one of the two values is lower than a
prescribed threshold, the minute is flagged as anomalous.
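Under the independence assumption, the joint probability of a minute can be sketched as the product of the empirical cluster frequencies of its five component labels. The label history, minute and threshold below are invented for illustration:

```python
from collections import Counter
from math import prod

# Hypothetical cluster labels of all Gaussian components seen so far
history = [1, 1, 2, 3, 1, 2, 2, 1, 4, 1, 2, 3, 1, 2, 1, 1]
freq = Counter(history)
total = len(history)

def joint_probability(minute_labels):
    """P(minute) under independence: product of per-cluster frequencies."""
    return prod(freq[c] / total for c in minute_labels)

threshold = 1e-3  # hypothetical alerting threshold
minute = [1, 1, 2, 4, 4]  # cluster labels of one minute's 5 components
p = joint_probability(minute)
if p < threshold:
    print("minute flagged as anomalous:", p)
```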
Contextual anomalies add a continuous time feature
representing the hour of the day. The feature vectors per
minute consist of 12 values: the 2D cluster-centroid means of
the five constituent Gaussian components and two clock
coordinates for the time feature. Clustering is done with a 12D
GMM.
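The two clock coordinates can be sketched as a point on the unit circle of a 24-hour clock, so that 23:00 and 01:00 end up close together; this is a standard cyclic encoding, and the exact convention used in the thesis may differ:

```python
import math

def clock_coordinates(hour):
    """Map the hour of day (0-24, possibly fractional) onto the unit circle."""
    angle = 2 * math.pi * hour / 24.0
    return math.cos(angle), math.sin(angle)

# 23:00 and 01:00 are nearby on the circle, unlike their raw hour values
x1, y1 = clock_coordinates(23)
x2, y2 = clock_coordinates(1)
dist = math.hypot(x1 - x2, y1 - y2)
print(dist)  # a small distance, despite |23 - 1| = 22
```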
C. Anomaly Threshold Definition
In order to define one or more thresholds for anomaly
assignment, one must consider the types of errors occurring,
their impact and interaction. A linear program defines the
optimal threshold:
min  FN · H · C_fn + FP · C_fp − TP · R_tp

or, as functions of the threshold t:

min_t  f_FN(t) · H · C_fn + f_FP(t) · C_fp − f_TP(t) · R_tp

where:
FN: number of false negatives
FP: number of false positives
TP: number of true positives
H: factor for moral damage to human beings
C_fn: cost per false negative
C_fp: cost per false positive
R_tp: revenue per true positive
t: threshold
In order to solve this LP, the error counts as functions of the
threshold, as well as the precise cost per error, must be known.
As this is never the case in reality, different thresholds must be
tried in order to learn these parameters.
The equation above defines the optimal threshold for the
optimal ratio of errors, in a steady state. However, it does not
reduce the total error rate. Therefore, machine learning is
applied after the supervision of alerted anomalies by an
authorized person. Assuming that the opinion of the supervisor
is always correct, the alerted anomalies are grouped into true
positives and false positives. The characteristics of the false
positives are learned by another GMM classifier that is applied
to newly incoming alerted anomalies, converging to a zero
total error rate.
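Since the error functions and costs are unknown in practice, the objective can only be evaluated empirically over a grid of thresholds. A minimal sketch follows; all error counts and cost parameters are invented, and the true-positive revenue is taken with a negative sign since it offsets the costs:

```python
# Hypothetical error counts per threshold t: a looser threshold yields
# more alerts (more TP and FP, fewer FN).
scenarios = {          # t: (FN, FP, TP)
    0.001: (40, 10, 60),
    0.010: (20, 40, 80),
    0.100: (5, 200, 95),
}
H, C_fn, C_fp, R_tp = 10.0, 5.0, 1.0, 2.0  # invented cost parameters

def objective(fn, fp, tp):
    # Misclassification cost minus the revenue of true detections
    return fn * H * C_fn + fp * C_fp - tp * R_tp

best_t = min(scenarios, key=lambda t: objective(*scenarios[t]))
print(best_t)  # the threshold with the lowest total cost
```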
IV. CLASSICAL APPROACH: SPECTRAL FEATURES
A. Feature Extraction
Each spectrogram is described by nine low-level spectral
features: spectral energy, spectral centroid, spectral spread,
spectral roll-off point, spectral entropy, spectral kurtosis (or
flatness), spectral skewness, spectral slope and noisiness.
Downsampling by a factor of 1000 was necessary because of
the excessive data size and computational demands.
After standardizing the data, Principal Component Analysis
(PCA) decorrelates and recombines these features into pseudo-
independent features. Kaiser's stopping rule, also called the
eigenvalue-one criterion, is applied. With an extra margin of
one, three remaining features are considered optimal.
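As an illustration of such a low-level feature, the spectral centroid of a magnitude spectrum is its amplitude-weighted mean frequency. The bin frequencies and amplitudes below are invented:

```python
def spectral_centroid(freqs, magnitudes):
    """Amplitude-weighted mean frequency of a magnitude spectrum."""
    total = sum(magnitudes)
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total

# Invented 4-bin spectrum: most energy in the low bins
freqs = [125.0, 250.0, 500.0, 1000.0]
mags = [0.8, 0.6, 0.3, 0.1]
sc = spectral_centroid(freqs, mags)
print(sc)  # centroid pulled toward the low frequencies
```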
B. Classification
A 3D GMM classifier is applied to the feature vectors,
defining seven clusters. For a valuable comparison with the
newly proposed model, the feature vectors are divided into
timeframes of 480 spectrograms without overlap. Per minute,
the number of anomalous feature vectors is counted, and the
minutes are ordered by descending anomaly count.
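The per-minute ranking step can be sketched as follows; the labels below are invented, with 1 marking a feature vector flagged as anomalous:

```python
# Hypothetical per-spectrogram anomaly labels (1 = anomalous), 480 per minute
labels = [0] * 1440
labels[10] = labels[500] = labels[501] = labels[502] = 1  # invented anomalies

window = 480  # spectrograms per minute, no overlap
counts = [sum(labels[i:i + window]) for i in range(0, len(labels), window)]
ranking = sorted(range(len(counts)), key=lambda m: counts[m], reverse=True)
print(ranking)  # minutes ordered by descending anomaly count
```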
V. RESULTS NEW APPROACH
Unusual events cannot be visually divided into categories;
they rather appear to be misfits with respect to the number of
Gaussian components applied. All of these anomalies are
reprocessed and assigned their preferred number of
components, using the Akaike Information Criterion. The
newly created feature vectors are classified with the same
threshold as before. Roughly one sixth of the anomalies is still
considered anomalous, but after supervision, the newly
assigned Gaussian components turn out not to be representative
of the underlying anomalous scatter points, for which reason
the reprocessing is abandoned. The supervision of the
anomalies defined by five Gaussian components results in 52%
true positives and 48% false positives, a ratio that remains
similar even amongst the least anomalous samples at that
threshold. This suggests many false negatives. The threshold
must be broadened, and supervision is of crucial importance;
no hard-coded threshold can be set. In addition, machine
learning on the false positives will help convergence to a zero
error rate.
Unusual minutes still await supervision. Only 35.7% of the
anomalies based on joint probability coincide with those
defined by unusual events, and none of those based on joint
correlation do.
Contextual anomalies are hard to examine, because a deep
knowledge of the environment is necessary, as all the alerted
anomalies visually appear to be normal.
VI. CONCLUSION AND OUTLOOK
A. Conclusion
Acoustic information is a highly valuable source of
information for environmental context awareness. One of the
main difficulties of this thesis is that the transformed data
cannot be reversed to its original audio waves, which makes
acoustic supervision impossible. Another difficulty is that the
data is of significant size, calling for computationally efficient
techniques and creative thinking. The unsupervised approach
has the advantage that all results originate directly from the
input data; no other knowledge can be mistakenly applied.
The approach of Gaussian Mixture Models does not only
allow significant data reduction, it also captures both spectral
relations and temporal relations in a single model.
Looking at the results for the unusual events, the created
model fits the data very accurately, and where it does not, a
supervisor helps classify true positives and false positives. The
latter are input for another GMM classifier that is gradually
updated and, over time, not only replaces the human supervisor
but also reduces the total error rate.
Instead of applying a hard threshold on the nomination of
anomalies, a more intuitive and morally sound technique is
applied. The rate of false positives is initially set too high, and
a human supervisor assigns each anomaly a label: 'false
positive' or 'true positive'. The false positives are stored and
their characteristics are learned by the system. This self-
enhancement, also called machine learning, gradually
decreases the rate of false positives and increases the accuracy
of the system.
Besides the significant data reduction, the speed of the
program and the advantages of unsupervised learning, another
advantage of this research is that the developed technique can
be applied to any environment. The technique will learn the
location's specific features and increase accuracy levels over
time.
B. Outlook
The duration of a thesis project allows only a certain depth
of research, so evidently there is room for improvement.
Conceptual anomalies are not addressed in this research.
The GMMs only capture small-scale temporal relations,
between one minute and one day. However, the evolution of
the environment over time is also very important and could
reveal trends, seasonality, and so on.
Another interesting topic for future work is to build
taxonomies for different types of environments. Instead of
using a huge training set every time the program is applied to a
new environment, the knowledge of similar locations could be
used to converge faster and improve the level of anomaly
accuracy.
ACKNOWLEDGEMENTS
I would like to thank my thesis committee members,
Professor Dick Botteldooren and Professor Bert De Coensel
for their guidance and unrelenting support through this
process. Both have routinely shared their passion and
knowledge, which are to the great benefit of this thesis.
REFERENCES
[1] S. Ntalampiras, I. Potamitis, N. Fakotakis. (s.a.). On acoustic surveillance of hazardous situations. University of Patras, Greece: Department of Electrical and Computer Engineering.
[2] R. Radhakrishnan, A. Divakaran, P. Smaragdis. (2005). Audio Analysis for Surveillance Applications. Cambridge: Mitsubishi Electric Research Labs.
[3] R. Radhakrishnan, A. Divakaran, P. Smaragdis. (2005). Audio Analysis for Surveillance Applications. In IEEE WASPAA'05, pp. 158-161.
[4] J. Salamon, J.P. Bello. (s.a.). Feature Learning with Deep Scattering for Urban Sound Analysis. New York University: Center for Urban Science and Progress.
[5] R. Cai, L. Lu, A. Hanjalic. (s.a.). Unsupervised Content Discovery in Composite Audio. Delft University of Technology: Department of Mediamatics; Tsinghua University: Department of Computer Science.
[6] J. Salamon, J.P. Bello. (s.a.). Unsupervised Feature Learning for Urban Sound Classification. New York University: Center for Urban Science and Progress, Music and Audio Research Laboratory.
Table of contents
Preface II
Abstract III
Extended Abstract IV
Table of contents VIII
List of figures XI
List of Matlab Graphs XII
1. Introduction 1
1.1. Motivation 2
1.2. Challenges of Environmental Anomaly Detection 5
1.2.1. Big Data 5
1.2.2. Taxonomy 6
2. Approaches for Anomaly Detection in Big Sound Data 9
2.1. Concept Introduction 9
2.1.1. Input Data types 9
2.1.2. Anomaly types 10
2.1.3. Methodology 12
2.2. Feature extraction 13
2.2.1. Field Knowledge 13
Low-level spectral features 14
Low-level harmonic features 15
Low-level perceptual features 16
Mid-level Temporal Features 17
2.2.2. Data exploration 19
2.3. Classification 19
2.3.1. Supervised 20
2.3.2. Unsupervised 22
2.4. Related work 26
2.4.1. Supervised 26
2.4.2. Unsupervised 28
2.4.3. Conclusions 30
3. Proposed Model New Approach: GMM. 32
3.1. Concept 32
3.2. Data Preparation 32
3.2.1. Missing Data 32
3.2.2. Reorganization 32
3.3. Programming Language 33
3.3.1. Efficiency 33
3.4. Feature Selection 34
3.4.1. Failed try-out 34
3.4.2. Gaussian Mixture Model per minute 36
3.5. Classification 39
3.5.1. Point anomaly: Unusual event 39
2D GMM Clustering 39
5D GMM Clustering 43
5D GMM Clustering (Standardized) 43
2D iK-means Clustering 45
5D iK-means Clustering 46
Conclusion Clustering techniques 46
3.5.2. Point anomaly: Unusual minute 47
Joint Probability 47
Joint Correlation 47
3.5.3. Contextual anomaly 48
3.5.4. Anomaly Threshold Definition 49
Possible System Outcomes 49
Human defined anomalies 49
Threshold 51
Machine Learning Threshold 53
4. Classical approach: Spectral Features 55
4.1. Feature Extraction 55
4.1.1. Spectral Features 55
4.1.2. Principal Component Analysis (PCA) 55
4.2. Classification 57
4.2.1. Point anomaly 57
Gaussian Mixture Model 58
Temporal Smoothing 58
5. Results new approach 58
5.1. Unusual events 59
5.1.1. GMM 2D 59
Anomaly types 59
Model shortcomings 63
Threshold 66
5.1.2. GMM 5D 66
Types of anomalies 67
Model shortcomings 70
Reconstruction of GMM on anomalies 72
Threshold 76
5.1.3. GMM 5D standardized 76
5.1.4. 2D vs. 5D vs. 5D standardized 77
5.2. Unusual minutes 77
5.2.1. Joint Probability 78
5.2.2. Joint Correlation 78
5.3. Contextual anomalies 79
6. Results Classical Approach 80
6.1. Point anomalies 80
7. Model Extension 81
7.1. Feature Extraction 81
7.2. Classification 81
8. Conclusion and Outlook 82
8.1. Conclusion 82
8.2. Outlook 83
Bibliography 85
A. Appendix 1
A.1. Features 1
A.1.1. Spectral Centroid (SC) 1
A.1.2. Spectral Spread (SS) 2
A.1.3. Spectral Roll-off Point (SRP) 2
A.1.4. Spectral Entropy (SE) 2
A.1.5. Spectral Kurtosis or flatness 3
A.1.6. Mel Frequency Cepstral Coefficients (MFCC) 3
A.1.7. Bark bands 5
A.1.8. Zero Crossing Rate (ZCR) 5
A.1.9. Spectral Flux (SF) 6
A.1.10. Short Time Energy (STE) 7
A.1.11. Temporal Centroid (TC) 7
A.1.12. Energy Entropy (EE) 7
A.1.13. Autocorrelation (AC) 8
A.1.14. Root Mean Square (RMS) 9
A.2. Matlab Code 10
A.2.1. Workspace_Generator 10
A.2.2. Mid-level_GMM_generator 11
A.2.3. Cluster Gaussian Components 5D 14
A.2.4. Define anomalies based on clustering 15
A.2.5. Plot minutes of anomalous Gaussian components 17
A.2.6. Cluster based on 5D iKmeans 20
A.2.7. Anomalies based on 5D iKmeans 21
A.2.8. Clustering defined by spectral features 23
A.2.9. Anomalies based on spectral feature clustering 25
A.2.10. Unusual minutes Joint Probability 27
A.2.11. Unusual minutes Joint Correlation 28
A.2.12. Plot Unusual minutes based on Joint Probability 30
A.2.13. Plot Unusual minutes based on Joint Correlation 35
A.2.14. Cluster Contextual feature vectors 40
A.2.15. Define contextual anomalies based on clustering 42
A.2.16. Plot spectrograms contextual anomalies 43
List of figures
Figure 1: Point anomaly 11
Figure 2: Contextual anomaly 12
Figure 3: Classification methodology 13
Figure 4: Audio Features 14
Figure 5: MFCC extraction process 17
Figure 6: 1-Dimensional GMM 34
Figure 7: 1-Dimensional Gaussian curve fitting 35
Figure 8: 24-hour clock 48
Figure 9. MFCC extraction process 4
List of Matlab Graphs
Matlab graph 1: GMM per spectrogram 36
Matlab graph 2: GMM on a 1 minute spectrogram 37
Matlab graph 3: GMM of one scattered minute 38
Matlab graph 4: Mean values of all Gaussian components 40
Matlab graph 5: Histogram of mean values of all Gaussian components 41
Matlab graph 6: GMM of mean values of all Gaussian components (2D) 42
Matlab graph 7: GMM of mean values of all Gaussian components (3D) 43
Matlab graph 8: 2D presentation of 5D GMM clustering 43
Matlab graph 9: 2D presentation of 5D GMM clustering (standardized) 45
Matlab graph 10: K-means defined clusters 46
Matlab graph 11: mean values of 1257 anomalies defined by 2D GMM 59
Matlab graph 12: Anomaly type 1: high values at high frequencies 60
Matlab graph 13: Anomaly type 2: High amplitude spread at high frequency 61
Matlab graph 14: High amplitudes at low frequencies 62
Matlab graph 15: Anomaly type 4: Low total amplitude variability 63
Matlab graph 16: Event switch during minute 64
Matlab graph 17: Event switch during minute 65
Matlab graph 18: Inaccurate data fitting 65
Matlab graph 19: Inaccurate data fitting 66
Matlab graph 20: Mean values of 1257 anomalies defined by 5D GMM 67
Matlab graph 21: Anomaly type 1: Microphone failure or resumption 68
Matlab graph 22: Anomaly type 2 69
Matlab graph 23: Anomaly type 2 70
Matlab graph 24: Anomaly type 2 70
Matlab graph 25: Anomaly based on 9 Gaussian components 71
Matlab graph 26: Anomaly based on 9 Gaussian components 72
Matlab graph 27: Anomaly based on 9 Gaussian components 72
Matlab graph 28: anomalous Gaussian out of 5 73
Matlab graph 29: anomalous Gaussian out of 9 74
Matlab graph 30: anomalous Gaussian out of 5 75
Matlab graph 31: anomalous Gaussian out of 9 76
Matlab graph 32: Mean values of 1257 anomalies defined by 5D standardized GMM clustering 77
Matlab graph 33: Unusual minute based on joint probability 78
Matlab graph 34: Unusual minute based on joint correlation 79
Matlab graph 35: unusual minute by their time context 80
1. Introduction
The world is urbanizing rapidly, with more than half of the population now living in
cities. Improving urban environments for the well-being of the increasing number of
urban citizens is becoming one of the most important challenges of the 21st century.
Many sources, such as microphones, cameras, gyroscopes, accelerometers, luminance
sensors, Global Positioning System (GPS) receivers, etc., are available for sensing and
capturing various types of environmental information. Auditory signals are chosen for
a number of reasons.
reasons. Firstly, among the human senses, hearing is second only to vision in
recognizing social and conceptual settings. This is due partly to the richness in
information of audio signals. Secondly, cheap but practical microphones can be
embedded in almost all types of places and devices. Thirdly, auditory-based context
recognition or classification consumes significantly fewer computing resources than
camera-based systems. In addition, unlike visual sources of information such as
camera and video, audio signals cannot be obscured by solid objects and are
multidirectional, i.e., they can be received from any direction.
This research focuses on urban audio signals, which provide information about the
context of the environment beyond that provided by speech. Nonetheless,
environmental sound research is still in its infancy and traditionally overshadowed by
the popular field of automatic speech recognition (ASR), but recent growth of big data
and urban data analysis has opened up a range of novel application areas, including
acoustic surveillance, environmental context detection and healthcare applications.
The goal of this research is to build a system that detects unusual situations in a timely
manner, such as hazardous situations, demonstrations, strikes, etc., using incoming
audio as its only input, inspired by the corresponding ability that humans exhibit quite
effortlessly in their everyday life. Furthermore, it lays the basis for understanding the
surrounding environment on a larger time scale in order to detect trends, seasonality
and the development of cities. Such a system should be characterized by accuracy and
flexibility, meaning that with slight alterations the system can work properly in
different kinds of environments.
1.1. Motivation
Anomaly detection in big sensor data is an interesting and growing field, applicable
to a broad range of sensors. Whether they monitor electrical outlets, water pipes,
telecommunications, stock exchange rates, or feed cameras, microphones or one of
many other systems, in all these areas it is important to detect when faults or
anomalies occur. They all follow the template of large amounts of data arriving very
frequently, calling for an efficient and fast approach to process this Big Data in order
to provide real-time support. The applications of anomaly detection are endless,
varying from the simple detection of faulty sensors, for example electrical and water
sensors, to more complicated detection, for example unusual behaviour of stock
exchange rates. Furthermore, organizations that supply sensing services compete in
an industry that has seen huge growth in recent years. One way these organizations
can gain market share is by providing more insightful services beyond standard
sensing. These could lead to different benefits such as energy saving, prediction of
financial portfolios, surveillance, etc. Among the different types of input data, audio
signals in particular are a very interesting source of information with many
advantages. First of all, microphones can easily be installed in all kinds of places, as
they require a minimum of space and power supply. Secondly, they are cheap and
easily obtainable. Thirdly, compared to cameras, the input data is much smaller, and
for many applications it provides a broader picture of the environment, as
microphones are multidirectional. Where cameras only collect information from one
direction at a time and can be defeated by covering them, sound sensors record the
interference of everything that happens close by, and the signals are difficult to
suppress.
Sonic records are always an interference of different events coming from different
sources. At their very basis, those sources can be divided into two groups: sound and
noise. Although both are mixtures of sound waves at different frequencies, sound
waves are considered to be ordered, while noise signals are considered to be
disordered. In other words, the mixture of sound waves can be easily separated into
individual frequencies, with some being more dominant than others, while on the
other hand, noise contains all possible frequencies with no presence of a dominant
frequency. This fundamental differentiation translates into different types of sound
classification systems:
Speech Recognition (SR) is the most developed research area, because sound signals
such as speech and harmonic instruments are easy to decompose into a limited
number of components. Furthermore, the microphone is usually pointed at a limited
number of sources at an acceptable distance, reducing the level of interference. As a
result, each spectrogram relates directly to its content, and field knowledge can be
used for feature extraction. Mel Frequency Cepstral Coefficients (MFCC) are the
dominant features used for speech recognition, often complemented with other
spectral features. Furthermore, speech consists of a limited number of words, spoken
in a limited number of languages. A dictionary of examples serves as training data
and enables supervised learning. Speech recognition is thus well suited to supervised
learning and therefore overshadows the more complicated research on 'noise' signals
such as environmental sonic signals.
Environmental Sound Recognition has gained a lot of attention in recent years with
the growing awareness of its many useful applications. Unlike speech or music
signals, environmental acoustic signals are difficult to model due to their highly
unpredictable nature. They have a much wider variety in frequency content, and thus
a broad noise-like spectrum. Furthermore, environmental signals are usually recorded
from a considerable distance. Many of these unstructured sources interfere, which
makes it challenging to select the features that best represent the data. Feature
selection for environmental audio signals is therefore a major constraint on the
accuracy of the system. Spectral features have high recognition accuracy in clean
conditions, but they perform poorly in unstructured environments such as urban
soundscapes. Therefore, depending on the environment and the knowledge that can be
extracted from the labeled data, either field knowledge or data exploration is chosen
for feature selection. For classification, sound event recognition can still make use of
supervised learning, where different samples of 'known' events serve as training data.
Different recordings of gunshots, for example, can form the basis for the detection of
gunshots in urban areas. Despite its usefulness, only those events determined in
advance can be recognised, leaving a lot of information undiscovered.
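The gunshot example can be sketched as a tiny supervised classifier. The two-dimensional feature vectors below are synthetic stand-ins, not real acoustic features, and scikit-learn's SVC is one arbitrary choice of classifier:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Invented training data: "gunshot" frames cluster in one region of
# feature space, "background" frames in another.
gunshot = rng.normal(loc=[5.0, 8.0], scale=0.5, size=(50, 2))
background = rng.normal(loc=[1.0, 2.0], scale=0.5, size=(50, 2))

X = np.vstack([gunshot, background])
y = np.array([1] * 50 + [0] * 50)      # labels: 1 = gunshot, 0 = background

clf = SVC(kernel="rbf").fit(X, y)      # supervised training on labelled data

print(clf.predict([[4.8, 7.9]]))       # → [1]  (recognised as a gunshot)
print(clf.predict([[1.1, 2.2]]))       # → [0]  (recognised as background)
```

Any frame that belongs to neither class is still forced into one of them, which is precisely the limitation noted above: only events determined in advance can be recognised.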
Environmental Anomaly Detection defines the scope of this thesis. It is a very broad
research area whose capabilities go beyond those of speech recognition and
environmental event recognition. Note that the title does not contain the word
'Recognition'. While speech recognition and sound event recognition only search for
certain real-time events, rejecting all other unclassified data as useless, they do not
cope with the full spectrum of big data. In anomaly detection, the whole dataset is of
interest, whether it includes a certain event or not. No labelled data are available
anymore; every signal is part of the system and helps to define whether new
incoming data are normal or abnormal. These 'Big Data' are one of the main
challenges of this thesis.
Another difficulty is that the original sound signals are transformed into spectrograms
and there is no ability to reconstruct the original sound waves. This makes intuitive
human supervision based on the sense of hearing impossible. As stated before, the
environment of this research is highly noisy and has a broad spectrum of sources.
Human interpretation is the best taxonomy available so far; without this option, new
techniques for feature extraction and classification must be developed:
Field knowledge for feature extraction is only possible with a detailed taxonomy of
urban sounds. Because no environment is ever the same and all are constantly
changing over time, decent taxonomies for environmental sound signals are lacking,
especially for urban environments. Furthermore, my personal background in audio
science is too limited to make reliable decisions based on spectral features. For these
reasons, Data Exploration was chosen rather than Field Knowledge.
Features, however they were extracted, become the new data input for classification.
Supervised learning uses the features of known or labelled data to search for
relationships and focuses on those, rejecting all unrelated features. Unsupervised
learning, however, does not need labelled training data and searches for patterns
based on correlations within the data and their frequency distributions. This reveals a
totally new range of possibilities and applications, because by clustering all data
instead of only labelled data, the system searches beyond the borders of prescribed
events. In urban environments, for example, not only gunshots, screaming people and
sirens are useful to detect; unexpected anomalies such as storms or demonstrations
might also be of interest to detect and report at an early stage. Unfortunately, there is
no such thing as a free lunch; unsupervised learning involves many challenges and
difficulties, which are described in the next part.
1.2. Challenges of Environmental Anomaly
Detection
1.2.1. Big Data
It is still difficult for a machine listening system to demonstrate the same capabilities
as human listeners for the analysis of sounds other than speech and music, generally
referred to as environmental sounds. Realistic environments consist of multiple and
simultaneous sources in reverberant conditions. Typical tasks in audio scene analysis
include scene classification and event detection and recognition, but not research on a
larger time scale. Acquiring large-scale labeled databases is still problematic, and
such databases are most likely collected under heterogeneous sets of acoustic
conditions, encompassing only limited variations and qualities. Consequently, the
quality of the results depends on the available training data set, restricting the
strength of the research. To date, most of the methods developed are probably not
tractable on big data, so there is a need for new approaches that are efficient on large
datasets, or Big Data.
Big Data is an umbrella term for massive and complex datasets composed of a variety
of data structures. In 2001, Gartner analyst Doug Laney defined data growth
challenges and opportunities as being three-dimensional: volume, velocity, and
variety. Volume refers to the growing amount of data stored, from terabytes to
petabytes and beyond. With the growing number of applications of acoustic sensing,
the urge to cope with higher levels of precision on larger time scales grows, resulting
in exponentially increasing amounts of data. Besides the difficulty of processing these
huge amounts of data, volume also carries one of the biggest advantages of big data: a
paradigm shift in the types of algorithms used, from computationally expensive
algorithms to computationally inexpensive ones. The inexpensive algorithms may
have a much higher training error. However, when normalized over the entire large
dataset, this higher training error converges to a small prediction error, i.e. the error
accumulated when predicting new values with a trained predictor. The prediction
error of the inexpensive algorithm falls within ranges similar to those of the
computationally more expensive algorithm applied to a much smaller time frame. Big
Data thus allows cheaper algorithms to achieve similar results, and this thesis makes
use of this advantage. Velocity can be divided into two different aspects: the velocity of
incoming data and the processing speed. The rate of incoming data has some
important characteristics: the smaller the interarrival times and the smaller their
variability, the better the environment can be understood and the more precise the
statements that can be made about its evolution. The assumption that (acoustic)
scenes vary only slightly per small time frame makes it possible to recognize and
define unusual changes by gathering data over certain time spans. The second aspect
is the requirement to process the big data at real-time speed, or faster. When the
system cannot meet real-time processing velocity, the waiting time grows without
bound and the system becomes useless. But once again, the size of the data is
inversely related to the accuracy of the results, and a good balance between these two
desirable but incompatible features must be achieved. This thesis drastically reduces
the input data during the feature selection process, making use of Gaussian Mixture
Models. This enables the system to operate fast and successfully, even with the
accelerating expansion of input data. Variety refers to the component of big data that
involves the requirement to include additional metadata in the form of tables, other
databases, photos, web-retrieved data, social media and primitive datatypes such as
integers, floats and characters. Within the acoustic sensing domain, other sensors can
provide a variety of additional information that can enrich the research. In this thesis,
however, metadata are not included, directly pointing out an interesting topic for
further research. Spectrograms and time are the sole contributors to the input data.
In recent years, additional "V's" have been added to this 3V's model to cope with the
evolving requirements for addressing and understanding Big Data, including veracity,
value, and visualization. Currently, businesses are acutely aware that the huge, ever-
growing volumes of data can be used to generate valuable new business opportunities
through Big Data processing and analysis. The challenge of this research is to develop
a fast and accurate system to detect anomalies based on environmental acoustic
signals.
1.2.2. Taxonomy
The ultimate but utopian goal of this thesis, and of any acoustic classification
research, is a precise reconstruction of the original environment, by decomposing the
input signal and allocating each component to its source. Unfortunately, there are
some constraints on audio signals in general, as well as on their recordings. The first
question is whether a single auditory sensor, irrespective of its recording quality, can
provide enough information for scene reconstruction. Secondly, are the best sonic
sensors available today sensitive enough, or are a higher frequency range and
precision required to allow better research? Thirdly, is it possible to develop one
general, multi-purpose system, or does every environment need customized research?
To even understand these questions, a basic understanding of sound is required.
Sound is a vibration that is transmitted as longitudinal and transverse waves,
depending on the transmission medium. Without a medium, a sound wave cannot
propagate and simply does not exist. Sound waves that travel through gases, plasma
and liquids are longitudinal waves. Through solids, however, sound can be
transmitted as both longitudinal waves and transverse waves. In this thesis project,
only longitudinal waves will be taken into account because the medium that surrounds
the auditory sensors, as well as our ears, is air. Although many sound waves will have
travelled through solids before as transverse waves, as soon as they leave the solid
into the air, they are converted to longitudinal waves. To interpret sound waves, the
receiver must have something that can vibrate in order to translate the vibrating
medium. That can be our eardrums, or auditory sensors that convert the sound waves
into an electrical current that holds all the information.
Simple sound waves result from simple harmonic motion (SHM) and are made up of
a single frequency component. They are also called notes or harmonics, can be
represented as a sinusoidal waveform, i.e. a sine or a cosine wave, and are
characterised by their amplitude and frequency. Waves can move through each other,
which means that they can be in the same place at the same time; when this happens,
the amplitudes of the waves simply add together and form complex sound waves.
Imagine two waves of the same frequency. If the compressions and the rarefactions of
the two waves line up, they strengthen each other and create a wave with a higher
intensity. This type of interference is known as constructive. When the compressions
and rarefactions are out of phase, their interaction creates a wave with a dampened
intensity, or even silence. This is destructive interference. The number of interfering
source waves can be infinite and their frequencies can differ, each combination
creating a different complex wave pattern, either periodic or aperiodic. However, the
vast majority of sounds experienced in nature and daily life are aperiodic.
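Constructive and destructive interference can be checked numerically; this minimal NumPy sketch uses an arbitrarily chosen 5 Hz sine:

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)
wave = np.sin(2 * np.pi * 5 * t)                      # a 5 Hz sine wave

# In phase: compressions and rarefactions line up and reinforce each other.
constructive = wave + np.sin(2 * np.pi * 5 * t)
# Half a period out of phase: the two waves cancel each other.
destructive = wave + np.sin(2 * np.pi * 5 * t + np.pi)

print(round(constructive.max(), 3))   # → 2.0  (doubled amplitude)
print(round(destructive.max(), 3))    # → 0.0  (silence)
```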
Aperiodicity means that successive disturbances are not equally spaced in time and
are not of constant shape either. In other words, aperiodic waves do not have a regular
repeating pattern and are perceived as noises. They do not have a harmonic basis, i.e.
the component frequencies are not integer multiples of a fundamental frequency or in
other words, the component frequencies of which they are made up, are not related to
each other. Transient signals are sudden pressure fluctuations that are not sustained or
repeated over time. Examples are the consonants in speech, a hammer hitting a table,
the slamming of a door and the popping of a balloon.
Urban environments are very different from other environments such as audiences,
households, theatres, schools, etc., in the sense that the data input consists almost
entirely of transient signals and is actually perceived as noise. Very few periodic
waves, like human voices or acoustic instruments, occur, making it difficult to build a
common vocabulary. Furthermore, urban landscapes are subject to a large set of
simultaneously occurring events, making it difficult to label urban audio data.
Previous work has focused on audio tracks carefully extracted from movies or
television, recorded in specific environments such as elevators, households, audiences
and office spaces. The classification of sounds into semantic groups may thus vary
from study to study, making it hard to compare results. These major hindrances,
together with my limited personal knowledge of audio signals, were the motivation
for a totally different approach: Unsupervised Learning. By treating the input data as
random numbers, without reference to their spectral characteristics, taxonomy
becomes redundant. Frequency of occurrence now becomes the basis for anomaly
detection.
2. Approaches for Anomaly
Detection in Big Sound Data
This chapter describes the different techniques and approaches for anomaly detection
in Big Sound Data. Although the input data for this research come from an urban
environment, a more general overview is given, because the methods are similar and
the goal of this thesis is to develop a system that can be applied to different types of
environments.
2.1. Concept Introduction
2.1.1. Input Data types
One of the major considerations in using an anomaly detection algorithm is the type
of records: categorical and/or numerical.
Categorical data represent characteristics such as the weather (raining, cloudy,
sunny), the microphone's brand or the type of day (public holiday, weekend,
weekday). Categorical data can take on numerical values (such as "1" indicating that
it is raining, "2" for cloudy and "3" for sunny), but those numbers have no
mathematical meaning; they cannot, for example, be added together.
Numerical data can be divided into continuous, discrete and binary data. Continuous
data represent measures; their possible values cannot be counted and can only be
described using intervals on the real number line. The original audio signals are a
continuous representation of amplitude over time. The Fourier Transform decomposes
the signal into the frequencies it is made of; the resulting representation of amplitude
versus frequency is called a spectrogram. For ease of record keeping, and to balance
the size of the data against the quality of the information, the original sound waves
have been transformed using a Discrete Fourier Transform, rounding the continuous
data to discrete values.
For this research specifically, the audio spectrum ranges from 20 Hz to 20 kHz and is
divided into 31 1/3-octave bands. To define the centre frequencies, the 19th 1/3-
octave band is taken as the centre band, with its centre frequency set to around
1000 Hz. All lower 1/3-octave centre frequencies can then be derived from each other
using the formula f_(n-1) = f_n / 2^(1/3), and conversely, all higher centre
frequencies using f_(n+1) = 2^(1/3) * f_n.
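The recursion above generates all 31 centre frequencies once band 19 is anchored at 1000 Hz; a minimal sketch:

```python
# Centre frequencies of the 31 one-third-octave bands, derived from the
# recursion above with band 19 anchored at exactly 1000 Hz.
ratio = 2 ** (1 / 3)                        # one-third-octave step, ~1.26
centres = [1000.0 * ratio ** (n - 19) for n in range(1, 32)]

print(len(centres))                         # → 31
print(round(centres[18], 1))                # band 19 → 1000.0 Hz
print(round(centres[19] / centres[18], 3))  # adjacent-band ratio → 1.26
```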
2.1.2. Anomaly types
Another important consideration for the algorithms is the relationships within the
data itself. Many applications assume that no relationships exist between the records;
these are generally considered point anomaly scenarios. Other applications assume
that relationships may exist and, based on the type of relationship, the anomalies are
referred to as contextual or conceptual anomalies. For anomaly detection, algorithms
rely on the assumption that anomalies are far less frequent than normal records in the
dataset. Furthermore, an underlying assumption in most of these algorithms is that
anomalies are dynamic in behaviour, meaning that it is very difficult to determine all
the types of anomalies for the entire dataset, and for the future.
A point anomaly is usually a single record that is considered abnormal with respect to
the other records in the dataset; in other words, an outlier. However, environmental
sound signals are noisy and thus highly volatile. A single outlying sample therefore
does not intrinsically correspond to a real-life anomalous situation, and filtering such
samples out would not reveal any useful information. Instead, a period of those noisy
signals can be gathered and described by a mathematical model. Depending on how
extreme these outliers are and how frequently they occurred in that time frame, they
will only influence the model when they are significant. The models of those time
frames can then be studied, and those outliers will be seen as point anomalies.
Section 3.4.2 describes the formation of these models and the choices made regarding
components and time windows. Figure 1 shows in green a normal minute described
by a Gaussian Mixture Model consisting of 5 components. The red Gaussian
component does not occur frequently in time and is thereby considered a point
anomaly. The cause behind such unusual Gaussian components will be discussed in
the results section; from now on, when anomalies are discussed, these modelled
components are meant rather than single records.
Figure 1: Point anomaly
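The per-minute modelling just described can be sketched with scikit-learn's GaussianMixture. This toy uses two components on synthetic one-dimensional levels (the thesis fits five components to spectrogram data), and the 10 % weight threshold is an invented value:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-in for one minute of band levels: mostly "normal" frames around
# 60 dB, plus a handful of rare high-level frames.
minute = np.concatenate([rng.normal(60, 2, 570),
                         rng.normal(90, 1, 30)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(minute)

# The component with the smallest mixture weight covers the rare frames;
# a component whose weight stays below the threshold is flagged.
rare = int(np.argmin(gmm.weights_))
print(gmm.weights_[rare] < 0.1)       # → True  (30/600 = 5 % of the frames)
print(round(gmm.means_[rare, 0]))     # → 90
```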
A contextual anomaly is an anomalous record (Gaussian component) within a specific
context. For example, a Gaussian component, or a combination of them, may only be
considered anomalous when evaluated in the context of temporal and spatial
information: a reading of certain values may be normal during a random night, but
anomalous during daytime. Pictorially this is shown in Figure 2. Note that the figure
only represents amplitude as a function of time and is a strong simplification to
visualize the concept.
Figure 2: Contextual anomaly
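The day/night reading can be made concrete with a toy context check; the expected levels and tolerance below are invented purely for illustration:

```python
# Hypothetical per-context sound-level profiles: (expected level in dB,
# tolerated deviation). The numbers are illustrative, not measured.
profiles = {"night": (30.0, 5.0),
            "day":   (65.0, 5.0)}

def is_contextual_anomaly(level_db, context):
    """Flag a reading that deviates too far from its context's profile."""
    expected, tolerance = profiles[context]
    return abs(level_db - expected) > tolerance

print(is_contextual_anomaly(32.0, "night"))  # → False (normal at night)
print(is_contextual_anomaly(32.0, "day"))    # → True  (anomalous by day)
```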
A conceptual anomaly is a record or component that is anomalous with respect to the
entire dataset. More specifically, this means that the record may not be considered
anomalous alone, or in a temporal perspective; however, when combined within a
collection of sequential records, it may be anomalous when it does not behave
according to the derived patterns or frequency distributions. Due to a lack of time,
this type of anomaly is beyond the scope of this thesis and points to an interesting
topic for future research.
2.1.3. Methodology
Like many other pattern classification tasks, audio classification is made up of three
fundamental components: sensing, feature extraction and classification. The sensing
section is described above; it was decided beforehand and determines the input data,
and thus the starting point, for this research. The main challenge is the choice of route
and the development of the algorithms, in order to obtain interesting and reliable
results. Figure 3 shows the optional approaches for feature extraction as well as
classification. Each approach is considered based on related work. The results and
possible applications of the different approaches and their algorithms form the basis
for the approach chosen for this specific research.
Figure 3: Classification methodology
2.2. Feature extraction
2.2.1. Field Knowledge
Field knowledge relates the spectrogram to its original audio content, based on
knowledge of how audio features reflect acoustic events. A systematic taxonomy of
all audio features is outside the scope of this thesis, but a distinction can nevertheless
be made from a time viewpoint: low-level features and mid-level features. Low-level
features are also called spectral or frequency features, while mid-level features are
also called temporal features. Cepstral features are not further distinguished in this
work and are counted among the spectral features. For this thesis, the raw audio
signals have been converted to spectrograms that serve as input data; feature
extraction therefore starts from the spectrogram instead of the raw signal. The
possible features to extract are numerous. The following list is therefore far from
complete, but covers the most frequently used features of audio signals. For a detailed
description and the mathematical equation of each of them, the reader is referred to
Appendix A.
Figure 4: Audio Features
Low-level features are spectrogram characteristics and are derived for each single
sample. They can discover point anomalies without any relation to another sample.
Low-level features can be extracted straight from the Short Time Fourier Transform
(spectrogram), or after the application of a harmonic or perceptual model.
Low-level spectral features
Low-level spectral features are computed instantaneously from the Short Time
Fourier Transform (STFT) of the signal. The frequency domain reveals the spectral
distribution of a signal: for each frequency (or frequency band/bin), the domain
provides the corresponding magnitude and phase. Since phase variation has little
effect on the sound we hear, features that evaluate the phase information are usually
ignored. Consequently, we focus on features that capture basic properties of the
spectral distribution of the audio signal. The references for the low-level spectral
features are: [1][2][3][4][5][6].
- The Spectral Energy (SpE) equals the energy of the signal: the sum of the
power of each amplitude value.
- The Spectral Centroid (SC) represents the "balancing point", or midpoint, of
the spectral power distribution of a signal. It is related to the brightness of a
sound: the higher the centroid, the brighter (more high-frequency) the sound.
A.1.1
- The Spectral Spread (SS) is the second central moment of the spectrum. It
signifies whether the power spectrum is concentrated around the centroid or
spread out over the spectrum. A.1.2
- The Spectral Roll-off Point (SRP) is the N% percentile of the power spectral
distribution, where N is usually 85% or 95%. It is the frequency below which
N% of the magnitude distribution is concentrated. A.1.3
- The Spectral Entropy (SE) defines the complexity of the spectrum: the lower
the value, the more 'ordered' or linear the spectrum. A.1.4
- The Spectral Kurtosis, or flatness, gives a measure of the flatness of a
distribution around its mean value: it indicates the peakedness or flatness of
the distribution. A.1.5
- The Spectral Skewness is a measure of the asymmetry of the data around the
sample mean.
- The Spectral Slope represents the rate of decrease of the spectral amplitude. It
is computed by linear regression of the spectral amplitude.
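Several of the descriptors above can be computed directly from a single spectrum frame. The sketch below follows the verbal definitions given here (not necessarily the exact equations of Appendix A) and uses a trivial single-bin test spectrum:

```python
import numpy as np

def spectral_features(magnitudes, freqs):
    """Centroid, spread, 85 % roll-off and entropy of one spectrum frame."""
    p = magnitudes ** 2                # power per frequency bin
    p_norm = p / p.sum()               # normalised power distribution

    centroid = (freqs * p_norm).sum()                            # balance point
    spread = np.sqrt(((freqs - centroid) ** 2 * p_norm).sum())   # 2nd moment
    rolloff = freqs[np.searchsorted(np.cumsum(p_norm), 0.85)]    # 85 % point
    entropy = -(p_norm * np.log2(p_norm + 1e-12)).sum()          # complexity
    return centroid, spread, rolloff, entropy

# Test frame: all energy concentrated in the 200 Hz bin.
freqs = np.array([100.0, 200.0, 300.0, 400.0])
mags = np.array([0.0, 1.0, 0.0, 0.0])

c, s, r, e = spectral_features(mags, freqs)
print(c, s, r)   # → 200.0 0.0 200.0
```

With all energy in one bin, the centroid sits on that bin, the spread is zero and the roll-off point coincides with the centroid, as the definitions predict.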
Low-level harmonic features
Low-level harmonic features are derived after the application of a harmonic model. At
each time frame, the peaks of the Short Time Fourier Transform (STFT) of the
windowed signal segment are estimated. Peaks close to multiples of the fundamental
frequency at this frame are then chosen in order to estimate the sinusoidal harmonic
frequency and amplitude [7].
- The fundamental frequency of a harmonic signal is the frequency whose
integer multiples best explain the content of the signal spectrum. The
fundamental frequency has been computed using the maximum likelihood
algorithm [8]. A higher resolution than that of the present dataset would be
needed, so this feature cannot be used here.
- The noisiness is the ratio of the energy of the noise (the non-harmonic part) to
the total energy. It is close to 1 for a purely noisy signal and 0 for a purely
harmonic signal.
Low-level perceptual features
To obtain low-level perceptual features, the audio signal is processed by a filter bank
to compress the signal without humanly noticeable changes. From the resulting
signals, linear and non-linear predictors are computed, after which spectral features
can be derived, together with model-specific coefficients. Before the advent of
modern digital signal processing, band-pass filters were the only way to obtain a
time-frequency representation: they divide the input signal into frequency bands, and
the magnitude of each filter's output controls a transducer that records the
spectrogram as an image on paper. For the scope of this thesis, however, filters are
considered part of the feature extraction techniques applied to the already created
spectrogram.
- For this thesis, the third-octave filter bank has been applied to the
spectrogram and is the starting point for further research.
- The Log-Gabor filter, named after Dennis Gabor, is an improvement of the
original Gabor filter, which is primarily used for edge detection in image
processing. It offers simultaneous localization of spatial and frequency
information, whereas the Fourier Transform only provides frequency
information.
Examples of Linear band conversions:
- Linear Frequency Cepstral Coefficients (LFCC) are similar to Mel Frequency
Cepstral Coefficients (see below), but whereas the MFCC filter banks become
wider at higher frequencies, the linear (LFCC) filter banks are equally spaced.
- Linear Predictive Coding (LPC) is a mathematical operation in which future
values of a discrete-time signal are estimated as a linear function of previous
samples.
Examples of logarithmic conversions:
- Mel Frequency Cepstral Coefficients (MFCC) originate from automatic
speech recognition but have evolved into one of the standard techniques in
most domains of audio recognition, such as environmental sound
classification. They represent the timbral information (spectral envelope) of a
signal. Computation of the MFCC includes conversion of the Fourier
coefficients to the Mel scale. After conversion, the obtained vectors are
logarithmized and decorrelated by a discrete cosine transform (DCT) in order
to remove redundant information. Figure 5 shows the process of MFCC
feature extraction [9][12].
Figure 5: MFCC extraction process
- Gammatone Frequency Cepstral Coefficients (GFCC) are a variant of the
MFCC using the cubic root of the time frequency representation instead of the
log, in combination with a gammatone weighted filter bank instead of a Mel
weighted filter bank.
- Although the Mel bands are used for the Mel Frequency Cepstral
Coefficients, the Bark bands are a better approximation of the human auditory
system. The latter will be used for the calculation of the loudness, specific
loudness, sharpness and spread. A.1.7
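The Mel pipeline described above (filter bank, logarithm, DCT, as in Figure 5) can be sketched as follows; the filter and coefficient counts are arbitrary illustrative choices, not the values used in this thesis:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(power_spectrum, fs, n_filters=12, n_coeffs=5):
    """MFCC of one power-spectrum frame: Mel filter bank -> log -> DCT."""
    n_bins = len(power_spectrum)
    freqs = np.linspace(0.0, fs / 2.0, n_bins)

    # Triangular filters equally spaced on the Mel scale (wider in Hz
    # towards higher frequencies, as noted for MFCC vs. LFCC above).
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))

    mel_energies = fbank @ power_spectrum         # filter bank stage
    log_energies = np.log(mel_energies + 1e-10)   # logarithmize
    return dct(log_energies, norm="ortho")[:n_coeffs]  # decorrelate (DCT)

spectrum = np.abs(np.fft.rfft(np.random.default_rng(0).normal(size=512))) ** 2
coeffs = mfcc_frame(spectrum, fs=16000)
print(coeffs.shape)   # → (5,)
```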
Mid-level Temporal Features
Unlike low-level features, mid-level features capture relations between subsequent
frames and are able to recognise events, and thus detect temporal anomalies. The size
of the frames and their overlap are of crucial significance.
- Zero-crossing rate (ZCR) is the most common zero-crossing-based audio
feature. It is defined as the number of time-domain zero crossings within a
processing frame [10]. A.1.8
- The Spectral Flux (SF) defines the amount of frame-to-frame fluctuation in
time: i.e., it measures the change in the shape of the power spectrum [11].
A.1.9
- Short Time Energy (STE) is one of the energy-based audio features. It is easy
to calculate and provides a convenient representation of the amplitude
variation over time. It indicates the loudness of an audio signal and is thus a
reliable indicator for silence detection [12][13][14][15]. A.1.10
- Temporal Centroid (TC) is the time average over the envelope of a signal in
seconds. It is the point in time where most of the energy of the signal is
located on average. A.1.11
- Energy Entropy (EE) can be interpreted as a measure of abrupt changes in the
energy level of an audio signal. A.1.12
- Autocorrelation (AC) represents the correlation of a signal with a time-shifted
version of itself for different time lags. It reveals repeating patterns and their
periodicities in a signal and can be employed, for example, for the estimation
of the fundamental frequency of a signal. This allows distinguishing between
sounds with a harmonic spectrum and those with a non-harmonic spectrum,
e.g., between musical sounds and noise [1]. A.1.13
- Root Mean Square (RMS) is a measurement of energy in a signal. A.1.14
- The mean, variance and standard deviation are the classic, simple and useful
features for gaining fast insight into data.
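A few of the temporal features above reduce to one-liners; the frames below are arbitrary test values:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Number of sign changes within one processing frame (ZCR)."""
    return int(np.count_nonzero(np.diff(np.signbit(frame))))

def short_time_energy(frame):
    """Sum of squared amplitudes: a convenient loudness indicator (STE)."""
    return float(np.sum(frame ** 2))

def root_mean_square(frame):
    """RMS: a measurement of the energy in a signal."""
    return float(np.sqrt(np.mean(frame ** 2)))

print(zero_crossing_rate(np.array([1.0, -2.0, 3.0, 4.0])))  # → 2
print(short_time_energy(np.array([1.0, -2.0])))             # → 5.0

t = np.arange(0, 1, 1 / 1000)                 # 1 s at 1 kHz
tone = np.sin(2 * np.pi * 10 * t)             # full periods of a sine
print(round(root_mean_square(tone), 3))       # → 0.707
```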
2.2.2. Data exploration
Audio classification systems have traditionally relied on hand-crafted audio features,
with the Mel Frequency Cepstral Coefficients being the most popular choice. MFCC
simulate human hearing by extracting those features that are the most important for
human interpretation. Consequently, they are highly effective when it comes to music
and speech analysis, but they might overlook important aspects in noisy
environments. Data exploration searches for characteristics of the data without any
interpretation assumption. By clustering data in intelligent ways, characteristics can
be revealed and used as features. As this is a form of clustering, the same algorithms
as for classification-oriented clustering can be used; these are listed in the next
paragraph. Data exploration is obviously more complex than field knowledge, but
enables new insights and flexibility across environments and applications.
2.3. Classification
Algorithms for detecting anomalies, or for any other kind of classification, can be
categorized based on the types of data labels known a priori. If the algorithm is given a
set of example records labelled as anomalous, it is referred to as a supervised learning
algorithm. The examples, commonly termed training samples, are used to teach the
classifier how to assign an unseen feature vector to the correct class. In contrast, if the
algorithm has no notion of the labels of the data, anomalous or otherwise, it is
referred to as an unsupervised learning algorithm. A third category, semi-supervised
learning algorithms, involves approaches that assume the training data
only has labelled instances for the normal data type. This is a more common approach
to anomaly detection than supervised learning, as it is normally difficult to
identify all the abnormal classes. Finally, most anomaly detection applications in
practice have their anomalies defined through human effort. As a result, it is much
more common in practice to find datasets that are unlabelled, or partially unlabelled,
requiring the use of an unsupervised or semi-supervised learning algorithm. Many
algorithms can be used in both a supervised and an unsupervised manner, so the
categorization below is open to discussion; what matters is understanding the
algorithms in order to apply them creatively and innovatively. See
the referred appendix for a more detailed description of each.
2.3.1. Supervised
- Gaussian Mixture Models (GMM) are a form of statistical modelling.
Statistical modelling approaches rely on the assumption that normal records
occur in high probability regions of a stochastic model, while anomalies occur
in the low probability region. These techniques fit a statistical model to the
given training dataset and then apply a statistical inference model to determine
if the test record belongs to the model with high probability. There are two
types of statistical modelling techniques: parametric and non-parametric
models. Parametric techniques assume that normal data is generated by a
parametric distribution; such an example is the Gaussian model and the
Regression model. Non-parametric techniques make few assumptions about
the underlying distribution of the data and include techniques such as
histogram-based, and kernel-based approaches. GMMs are used in classifying
different audio classes. It is an intuitive approach when the model consists of
several Gaussian components, which can be seen to model acoustic features.
In classification, each class is represented by a GMM and refers to its model.
Once the GMM is trained, it can be used to predict which class a new sample
probably belongs to. The potential of Gaussian mixture models to represent an
underlying set of acoustic classes by individual Gaussian components, in
which the spectral shape of the acoustic class is parameterized by the mean
vector and the covariance matrix, is significant, especially if the dataset
(feature data points) is large. Moreover, these models have the ability to form
a smooth approximation to the arbitrarily shaped observation densities in the
absence of other information. With Gaussian mixture models, each sound is
modelled as a mixture of several Gaussian clusters in the feature space. There
are several advantages to the statistical modelling approach to anomaly
detection that make it work for Big sensor Data cases. First, if the assumptions
hold true, the approach provides a statistically justified solution to the
problem. Second, the anomaly score output by the statistical model can be
associated with a confidence interval, which provides additional information
the algorithm can use to improve its efficiency and performance. Finally, data
in the Big sensor Data context normally follows a normal, or Gaussian,
distribution.
However, there are several disadvantages to the statistical modelling approach.
First, if the underlying assumption that the data follows a particular
distribution fails, the statistical technique will fail. Second, selecting the
test hypothesis is not a straightforward task, as there exist several types of
test statistics. While these disadvantages certainly affect the Big sensor Data
scenario, there are ways to overcome them while retaining the aforementioned
advantages. For example, the algorithm can be evaluated a priori with respect
to a variety of test statistics; comparing the results of this step allows the
most appropriate test statistic to be selected for the given use case.
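The idea of scoring records by their likelihood under a fitted mixture can be sketched as follows, using a minimal one-dimensional EM implementation in Python on synthetic data (real acoustic feature vectors are of course multi-dimensional):

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    # Minimal EM for a one-dimensional Gaussian mixture.
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # spread initial means
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return w, mu, var

def log_likelihood(x, w, mu, var):
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 1, 500)])  # two "acoustic classes"
w, mu, var = fit_gmm_1d(x)
score = log_likelihood(np.array([0.0, 4.0]), w, mu, var)
print(score[0] > score[1])  # True: the point between the clusters is anomalous
```

Records falling in low-probability regions of the fitted model receive a low log-likelihood, which can serve directly as an anomaly score.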
- Hidden Markov Models (HMM) are a tool for representing probability
distributions over sequences of observations. The model gets its name from two
defining properties. First, it assumes that the observation at a certain time was
generated by some process whose state is hidden from the observer. Second, it
assumes that the state of this hidden process satisfies the Markov property:
that is, given the value of the previous state, the current state is independent of
all the states prior to the previous one. In other words, the state at some time
encapsulates all we need to know about the history of the process in order to
predict the future of the process.
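The Markov property is what makes inference tractable: the forward algorithm below computes the probability of an observation sequence while carrying only the current state distribution (a Python sketch with hypothetical 'quiet'/'loud' states; all probabilities are invented for the example):

```python
import numpy as np

A = np.array([[0.9, 0.1],   # state transition matrix P(next state | current state)
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],   # emission matrix P(observed symbol | state)
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])   # initial state distribution

def forward(obs):
    # Forward algorithm: only the current state distribution (alpha) is
    # carried along; the Markov property makes the full history redundant.
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()   # P(observation sequence | model)

print(forward([0, 0, 1]))  # ~0.1068
```

Summing `forward` over all possible sequences of a fixed length gives 1, which is a quick consistency check on the model.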
- The K-Nearest Neighbor (K-NN) algorithm is, despite its simplicity, well
suited for both binary and multi-class problems. Its outstanding characteristic
is that it does not require a training stage in the strict sense; the training
samples are used directly by the classifier during the classification stage.
The key idea behind this classifier is that, given an unknown pattern (feature
vector), we first detect its k nearest neighbours in the training set and count
how many of those belong to each class. In the end, the feature vector is
assigned to the class with the highest number of neighbours.
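The classification stage can be sketched in a few lines of Python, using toy two-class data:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    # No training stage: the labelled samples are used directly.
    # Find the k nearest neighbours and take a majority vote.
    d = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_x, train_y, np.array([4.8, 5.1])))  # 1
```

The choice of k and of the distance metric are the only tuning decisions; here plain Euclidean distance is used.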
- Support Vector Machines (SVM) are classifiers that have been successfully
employed in numerous machine-learning fields and are a very effective method
for general-purpose pattern recognition. Given a set of points that belong to
either of two classes, an SVM finds a hyperplane leaving the largest possible
fraction of points of the same class on the same side, while maximizing the
distance of either class from the hyperplane.
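A linear SVM can be trained, for instance, with Pegasos-style stochastic subgradient descent on the hinge loss (a Python sketch on toy separable data; this is one of several possible training methods, not necessarily the one used in the cited literature):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=300, seed=0):
    # Pegasos-style stochastic subgradient descent on the hinge loss:
    # approximates the maximum-margin hyperplane. Labels are -1/+1 and a
    # bias term is absorbed by appending a constant 1 to every sample.
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            if y[i] * (w @ Xb[i]) < 1:             # sample violates the margin
                w = (1 - eta * lam) * w + eta * y[i] * Xb[i]
            else:
                w = (1 - eta * lam) * w
    return w

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])
w = train_linear_svm(X, y)
Xb = np.hstack([X, np.ones((len(X), 1))])
print(np.all(np.sign(Xb @ w) == y))  # True
```

The regularization weight `lam` trades margin width against training errors; kernelized SVMs extend the same idea to non-linear decision boundaries.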
- Bayesian networks (BN) are commonly used in multi-class anomaly detection
scenarios. Bayesian networks are a classification-based approach whereby the
network estimates the probability of observing a class label given a test record.
The class label with the highest probability is chosen as the predicted class
label for the test record. There are a few advantages to the classification-based
Bayesian network approach. First, multi-class Bayesian networks use a
powerful algorithm to distinguish between instances belonging to different
classes. Second, the testing phase is fast. This is important in anomaly
detection for sensor networks, as all the sensor readings must be evaluated in
real time; without a fast testing phase this is impossible. Bayesian networks
also have some disadvantages for anomaly detection that eliminate them from
contextual-based approaches. First, where there are multiple anomaly class
labels or multiple normal class labels, the system relies on readily available
class labels for all the normal classes. As was previously discussed, this is
generally not possible for practical purposes: human intervention is normally
required to label training records, and it is not always available. Second,
Bayesian approaches assign discrete labels to the test record. This is
disadvantageous for sensor networks, where a meaningful anomaly score is often
desired. When evaluating the effectiveness of an anomaly detection approach, it
is useful to discriminate between anomaly scores. This is even more evident in
Big Data contexts, as a continuous anomaly score can help reduce the overall
number of anomalies that need to be evaluated.
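In its simplest form, the class-posterior idea can be illustrated with a naive Bayes model, i.e. a Bayesian network in which the features are conditionally independent given the class (a Python sketch with invented data):

```python
import numpy as np

class GaussianNaiveBayes:
    # The simplest Bayesian network: all features are conditionally
    # independent given the class. Prediction picks the class label with
    # the highest posterior probability for the test record.
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log P(c | x) is proportional to log P(c) + sum_j log N(x_j; mu_cj, var_cj)
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2 * np.pi * self.var)).sum(axis=2)
        return self.classes[np.argmax(np.log(self.prior) + ll, axis=1)]

X = np.array([[1.0, 1.0], [1.2, 0.9], [6.0, 6.1], [5.8, 6.2]])
y = np.array([0, 0, 1, 1])
print(GaussianNaiveBayes().fit(X, y).predict(np.array([[6.0, 6.0], [1.0, 1.1]])))  # [1 0]
```

The discrete-label limitation discussed above is visible here: `predict` returns only the argmax class, whereas keeping the underlying log-posteriors would yield a continuous score.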
2.3.2. Unsupervised
- Neural networks, in particular Deep Learning Networks, are one of the
primary tools for feature learning. A neural network is a complex hierarchy of
linear and nonlinear nodes that can learn complex functions automatically,
provided they are fed enough input data. Deep networks are many-layered
versions of neural networks, and their depth allows them to learn
hierarchies of features, enabling them to automatically learn very high-level
features about a signal. Neural networks were invented as early as the 1940s,
and were researched extensively until the early 1990s, when successes in other
machine learning algorithms eventually took the focus in the machine learning
community. However, in the last decade, important advancements in training
deep networks, combined with the general increased computing power
available in society brought neural networks back to prominence. Across
computational research genres, but specifically in the image and audio
research, deep neural nets have recently shown promising results. One of the
biggest problems with deep neural networks is that the derivatives that are
backpropagated during supervised training become extremely weak, so that they
have minimal effect by the time they reach the beginning of the network. It was
shown that the use of greedy layer-wise training, in particular, is what
brought neural networks back to prominence. This initializes the network
weights in an unsupervised fashion, one layer at a time, while freezing the
weights of the other layers. An unsupervised algorithm such as k-means or
sparse coding is typically used for this. Unsupervised initialization of the network
significantly improves the performance of neural networks.
- Self Organizing Maps (SOM) or Self Organizing Feature Maps (SOFM) are a
type of artificial neural network, invented by Professor Kohonen. They use a
data visualization technique to reduce the dimensions of data through
self-organizing neural networks. The problem that data visualization
attempts to solve is that humans simply cannot visualize high-dimensional data,
so techniques are created to help us understand it. SOMs reduce dimensions by
producing a map, usually of one or two dimensions, which plots the similarities
of the data by grouping similar data items together. SOMs thus accomplish two
things: they reduce dimensions and display similarities.
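A minimal one-dimensional SOM can be sketched as follows (Python, on synthetic data; the grid size and the learning-rate and neighbourhood schedules are illustrative choices, not the canonical ones):

```python
import numpy as np

def train_som(data, grid=5, iters=2000, seed=0):
    # One-dimensional SOM: each of the `grid` units holds a weight vector.
    # The best matching unit (BMU) and its neighbours are pulled toward
    # each presented sample, so nearby units end up representing similar data.
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(grid, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # best matching unit
        lr = 0.5 * (1 - t / iters)                           # decaying learning rate
        sigma = max(grid / 2 * (1 - t / iters), 0.5)         # shrinking neighbourhood
        h = np.exp(-((np.arange(grid) - bmu) ** 2) / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)                       # pull neighbourhood toward x
    return w

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (50, 3)),   # two well-separated 3-D clusters
                  rng.normal(4, 0.3, (50, 3))])
w = train_som(data)
units = [int(np.argmin(np.linalg.norm(w - p, axis=1))) for p in data]
print(sorted(set(units)))
```

After training, the two 3-D clusters map onto disjoint groups of units on the 1-D grid, which is exactly the dimension reduction plus similarity display described above.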
- K-means Clustering relies on a single underlying assumption for anomaly
detection: normal data records occur in dense neighbourhoods, while
anomalies occur far from their closest neighbours. The major consideration for
nearest neighbour-based approaches is that a suitable distance or similarity
metric must be used for comparing data records. For some datasets this is
simple; for example, continuous features can be evaluated with the classical
K-means algorithm that uses Euclidean distance. For multiple features, each
feature is compared individually and then aggregated. Using K-means
clustering is usually straightforward and only requires the correct similarity
metric. The definition of the similarity metric is thus one of the major
drawbacks for K-means based approaches. In many cases, including the audio
sensor data scenario, it is difficult to determine a suitable similarity metric for
the aggregation and variety of unstructured, semi-structured, and structured
data. The performance of the K-means algorithm relies heavily on this
similarity metric, and thus when it is difficult to determine the distance
measure, the technique performs poorly.
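The assumption above translates directly into an anomaly score, namely the distance to the nearest centroid (a Python sketch on synthetic two-cluster data):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Classical K-means with Euclidean distance as the similarity metric.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return centroids

def anomaly_score(x, centroids):
    # Normal records lie near a centroid; anomalies lie far from all of them.
    return np.min(np.linalg.norm(centroids - x, axis=1))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
c = kmeans(X, k=2)
print(anomaly_score(np.array([2.5, 2.5]), c) > anomaly_score(np.array([0.1, 0.0]), c))  # True
```

The score is only as meaningful as the distance metric: swapping Euclidean distance for an unsuitable metric degrades both the clustering and the anomaly ranking.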
- Spherical k-Means [17] is a slight variation on the traditional K-Means
algorithm, where the centroids are constrained to the unit sphere at each
update step. In effect, cosine distance is used to measure similarity to
points in the input space instead of the Euclidean distance typically used in
traditional K-means.
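The unit-sphere constraint amounts to one extra normalization per update (a Python sketch; seeding one centroid in each group is done only for a stable demonstration):

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, init=None, seed=0):
    # K-means on the unit sphere: assignment uses cosine similarity and
    # each updated centroid is re-normalized back onto the sphere.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(Xn), k, replace=False) if init is None else np.asarray(init)
    c = Xn[idx]
    for _ in range(iters):
        labels = np.argmax(Xn @ c.T, axis=1)          # most similar centroid
        sums = np.array([Xn[labels == j].sum(axis=0) for j in range(k)])
        norms = np.linalg.norm(sums, axis=1, keepdims=True)
        c = sums / np.where(norms == 0, 1, norms)     # back onto the unit sphere
    return c, labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([5, 0], 0.5, (30, 2)),      # directions near (1, 0)
               rng.normal([0, 5], 0.5, (30, 2))])     # directions near (0, 1)
c, labels = spherical_kmeans(X, k=2, init=[0, 30])
print(np.allclose(np.linalg.norm(c, axis=1), 1.0))    # True
```

Because only direction matters, two vectors of very different magnitude but similar orientation fall into the same cluster, which is the property exploited in the scattering-feature work cited above.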
- The intelligent K-means or iK-means algorithm, designed by Mirkin in 2005,
addresses another drawback of the standard K-means. The standard K-means
algorithm needs the number of clusters K as an input parameter, while this is
actually a variable that is sought after. Predefining the number of clusters K is
solved by the iK-means algorithm, based on the following principle: the
farther a point is from the centroid, the more interesting it becomes. iK-means
uses the basic ideas of Principal Component Analysis (PCA) and selects those
points farthest from the centroid that correspond to the maximum data scatter.
These Anomalous Patterns (AP) are formed as explained below:
1. Calculate the centre of gravity cg of the given data points.
2. Repeat:
3. Create a centroid c farthest from cg.
4. Create a cluster S of the data points that are closer to c than to cg.
5. Update the centroid of S as sg.
6. Set c = sg.
7. Discard small clusters using a prespecified threshold.
8. Until the stopping criterion is met.
At the end of the iK-means, only the good centroids will be left, as small
anomalous pattern clusters are discarded.
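The anomalous-pattern extraction can be sketched as below (a Python sketch; following Mirkin's formulation, the anomalous centroid c is updated while the grand centre cg stays fixed per pattern, and a few guards the listed steps leave implicit are added):

```python
import numpy as np

def anomalous_patterns(X, min_size=5, max_clusters=20):
    # Repeatedly grow a cluster around the point farthest from the grand
    # centre cg, keep it only if it exceeds the size threshold, remove its
    # points, and continue on the remaining data.
    remaining = X.copy()
    centroids = []
    for _ in range(max_clusters):
        if len(remaining) < min_size:
            break
        cg = remaining.mean(axis=0)                              # centre of gravity
        c = remaining[np.argmax(np.linalg.norm(remaining - cg, axis=1))]
        while True:
            members = (np.linalg.norm(remaining - c, axis=1)
                       < np.linalg.norm(remaining - cg, axis=1))  # closer to c than to cg
            if not members.any():
                break
            new_c = remaining[members].mean(axis=0)              # update the centroid
            if np.allclose(new_c, c):
                break
            c = new_c
        if not members.any():
            break
        if members.sum() >= min_size:                            # discard small clusters
            centroids.append(c)
        remaining = remaining[~members]
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.7, (50, 2)),           # main bulk of the data
               rng.normal(10, 0.3, (8, 2)),           # anomalous pattern
               np.array([[-4.0, 4.0], [4.0, -4.0]])]) # two isolated outliers
print(anomalous_patterns(X, min_size=5)[0].round())   # first AP found near (10, 10)
```

The two isolated outliers are extracted but discarded by the size threshold, while the eight-point anomalous pattern survives, so no number of clusters K has to be predefined.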
- Sparse coding is based on the fact that sparse feature vectors contain mostly
zero values and only one or a few non-zero values. Although these feature
vectors can be classified by traditional machine learning algorithms, there are
various recently developed algorithms that explicitly take advantage of the
sparse nature of the data, leading to massive speedups in time, as well as
improved performance. Because of their speed, these algorithms perform well
on very large collections of data such as audio big data.
- Non-negative Matrix Factorization (NMF) is an algorithm for describing the
data as the product of a set of bases and a set of activations, both of which are
non-negative. It is useful for finding parts based decompositions of data. Since
all the components are non-negative, each basis contributes only additively to
the whole. This promotes a solution in which high-energy foreground events
and constant low-energy background energy may be described by different
bases, and therefore separated in the feature presentation. Most applications of
NMF to audio processing decompose spectral magnitude frames (columns of a
spectrogram), and have each NMF bases consist of a single short time frame.
Since we are interested in learning bases that correspond to entire events, we
use the convolutive information of NMF. In this version, bases consist of
spectro-temporal patches of a number of spectral frames stacked together. The
pattern described by each frame is then activated as a whole to contribute to
the reconstruction of the data. Additionally we would like to ensure some level
of sparsity in the activations of these bases. This is in order to encourage the
bases to learn more foreground event patterns and fewer patterns that mimic
the background, which would be activated non-sparsely over large segments
of the data. This NMF algorithm allows us both to locate transients in time,
and to build a dictionary of event-patch code words, within a single
optimization framework, avoiding the separate transient detection and patch
clustering of our earlier approach.
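The basic (non-convolutive) decomposition can be sketched with the classical multiplicative updates (a Python sketch on a toy rank-2 "spectrogram"; the convolutive and sparsity-constrained variants described above build on these same updates):

```python
import numpy as np

def nmf(V, r, iters=500, seed=0):
    # Lee-Seung multiplicative updates: V is approximated by W @ H with
    # all factors non-negative, so every basis contributes only additively.
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + 0.1
    H = rng.random((r, V.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update bases
    return W, H

# Toy "spectrogram": two spectral patterns active over different frames
V = np.outer([1.0, 0, 2, 0], [1, 1, 0, 0]) + np.outer([0, 3.0, 0, 1], [0, 0, 1, 1])
W, H = nmf(V, r=2)
print(np.allclose(W @ H, V, atol=0.1))  # True
```

Each column of `W` recovers one spectral pattern and each row of `H` its activation over time, which is the parts-based separation of foreground and background described above.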
- Principal Component Analysis or PCA whitening [19] does what its name says:
it finds the principal components of the data. It is often useful to measure
data in terms of its principal components rather than on the standard x-y axes.
The principal components are the directions with the most variance, where the
data is most spread out. A set of data points is deconstructed into its pairs
of eigenvectors and eigenvalues. An eigenvector is a direction along which the
highest variance occurs, while an eigenvalue is a number telling how much
variance there is in the data in that direction. The eigenvector with the
highest eigenvalue is therefore the principal component. PCA whitening is used
instead of ZCA whitening; it serves to decorrelate the inputs from each other,
a step that can significantly improve the quality of the input representation.
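The whitening transform can be sketched directly from the eigendecomposition of the covariance matrix (Python; `eps` is a small regularizer added to the eigenvalues to avoid division by near-zero variance):

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    # PCA whitening: rotate the centred data onto the principal axes
    # (eigenvectors of the covariance) and rescale each axis by
    # 1/sqrt(eigenvalue), so the outputs are decorrelated with unit variance.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigval, eigvec = np.linalg.eigh(cov)
    return Xc @ eigvec / np.sqrt(eigval + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # correlated inputs
Z = pca_whiten(X)
print(np.allclose(Z.T @ Z / len(Z), np.eye(2), atol=0.01))  # True
```

After whitening, the empirical covariance of `Z` is (up to `eps`) the identity matrix: the strong correlation between the two input dimensions has been removed.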
2.4. Related work
As stated before, algorithms are used for feature extraction in the form of clustering,
as well as for classification purposes. Furthermore, some algorithms generally stated
as supervised algorithms can be used in an unsupervised manner. For the division of
the related work, what matters is the main algorithm and the way it is used,
i.e. supervised or unsupervised. Feature extraction in the form of data exploration is thus
considered unsupervised. Conclusions will be made based on their effectiveness and
application purposes.
2.4.1. Supervised
- Ntalampiras, Potamitis and Fakotakis [20] describe acoustic surveillance of
hazardous situations. They make use of MFCCs together with additional
low-level features, such as the fundamental frequency and the audio spectrum
flatness, based on the MPEG-7 standard. Gaussian Mixture Models are used for
clustering and Hidden Markov Models for classification, using labelled data of
explosions, gunshots and screaming in a subway as training data. First, a
GMM of 19 models is constructed, after which an HMM classifies each
atypical situation. Screaming has been detected perfectly, gunshots with 93%
accuracy and explosions with only 86%. This research confirms that noisy
sounds are very difficult to classify when based on classic low-level features
and the need for a new approach is prominent.
- Cotton, Courtenay et al [21] compare spectral features to spectro-temporal
features for acoustic event detection, applied to soundtracks. The sample rate
is 12kHz. For the spectro-temporal features, they made use of convolutive
Non-negative Matrix Factorization (NMF) with the goal of separating event
features (activations) from the constant background (basis patches). Each patch
consists of 32 frames, and 20 basis patches are distinguished. For each event
type, the system seeks the combination of the basis patch and the activation
patch that contributes the most energy. For the activation patch assignment, a
sliding
maximum of each activation dimension is derived and normalized so that each
basis has a maximum activation of 1 over the entire dataset. The parallel
comparison approach uses low-level features, more specifically 25 MFCCs. The
frame spacing for the low-level features is much smaller, only 10ms instead of
250ms. The features of both techniques are classified making
use of Hidden Markov Models. The observation matrix is trained under the
assumption that each event class can be modelled by a simple Gaussian
distribution. A transition matrix is built for the stream of labels. To conclude,
the MFCC's outperform the features found by NMF for the original sound
track data set. With the addition of noise, the accuracy of MFCC decreases
significantly while NMF remains stable.
- Radhakrishnan, Divakaran and Smaragdis [22] explain the use of Gaussian
Mixture Models for the analysis of audio for surveillance applications. Usual
background examples serve as labelled input data; this research can thus
actually be classified as 'semi-supervised'. A GMM models this usual
background, and the likelihood of new arriving data under the
background model is used to flag suspicious events. In the absence of a
background model is used to flag suspicious events. In the absence of a
suspicious event, the GMM is incrementally updated. Thus in case of an
inlier, the model is updated, while an outlier can have two different outcomes:
false alarm or a potential risk. The sampling rate is 125Hz and feature vectors
consist of 12 MFCCs, which are to be modelled by a GMM. The penalty
technique used to minimize the expression is the Minimum Description
Length (MDL). This optimizes the GMM in terms of complexity: as few
factors (classes) as possible without too much loss of descriptive power. Such
a GMM is
made for each sound class (such as 'rush hour' or 'night'). Furthermore,
possible anomalies are investigated in a temporal way: the sequence of
classified normalities and anomalies is smoothed to filter out isolated
instantaneous anomalies, as they are probably noise.
2.4.2. Unsupervised
- Hao et al [23] describe parameter-free audio-motif discovery in large data
archives. The method claims to be parameter free, but its one significant
parameter is the size of the events that are sought. It assumes that this
frame length is known, so it cannot fully be seen as unsupervised
learning. A misestimation of the window length drastically changes the
results, as no overlap is used. Their research is thus most useful for the
recognition of events that look very much alike. It randomly searches for
windows and
uses the highest count of non-trivial matches as a distance measure. The
system quits when probabilities of finding a better window decrease below a
certain threshold.
- Justin Salomon and Juan Pablo Bello [17] introduce a data exploration method
for feature extraction. The technique, named the scattering transform, starts
from the Mel spectrogram. The sampling rate is 44kHz and the time window
is 370ms, which means that 16280 frames are scattered. It is phase invariant,
just as the system of this thesis. It applies a hierarchically ordered wavelet
filter bank to the signal. The first order is the octave, consisting of 8
divisions and thus comparable to the similarly sized Mel spectrogram. For each
frequency octave band, the high-frequency amplitude modulations are captured by
the second-order coefficients. Higher-order coefficients can be calculated, but
most of the signal's energy is captured by the first and second order.
Clustering for
classification is done using spherical K-means. Unlike the traditional K-means
clustering, the centroids must lie on the unit sphere.
- Another study by Salomon and Bello [26] is 'Unsupervised feature learning
for urban sound classification.' It also starts from the Mel spectrogram and
applies PCA whitening to it. The feature vectors are downsampled by
taking the mean and variance of all those captured by a certain time window.
The K-means algorithm clusters these averaged feature vectors and builds a
codebook with the K centroids as words. A Random Forest Classifier is used for
the classification process but is not further described here. The interesting
parts of this paper are the gathering of feature vectors over a certain time
window and the K-means clustering technique used to construct the codebook.
- Gomes and Batista [24] also make use of motifs, this time specifically
applied to urban sounds. To transform the raw data into useful information,
they make use of SAX, which is comparable with the Fourier Transform but needs
less space. Each clustered spectrum is represented as a letter from the
alphabet; the bigger the alphabet, the higher the resolution. A series of
letters is thus obtained, which is divided into segments of equal length using
the Piecewise
Aggregation Approximation (PAA) algorithm. Each segment is assigned its
average value, together with the amplitude. Subsequences are then
compared, and the most similar subsequences are called a motif. When they look
exactly the same, a higher resolution is applied to see if they actually are the
same. The biggest problem in this approach is again the predefined length of
the segments.
- Lee, Largmann, Pham and Ng [25] used Convolutional Deep Belief Networks
for audio classification in an unsupervised manner. The deep belief network is
a probabilistic model composed of one visible layer and many hidden layers.
Each hidden layer unit learns a statistical relationship between the units in the
lower layer; the higher layer representations thus tend to become more
complex. The hidden layers are trained one at a time, bottom-up. In this paper,
the convolutional deep belief networks are applied on unlabelled auditory data
such as speech and music. The contribution to this thesis is thus rather small,
but it is interesting as it is truly unsupervised and a successful alternative for
baseline features such as the MFCC, as the learned features correspond to
phones and phonemes.
- Cai, Lu and Hanjalic [27] developed unsupervised content discovery in
composite audio. This research can be divided into two parts: audio element
discovery and the spotting of key elements. Spectral features together with the
MFCCs add up to 29 dimensions for the feature vectors. Again, windows
gather different feature vectors, and their mean and standard deviation become
the new features. The window length is 1 second and the overlap 0.5s. The K-
means algorithm is used to cluster these mean and standard deviation feature
vectors. After clustering, a time series can be generated and smoothed to
eliminate anomalies caused by error or noise. Key elements can be spotted by
their occurrence frequency. An analogy to text is made, assuming key elements
are shorter in length and less common. One unifying importance score for key
elements is obtained, and the first K elements are taken as key elements. Now
that key elements are distinguished from background, a temporal analysis takes
place to detect which key elements occur together and which background
accompanies them; in other words, scenes are detected. If the affinity (based
on a threshold) between two key elements is low, a new scene starts; otherwise
they are seen as one. Scenes are again clustered with K-means, using the
Bayesian Information Criterion (BIC) as complexity penalty.
2.4.3. Conclusions
The main conclusion that can be drawn from the related work is that research depends
either on the extraction of classically known low-level features, or on the
availability of labelled data so that features can be derived by data
exploration. The major part of
research even combines both field knowledge for feature extraction and supervised
learning for classification. It lies in human nature to apply existing
knowledge rather than to dive into the unknown, and furthermore, this performs
quite accurately for the fields that are most attractive and have thus received
the most attention: speech and music. Ntalampiras et al [20], for example, rely
on MFCCs and have labelled data available. Their evident conclusion is that the
approach only works accurately for samples involving speech.
Radhakrishnan et al [22] also rely on MFCCs for feature extraction and have
samples of 'normal' audio available in the search for anomalies. This research
already belongs to the semi-supervised category because it is unknown what is
being looked for. Obviously, as this thesis treats urban sound signals, the
related work above does not focus on low-level features combined with
supervised learning; the listing is thus not representative of the focus of
audio analysis in general. Data exploration for feature extraction, as well as
unsupervised classification, is still in its infancy. A combination of both
seems even nonexistent, and this is why this thesis is unique: it combines data
exploration with unsupervised classification. The reason is very simple:
standard low-level features have been proven to be ineffective for
environmental audio signals, there is little to no additional information about
what types of features could be significant, there is no labelled data
available, and there is no possibility to create labels through human
supervision.
Interesting work on feature extraction is the idea of scattering by Salomon and
Bello, applied in two of their papers [17][26]. Although they start from the Mel
spectrum, it is an interesting starting point for discovering features. Cai, Lu
et al [27] also scatter the feature vectors and apply statistical parameters.
For this thesis, instead of scattering low-level features, the spectrograms
could be scattered and described by a more complex statistical model instead of
basic parameters. Classification inspiration comes from the same papers
[17][26][27] for their use of K-means, but especially the use of GMMs is
attractive for both feature extraction and classification, inspired by
Ntalampiras et al [20].
3. Proposed Model: New Approach (GMM)
3.1. Concept
The approach of this thesis is data exploration for feature extraction, combined with
unsupervised learning for classification. In chapter 4, a more classical approach is
applied, starting from spectral features and using similar techniques for classification.
The results are compared and their similarities and dissimilarities are discussed in
chapters 5 and 6.
3.2. Data Preparation
3.2.1. Missing Data
The data is provided in gunzip files per hour, containing roughly eight
spectrograms a second. There is a little deviation in the inter-arrival time of
the data, but it is negligible. In order to detect missing data, and make it
part of the system, empty files have been filled with zero values and included
in the research. The main idea is that malfunction or defects of a microphone
can become part of the research. When malfunctions and defects are integrated,
trends in them can also be revealed and alerts raised when a shift occurs. The
minute in which the microphone fails or resumes will in any case be detected as
a point anomaly, because it consists partially of data and partially of zero
values, which otherwise never occurs. With regard to contextual anomalies, the
zero values can also be useful to detect the times at which it is unusual for
microphones to fail. For example, the system might find a failure quite normal,
unless it always happens in wintertime and all of a sudden it fails during
summer. This might help discover causes of breakdown or other disturbances.
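The zero-filling idea can be sketched as follows (a Python sketch for illustration; the actual work was done in Matlab, and the default hour-matrix shape of roughly eight spectrograms per second with a hypothetical number of frequency bands is an assumption):

```python
import numpy as np

def fill_missing_hours(hourly_data, rows=8 * 3600, cols=31):
    # Hours with no recorded spectrograms become all-zero matrices of the
    # expected shape (rows: ~8 spectrograms/second over an hour; cols: the
    # number of frequency bands, a hypothetical value here), so microphone
    # failures stay part of the dataset and can be analysed as anomalies.
    return [h if h is not None and h.size > 0 else np.zeros((rows, cols))
            for h in hourly_data]

hours = [np.ones((2, 3)), None, np.empty((0, 0))]   # toy shapes for the demo
filled = fill_missing_hours(hours, rows=2, cols=3)
print([h.shape for h in filled])  # [(2, 3), (2, 3), (2, 3)]
```

Because the zero matrices have the same shape as real hours, all downstream processing runs unchanged, and the all-zero pattern itself becomes detectable.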
3.2.2. Reorganization
For ease of use, the original data is reorganized into one file per month with two
cells per hour, one containing the data matrix and one containing the time details.
3.3. Programming Language
Matlab is chosen for its efficiency and ready-made functions. There is also a lot of
external information available, which is helpful when addressing a niche field.
3.3.1. Efficiency
Throughout this project, great care has been taken with efficiency in code writing.
Basic rules have been applied, such as pre-allocation, the use of functions,
moving loop-independent variables out of loops, and the use of disk variables
(matfile) to access data without loading the whole workspace.
The first stage of this project was running on a MacBook air 11" and the runtime was
consequently a restrictive element. Matlab includes a Parallel Computing Toolbox,
which makes it possible to pool the processor cores. 'For' loops are to be replaced with 'parfor'
loops and must meet some specific demands in order to be effective. The most
difficult demand to meet is that each iteration of the loop must be independent of any
other iteration. Nested loops are thus very difficult to implement, as everything must
be written to a cell from which it can be looked up in another iteration. Parfor is
truly effective in runtime, as it doubles the execution speed. On the other hand, the
implementation is very time-consuming and an intensive thinking process; it is
thus only used for very computationally intensive tasks in this thesis.
Another technique is to cluster different servers, which could be applied with the help
of a friend's MacBook Pro 15". A cluster connects different processors and uses their
cores in parallel. Unfortunately, the used version of Matlab does not have access to
the Distributed Computing Server toolbox. However, there exists a system that can
get around the Matlab toolbox and operates just as successfully. It is called Matlab MPI,
the successor of pMatlab, and was developed by MIT. The difficulty is the setup, as
it needs a passwordless ssh connection between the two processors, which requires
bypassing the security measures that are implemented automatically. The efficiency
of Matlab MPI again depends on the complexity of the task, as the setup of the cluster
and the division of tasks can take more than an hour. For very computationally
intensive scripts, it does compensate for the setup time and is thus used accordingly.
3.4. Feature Selection
3.4.1. Failed try-out
An initial attempt was to create low-level spectral features by generating a Gaussian
Mixture Model per spectrogram. There are three major issues with this approach. First,
to generate a GMM, scatter points drawn from the distribution are needed, as shown in
Figure 6.
Figure 6: 1-Dimensional GMM
The data points of the spectrogram, however, are weighted data points rather than
scatter points, as shown in Figure 7. Matlab's GMM fitting is not suited for such
weighted curve fitting; it needs scattered data points, so the weighted data must first
be expanded, which involves significant data growth.
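The expansion can be illustrated as follows: each weighted bin is repeated in proportion to its weight before a scatter-based fitter can be applied. A minimal sketch in Python (the function name and the scale parameter are illustrative, not from the thesis code):

```python
import numpy as np

def expand_weighted_points(values, weights, scale=1):
    # Turn weighted (histogram-like) data into plain scatter points by
    # repeating each value proportionally to its integer-scaled weight.
    counts = np.round(np.asarray(weights) * scale).astype(int)
    return np.repeat(np.asarray(values, dtype=float), counts)

# A 3-bin weighted column becomes 6 scatter points:
pts = expand_weighted_points([10.0, 20.0, 30.0], [1, 2, 3])
```

Applied to a full spectrogram, this repetition is exactly the data growth the text refers to.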
Figure 7: 1-Dimensional Gaussian curve fitting
Second, in order to find the optimal number of components to describe the
spectrogram, the number of components is initially set high and then decreased,
applying the AIC penalty to determine the best number of components. The Akaike
Information Criterion (AIC) is a measure of the relative quality of statistical models
for a given set of data: given a collection of candidate models, the AIC estimates the
quality of each model relative to the others. Let L be the maximum value of the
likelihood function for the model and let k be the number of estimated parameters of
the model. Then the AIC value of the model is the following:
𝐴𝐼𝐶 = 2𝑘 − 2𝑙𝑛(𝐿)
Given a set of candidate models for the dataset, the preferred model is the one with
the lowest AIC value. Because the simulated scatter data occurs in bins, the fit is
ill-conditioned and Matlab raises an error when too many normal components are
requested. This is a problem because the error already occurs as soon as the maximum
number of components is set slightly higher than the optimum, which is exactly what
running the AIC technique requires.
To solve this problem, another GMM function, freely available on the Internet, has
been applied that bypasses the ill-conditioning error. It is the EM_GM function
written by Patrick P.C. Tsui of the Department of Electrical and Computer Engineering
at the University of Waterloo. This function, which also uses the AIC criterion, does
not encounter any hindrance in computing the GMMs, but it reveals that each histogram
needs a different number of Gaussian components to accurately fit the distribution.
The number of components can therefore not be hard coded, which makes the approach
computationally very expensive.
Third, despite all the computational effort and data reconstruction, the result is
not as precise as hoped for, as shown in Matlab graph 1.
Matlab graph 1: GMM per spectrogram
For the reasons described above, i.e. the data expansion, the ill-conditioning and
the poor fit, the idea is abandoned.
3.4.2. Gaussian Mixture Model per minute
Creating low-level spectral features by means of a descriptive model or curve fit has
proven to be expensive, both in computation time and in data size. A better approach
is to describe spectral and temporal features in one model. By scattering several
consecutive spectrograms in one single graph, relations between data points in
frequency as well as in time can be captured. This significantly reduces the data
size and, furthermore, exploits the known fact that data points close to each other
in time or frequency do not differ much; they can thus be described by a single entity.
Different time windows have been tried and a 1-minute time frame gives the most
interesting impression. Looking at Matlab graph 2, several distinct events seem to
occur, each concentrated around a mean and thus approaching a normal distribution. A
Gaussian Mixture Model could therefore describe the whole minute accurately and,
furthermore, distinguish each of these seemingly independent events.
Matlab graph 2: GMM on a 1 minute spectrogram
To give a better impression of the GMM fit, Matlab graph 3 shows the 3D
representation of the GMM, which looks promising as it reproduces the original
scatter points quite precisely.
Matlab graph 3: GMM of one scattered minute
In order to save computational power, hard coding the number of Gaussian components
is tested for feasibility. One hundred random minutes have been scattered and their
optimal number of components has been determined with the AIC criterion. For 94 of
those 100 minutes, five components come out as the optimal number, which justifies
hard coding the number of components to five, saving a lot of runtime.
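The selection step can be sketched with scikit-learn's GaussianMixture, which exposes the same AIC measure; this is an illustrative Python reconstruction, not the thesis's Matlab code, and the synthetic two-blob data merely stands in for one scattered minute.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_n_components(points, max_components=8, seed=0):
    # Fit GMMs for a range of component counts and keep the count
    # with the lowest AIC, as in the selection procedure above.
    aics = []
    for n in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=n, random_state=seed).fit(points)
        aics.append(gmm.aic(points))
    return int(np.argmin(aics)) + 1

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs as a stand-in for time-frequency scatter:
pts = np.vstack([rng.normal(0.0, 0.3, (200, 2)),
                 rng.normal(5.0, 0.3, (200, 2))])
n = best_n_components(pts)
```

Once such a survey shows that one component count dominates, that count can be fixed and the loop dropped, exactly the runtime saving described above.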
The result is that each minute, originally described by 480 spectrograms of 31 digits
each (14880 in total), is now described by a GMM of five components of five digits
each (25 in total), a data reduction by a factor of 595. Of the five parameters
describing each Gaussian component, two are allocated to the mean and three to the
covariance:
The mean value is defined by its coordinates, one on the frequency axis and one on
the amplitude axis.
The variance of a Gaussian distribution in two-dimensional space cannot be fully
characterized by a single number, nor do the variances in the x and y directions
contain all of the necessary information; a 2x2 matrix, the covariance matrix, is
needed to fully characterize the two-dimensional variation. Because the covariance of
a random variable with itself is simply that random variable's variance, each element
on the principal diagonal of the covariance matrix is the variance of one of the
random variables. The covariance matrix of each Gaussian component is thus a
symmetric 2x2 matrix and can be described by three digits: a, b and c.
[𝑎 𝑏; 𝑏 𝑐] = covariance matrix of a 2D Gaussian

𝑎 = cos²𝜃 / (2𝜎𝑥²) + sin²𝜃 / (2𝜎𝑦²)
𝑏 = −sin(2𝜃) / (4𝜎𝑥²) + sin(2𝜃) / (4𝜎𝑦²)
𝑐 = sin²𝜃 / (2𝜎𝑥²) + cos²𝜃 / (2𝜎𝑦²)
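These three formulas can be checked numerically: with the sign convention used here, [a b; b c] equals half the inverse of the rotated covariance matrix. A small Python sketch (the cross-check itself is my addition, not part of the thesis):

```python
import numpy as np

def abc_from_angle(theta, sx, sy):
    # Exponent coefficients of a rotated 2-D Gaussian,
    # f ~ exp(-(a*dx^2 + 2*b*dx*dy + c*dy^2)).
    a = np.cos(theta)**2 / (2 * sx**2) + np.sin(theta)**2 / (2 * sy**2)
    b = -np.sin(2 * theta) / (4 * sx**2) + np.sin(2 * theta) / (4 * sy**2)
    c = np.sin(theta)**2 / (2 * sx**2) + np.cos(theta)**2 / (2 * sy**2)
    return a, b, c

theta, sx, sy = 0.7, 1.5, 0.4
a, b, c = abc_from_angle(theta, sx, sy)
# Rotated covariance matrix consistent with this sign convention:
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Sigma = R.T @ np.diag([sx**2, sy**2]) @ R
M = np.linalg.inv(Sigma) / 2  # should equal [[a, b], [b, c]]
```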
3.5. Classification
The obtained Gaussian parameters or features are the key ingredients for the feature
vectors, the input of the classifiers. Different dimensions and different
classification algorithms will be used, depending on the type of anomaly that is
sought. In chapter five, the results will be compared with each other and with the
results of the classical approach based on spectral features.
3.5.1. Point anomaly: Unusual event
A point anomaly in this project is something unusual on the time scale of a minute:
either a single unusual Gaussian component (an unusual event), or an uncommon
combination of components (an unusual minute). The single unusual Gaussian component,
assumed to reflect an underlying unusual audio event, is discussed first.
2D GMM Clustering
To understand how events behave in time, how many types of events exist and which
distribution they follow, all Gaussian components or 'events' of the entire dataset
have to be observed. Each event consists of five digits, but to give an intuitively
understandable visualization, only the first two digits, the mean values, are plotted
in Matlab graph 4. Note that the missing minutes, which were filled with zero values,
have also been forced into five Gaussian components; their mean values are thus
plotted at zero amplitude.
Matlab graph 4: Mean values of all Gaussian components
This graph is so densely filled that a third dimension, in the form of a histogram,
is needed to reveal the distribution of the scatter points. Matlab graph 5 shows that
at high frequencies, the variance of the distribution is much lower, which was not at
all visible in Matlab graph 4. At high frequencies, nearly identical events are thus
expected almost all the time, while medium to medium-low frequencies seem to include
many different possible events. Low frequencies again tend to aggregate around one
main type of event.
Matlab graph 5: Histogram of mean values of all Gaussian components
The histogram again suggests a combination of Gaussian distributions. A Gaussian
Mixture Model of these mean values is generated and shown in both 2D and 3D in
Matlab graph 6 and Matlab graph 7. The AIC criterion assigns 15 clusters.
Matlab graph 6: GMM of mean values of all Gaussian components (2D)
Matlab graph 7: GMM of mean values of all Gaussian components (3D)
5D GMM Clustering
So far, the Gaussian components have only been clustered based on their mean values.
The same GMM clustering technique will now be applied to all five dimensions of each
Gaussian component. Visually, this is hard to represent. Therefore, after the
clustering process, the mean values of the centroids are plotted in Matlab graph 8,
so they can be compared visually to the 2D clustering. The AIC criterion assigns 57
clusters.
Matlab graph 8: 2D presentation of 5D GMM clustering
5D GMM Clustering (Standardized)
The input (feature vectors) of the previous 5D GMM clustering were the original
values of the mean and covariance. Two independent measures are thus combined in one
feature vector. Because their scales differ, they won't be treated as equally
important when clustered: the features with the largest absolute range have more
influence in the clustering process than those with a smaller absolute range. Feature
scaling rescales all features to the same significance level. The two usual
feature-scaling techniques are standardization and normalization.
Standardization or Z-score normalization rescales the features so that they have a
normal distribution with 𝜇 = 0 and 𝜎 = 1. The standard scores, also called Z-scores,
are calculated as follows:
𝑧 = (𝑥 − 𝜇) / 𝜎
The mean value 𝜇 and the standard deviation 𝜎 are calculated jointly for the first
two digits or columns of all feature vectors, and jointly for the last three digits
or columns, not separately for each digit of the feature vector, as the mean values
are supposed to be on one scale and the covariances on another.
An alternative approach is normalization or min-max scaling. In this approach, the
data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded
range is that the standard deviations become much smaller, which suppresses the
effect of outliers and goes against the goal of the clustering process: the search
for outliers. A min-max scaling is typically done via the following equation:
𝑋𝑛𝑜𝑟𝑚 = (𝑋 − 𝑋𝑚𝑖𝑛) / (𝑋𝑚𝑎𝑥 − 𝑋𝑚𝑖𝑛)
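The grouped standardization can be sketched in Python; this is an illustration of the column grouping described above (two mean columns on one scale, three covariance columns on another), with toy data rather than the thesis's feature vectors.

```python
import numpy as np

def zscore_groupwise(X, groups):
    # Standardize columns group by group: one mu/sigma is shared by
    # the two mean columns, another by the three covariance columns.
    Z = np.empty_like(X, dtype=float)
    for cols in groups:
        mu = X[:, cols].mean()
        sigma = X[:, cols].std()
        Z[:, cols] = (X[:, cols] - mu) / sigma
    return Z

X = np.arange(20.0).reshape(4, 5)        # 4 feature vectors of 5 digits
Z = zscore_groupwise(X, [[0, 1], [2, 3, 4]])
```

After this transformation, each column group has zero mean and unit standard deviation overall, so both scales carry equal weight in the Euclidean distance.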
Standardization or Z-score normalization is therefore the better approach for this
project. After clustering, the data is scaled back to the original scale for the
visual representation and comparison. Again, a 2D plot of the resulting clusters is
constructed, in Matlab graph 9. The AIC criterion assigns 42 clusters.
Matlab graph 9: 2D presentation of 5D GMM clustering (standardized)
2D iK-means Clustering
A very popular technique for unsupervised classification is the K-means algorithm. It
is one of the simplest clustering algorithms and works well with large datasets,
large numbers of clusters and long feature vectors. The major drawback of the
classical K-means algorithm is that the number of clusters K must be predefined. As
described in the algorithm section above, the intelligent K-means or iK-means solves
this and is applied here. It is difficult to understand what happens and how the
Gaussian components are clustered, because K-means only defines borderlines between
clusters, with no indication of the probability distribution inside each cluster.
Matlab graph 10 shows that the intelligent K-means (2D) assigns 15 clusters.
Matlab graph 10: K-means defined clusters
According to the algorithm, each cluster contains at least one value, so no empty
zones appear. It is unfortunate that every region of the graph must belong to some
cluster: some clusters are almost empty yet occupy large areas, in which data points
are easily far away from their centroid and thus seen as anomalies, even though they
may be close to the border of more densely occupied clusters and thus not far from
the overall centre of gravity. The details are discussed in the results chapter
(chapter 5). The results of the iK-means clustering are not used for classification.
5D iK-means Clustering
For the same reasons as described in the section on 2D iK-means clustering, 5D
iK-means clustering is not used for classification. The anomalies that were found and
an assessment of their quality are discussed in chapter 5.
Conclusion on clustering techniques
A detailed evaluation of the different techniques is given in section 5.1. The
conclusion is that 5D GMM clustering, not standardized, is the best clustering
technique. Therefore, 5D GMM clustering (not standardized) will be used as the basis
for each type of classification.
3.5.2. Point anomaly: Unusual minute
To detect unusual minutes, i.e. unusual combinations of Gaussian components,
statistical methods are used, based on the previous clustering of each Gaussian
component. Each audio event is now reduced to its assigned cluster number, so point
anomalies in the form of unusual events are 'erased' and do not influence this
classification. Consequently, each minute is described by 5 digits: the index numbers
of the clusters to which its components are assigned. Two different approaches can be
distinguished, based on the mutual dependence assumption of the Gaussian components:
they are assumed to be either independent of or dependent on each other, and the
anomalies are calculated by joint probability or joint correlation respectively.
Joint Probability
This classification assumes that individual audio events or Gaussian components have
no relation to each other; for example, rain has nothing to do with how much traffic
there is at that moment. The joint probability of a minute is calculated by
multiplying the frequency probabilities of the clusters occurring in that minute.
These joint probability values are assumed to be normally distributed, and a certain
confidence interval serves as a threshold for 'unusual' minutes. In other words, if
the joint probability of a minute is very low, it is flagged as an anomaly. This
technique is performed on the results of both the 2D and the 5D GMM clustering. For
the results and plots of these anomalies, see section 5.2.1.
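The computation can be sketched in a few lines of Python; the cluster labels and minute groupings below are toy illustrations, not thesis data.

```python
import numpy as np

def joint_probability(minutes, n_clusters):
    # Frequency probability of each cluster over the whole dataset.
    counts = np.bincount(np.concatenate(minutes), minlength=n_clusters)
    p = counts / counts.sum()
    # Joint probability of a minute: product of the frequency
    # probabilities of its clusters (independence assumption).
    return np.array([np.prod(p[np.asarray(m)]) for m in minutes])

minutes = [[0, 0, 1], [0, 1, 2], [2, 2, 2]]   # toy cluster indices
jp = joint_probability(minutes, 3)
# The minute with the lowest joint probability is the 'most unusual'.
```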
Joint Correlation
On the other hand, Gaussian components or audio events can be assumed to be
correlated; for example, when it rains heavily, you do not expect many human voices
in the streets. The correlation matrix of the events is calculated and, for each
minute, the correlation coefficients of every combination of clusters that occurs are
summed. This single value per minute is called the joint correlation. Again, these
values are assumed to be normally distributed, and with a certain defined threshold,
anomalies are detected. A very low joint correlation means that many negatively
correlated events occur at the same time. For the results of the unusual minutes
based on joint correlation, see section 5.2.2.
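The joint correlation can be sketched similarly; the example minutes are again illustrative (real minutes contain five components each).

```python
import numpy as np
from itertools import combinations

def joint_correlation(minutes, n_clusters):
    # Binary occurrence matrix: one row per minute, one column per cluster.
    occ = np.zeros((len(minutes), n_clusters))
    for i, m in enumerate(minutes):
        occ[i, m] = 1.0
    # Correlation between the cluster-occurrence columns.
    C = np.corrcoef(occ, rowvar=False)
    # Joint correlation of a minute: sum of the correlation
    # coefficients over every pair of clusters present in it.
    return np.array([sum(C[i, j] for i, j in combinations(sorted(set(m)), 2))
                     for m in minutes])

minutes = [[0, 1], [0, 1], [2], [0, 2]]
jc = joint_correlation(minutes, 3)
```

In this toy example the last minute combines two negatively correlated clusters and therefore gets the lowest joint correlation.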
3.5.3. Contextual anomaly
Contextual anomalies are normal events that occur in a strange context of time. Here,
time is the deciding factor, not an outlying Gaussian component. For this reason, all
five Gaussian components are assigned to their cluster and replaced by that
centroid's mean parameters. The centroids used are generated by 5D GMM clustering,
not standardized. Each feature vector thus consists of 12 digits: five times two mean
values, plus two digits that represent the hour of the day in a continuous way. The
time digits are defined as follows. Imagine a clock face that is divided into 24
hours instead of 12, as in Figure 8. A single pointer then defines the time of day
and its x and y coordinates describe that time in a continuous manner. The continuity
is important because, for example, 23h is in reality close to 00h, but numerically
these values are the furthest apart, which would produce wrong results when
clustered, because clustering uses Euclidean distance. As an example, 6h becomes
[1,0] and 18h becomes [-1,0], and so on. The time is not discretized to whole hours
but approaches continuity, as minutes and seconds are also accounted for in the
pointer.
Figure 8: 24-hour clock
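One concrete convention for this encoding (0h at the top of the dial, running clockwise, so that 6h maps to [1, 0]) can be sketched as follows; the function is an illustration, not the thesis code.

```python
import math

def clock_coordinates(hour, minute=0, second=0):
    # Map the time of day onto the unit circle of a 24-hour clock so
    # that times near midnight stay close in Euclidean distance.
    frac = (hour + minute / 60 + second / 3600) / 24
    angle = 2 * math.pi * frac
    return math.sin(angle), math.cos(angle)

x6, y6 = clock_coordinates(6)     # pointer due east: (1, 0)
x18, y18 = clock_coordinates(18)  # pointer due west: (-1, 0)
```

Under this mapping, 23h and 0h lie only a short chord apart on the unit circle, which is exactly the continuity argued for above.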
Classification is now done by 12D GMM clustering, but this time all data must be
standardized. The scale of the mean values obviously differs from that of the clock
coordinates, as those axes are freely chosen; standardization puts all values on the
same scale, making sure that the time factor remains significant. For the results of
the contextual anomalies, see section 5.3.
3.5.4. Anomaly Threshold Definition
Possible System Outcomes
In general, anomaly detectors are not perfect. Specifically, anomaly detectors
typically navigate a trade-off between two kinds of errors:
False Positives - Type 1 error: A type 1 error occurs when an anomaly detector
incorrectly rejects a benign input. A high false positive rate can significantly impair
the utility of the anomaly detector. Each false positive denies some part of the
functionality of the system to the user.
False Negatives - Type 2 error: A type 2 error occurs when an anomaly detector
incorrectly accepts a malicious input, leaving the system open to attack.
Making an anomaly detector more sensitive increases the false positive rate and
decreases the false negative rate (and vice versa). Appropriately balancing these two
rates is therefore essential to obtaining an effective anomaly detector. Anomaly
detectors are typically tuned by running the detector on a representative set of
inputs to develop an intuitive understanding of how it will operate in practice. Such
techniques are ad hoc and not guided by any theoretically well-founded framework or
analysis; they therefore provide no guarantees on the effectiveness of the anomaly
detector and no guidance on how to test it effectively.
In this thesis, no representative set of inputs is available as a basis for hard
coding a threshold. Instead, the interaction between a human supervisor and a
constantly adapting threshold should improve the accuracy of the system over time,
with the goal that the supervisor eventually only confirms true positives.
Human defined anomalies
The first step in defining the anomaly threshold is a clear understanding of the
anomalies themselves. Different types of anomalies can be distinguished:
Binary anomalies are easy to understand. The anomaly that is sought is well known and
unanimously agreed to be anomalous, for example the occurrence of cancer in a person,
or the presence of an oil reservoir at a geographic location. Based on the
consequences of both false positives and false negatives, their desired rates can be
defined, together with the associated threshold for anomaly detection. For the
detection of cancer, for example, the consequence of a wrong assignment is enormous:
the life of a person depends on it, and a false negative could be fatal. A false
positive, on the other hand, is mainly a moral mistake, but won't affect the person's
physical health. A relatively high false positive rate and a small false negative
rate are therefore acceptable. Another example is the detection of oil fields. There,
you would rather be very sure that there is in fact an oil field before huge
investments in drilling installations are made. In this case, false positives should
be low compared to false negatives, since a false positive assumes an oil field
discovery where there is none, which implies a huge financial loss.
Non-binary anomalies are harder to interpret. The anomaly that is sought is not
exactly defined, but subject to the opinion of analysts and supervisors. It is
therefore difficult to measure the false positive and false negative rates, and
consequently to define the preferred ratio. A good example is any type of anomaly
detection in high-dimensional spaces. Anomalies are then anything that differs from
normal behaviour, whatever that might be. It is unclear in advance which anomalies
are likely to be harmful and which are totally acceptable. Only by supervising and
labelling the different types of anomalies can a more consistent and general
statement about the false positive and false negative ratio be made.
For urban environments, harmful or suspicious events are assumed to be very rare.
According to the United Nations Office on Drugs and Crime [28], one murder and 113
rape cases per 100 000 inhabitants are reported in Paris, good for 2508 cases yearly.
There are no direct numbers available for pickpocketing, but various unofficial
reports name Barcelona as the city with the highest pickpocket rate in the world, at
6000 cases a year. Paris is listed as the fifth worst pickpocket city, but no
specific numbers are provided, so for this example Paris is assumed to be as bad as
Barcelona. Furthermore, not all crime cases are reported, so in total roughly 30000
cases a year are estimated. The area covered by a single microphone placed in an
urban environment is around 2500 m², obstacles taken into account. In other words,
each microphone could report 0.5 suspicious events a year. These thoughts and
calculations are far from accurate and useless from a strictly scientific point of
view, but they are interesting to keep in mind when defining a threshold for anomaly
detection. They indicate that false negatives in the sense of crime are very unlikely
to occur (assuming that the system's recognition abilities are on point). However,
crime and violence are far from the only anomalies of interest. The false positives
will be crucial in the decision on the final anomaly threshold.
For every type of anomaly sought, the feature vectors are assigned an anomaly
measure. Whether it is the probability density function for single components or the
joint probability or joint correlation for combined occurrences, each clustering
technique has its anomaly measure. The feature vectors or input entities of the
entire dataset are ordered by that measure, in ascending order: the first entry in
the list is the 'strangest' case according to the computer, and further down the
list, cases get more 'normal'. The difficulty in this project is that the anomalies
cannot be transformed back to their original audio waves, which excludes the more
intuitive acoustical supervision. Instead, the corresponding spectrograms are plotted
and subjected to visual interpretation only.
Threshold
At first, the software system produces a list of positives, defined by an initial
threshold. That list contains both true positives and false positives, and the only
way to distinguish them is human supervision. Human supervision is time consuming and
expensive, so the list to be supervised must be limited, yet contain as many true
positives as possible. So where should the initial threshold be set? This trade-off
can be expressed in terms of one single measure: cost. Hereby, an important
assumption must be made: the human supervisor is always right; his or her verdict is
assumed to be correct. For the optimal cost analysis, the parameters to set, or the
questions to be answered, are the following:
- What is the cost of a false negative?
First of all, remember that a false negative means that an anomaly occurred but did
not get detected: the anomaly was never reported to the supervisor because the system
set its threshold too narrow. It is a failure of the system and in fact the worst
possible outcome, to be avoided by all means. The cost of an undetected anomaly can
be a direct cost, such as a damage claim for a harmed person or object due to the
lack of intervention. It can also be indirect, such as a loss of confidence in the
system and thus a future loss in sales. The total cost of a false negative can be
written as the sum of the direct and indirect costs: 𝐶𝑓𝑛 = 𝐶𝑑 + 𝐶𝑖.
- What is the cost of a true negative?
A true negative is a neutral outcome: no harmful situation occurred and no
supervision had to take place. This cost could even be interpreted as a negative cost
because it provides only benefits. A revenue and profit study is, however, beyond the
scope of this thesis, and the cost of a true negative is therefore set to zero; it
does not contribute to the equation.
- What is the cost of a false positive?
Each positive, whether false or true, is subject to human supervision. A false
positive is a harmless situation detected as harmful, so the operation cost is a pure
loss because the supervision was unnecessary. The cost of a false positive can be
written as: 𝐶𝑓𝑝 = 𝐶𝑜𝑝.
- What is the cost of a true positive?
A true positive is morally the least desirable situation; however, it is exactly the
purpose of the software and a confirmation of the accuracy of the system. Operation
costs are incurred, but they are compensated by direct and indirect revenues. Direct
revenues are, for example, more expensive policies in environments with an increased
exposure to danger, while indirect revenues can come from the growing credibility of
the system. The net cost is negative because customers are willing to pay for this
outcome. It can be written as: 𝑅𝑡𝑝 = 𝐶𝑜𝑝 − 𝑅𝑑 − 𝑅𝑖.
It is easy to see that true positives and true negatives are the preferred outcomes,
while false positives and false negatives involve only costs. A failure of the system
is bad but unfortunately impossible to exclude. The two errors are negatively
correlated, so a trade-off must be made. False negatives can involve damage to
humans, so their cost is weighted by a factor H (0 < H < 1) depending on the
environment, while false positives only bring financial costs. To define the
threshold, the following equation is used:
min𝑡 (𝐹𝑁/𝐻) ∗ 𝐶𝑓𝑛 + 𝐹𝑃 ∗ 𝐶𝑓𝑝 + 𝑇𝑃 ∗ 𝑅𝑡𝑝
or
min𝑡 (𝑓𝐹𝑁(𝑡)/𝐻) ∗ 𝐶𝑓𝑛 + 𝑓𝐹𝑃(𝑡) ∗ 𝐶𝑓𝑝 + 𝑓𝑇𝑃(𝑡) ∗ 𝑅𝑡𝑝
where:
𝐹𝑁: number of False Negatives
𝐹𝑃: number of False Positives
𝑇𝑃: number of True positives
𝐻: factor for moral damage to human beings
𝐶𝑓𝑛: Cost per false negative
𝐶𝑓𝑝: Cost per false positive
𝑅𝑡𝑝: Revenue per True positive
Note that FN, FP and TP are actually functions of the threshold t, whereby 𝑓𝐹𝑃(𝑡) is
exponential while 𝑓𝑇𝑃(𝑡) and 𝑓𝐹𝑁(𝑡) are logarithmic. This equation, however, can
only be solved when all parameters except the threshold are known. In reality, they
are almost never known beforehand, so different thresholds must be tried to learn the
functions.
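The threshold selection can be sketched in Python; the candidate thresholds and the (FN, FP, TP) counts below are made-up illustrative numbers, since in practice they must be estimated from supervised feedback.

```python
def total_cost(fn, fp, tp, C_fn, C_fp, R_tp, H):
    # Cost model from the text: false negatives amplified by 1/H
    # (0 < H < 1), each false positive costs one supervision, each
    # true positive yields net revenue (R_tp is negative).
    return (fn / H) * C_fn + fp * C_fp + tp * R_tp

def best_threshold(counts, C_fn, C_fp, R_tp, H):
    # counts maps a candidate threshold t to its (FN, FP, TP) triple.
    return min(counts,
               key=lambda t: total_cost(*counts[t], C_fn, C_fp, R_tp, H))

# Hypothetical counts at three candidate thresholds:
counts = {0.01: (4, 2, 6), 0.05: (1, 10, 9), 0.10: (0, 30, 10)}
t = best_threshold(counts, C_fn=100, C_fp=5, R_tp=-20, H=0.5)
```

With these illustrative numbers, the loosest threshold wins because eliminating false negatives outweighs the extra supervision cost.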
Machine Learning Threshold
The previous equation is stationary: 𝑓𝐹𝑃(𝑡), 𝑓𝑇𝑃(𝑡) and 𝑓𝐹𝑁(𝑡) do not change over
time, and neither does the dependency on a human supervisor. In fact, in the same
manner as the system in this thesis learns what normal Gaussian components are, it
can learn from the supervisor's verdicts and study the characteristics of the
confirmed true positives and false positives. The next time a positive comes in, the
system applies a second classification based on the listed true and false positives,
and only after that second classification is the supervisor notified. This
significantly decreases the need for the supervisor and increases the accuracy of the
system. The functions now evolve towards their best performance and the equation can
be applied for an optimal anomaly recognition rate.
4. Classical approach: Spectral Features
A more classical approach based on spectral features is carried out in order to
compare end results and to increase insight into the main approach.
4.1. Feature Extraction
4.1.1. Spectral Features
Each spectrogram is described by nine characteristics, low-level spectral features:
Spectral Energy, Spectral Centroid, Spectral Spread, Spectral Roll-off Point,
Spectral Entropy, Spectral Kurtosis (or flatness), Spectral Skewness, Spectral Slope
and Noisiness. The calculation of each of them can be found in A.1, or more
specifically in the Matlab code in A.2.8. The exact choice of features is not
critical, because Principal Component Analysis will decorrelate them and recombine
them into valuable features.
4.1.2. Principal Component Analysis (PCA)
In general, Principal Component Analysis (PCA) obtains a linear subspace of the
original data. It uses an orthogonal transformation to convert a set of observations
of possibly correlated variables (features) into a set of values of linearly
uncorrelated variables called principal components. Principal components reveal the
underlying structure of the data: they are the directions of maximal variance, also
called eigenvectors. However, variance is an absolute number, not a relative one.
This means that the variance of some features will be much larger than that of
others, purely because of the scale on which the feature is calculated. If the data
is not standardized, the largest variance, and thus the largest eigenvector, will
implicitly be determined by that one feature. To avoid this scale-dependent behaviour
of PCA, the data is centered and scaled (standardized) by subtracting from each
feature its mean and dividing the result by its standard deviation.
Furthermore, PCA assumes that the underlying components or features are normally
distributed. If they are, then PCA actually acts as an independent component
analysis, since uncorrelated Gaussian variables are statistically independent.
However, if the underlying components are not normally distributed, PCA merely
generates decorrelated variables, which are not necessarily statistically
independent. In that case, other algorithms, such as Independent Component Analysis
(ICA), might be a better choice. The distribution of each of the spectral features
rejects the hypothesis of normality in the one-sample Kolmogorov-Smirnov test;
however, an approximation of normality is sufficient. The total dataset contains
around 262 million spectrograms, so processing all of them is slow. Instead, a sample
is taken and its distribution plotted in a histogram. The sample size depends on four
factors: population size, margin of error, confidence level and response distribution
(or standard deviation). The necessary sample size is then:
𝑛 = (𝑍-𝑠𝑐𝑜𝑟𝑒)² ∗ 𝑆𝑡𝑑𝐷𝑒𝑣 ∗ (1 − 𝑆𝑡𝑑𝐷𝑒𝑣) / (𝑚𝑎𝑟𝑔𝑖𝑛 𝑜𝑓 𝑒𝑟𝑟𝑜𝑟)²
Note that the population size is not included in the equation: above 20000, the
population size no longer matters and this simplified version can be used. For a 95%
confidence level, a response distribution of 0.5 and a margin of error of 5%, the
recommended sample size is 385. To build in a margin, a sample size of 1000 is used,
selected randomly without replacement. The distribution of each of the nine spectral
features is shown below.
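The sample-size calculation can be written out directly; 385 follows from Z = 1.96, a response distribution of 0.5 and a 5% margin (the function name is illustrative).

```python
import math

def cochran_sample_size(z, p, margin):
    # Simplified Cochran formula n = z^2 * p * (1 - p) / margin^2,
    # valid for large populations (above roughly 20000).
    return math.ceil(z**2 * p * (1 - p) / margin**2)

n = cochran_sample_size(1.96, 0.5, 0.05)  # 385
```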
Most of the features are close enough to a normal distribution, which makes PCA a
valuable technique for feature reduction.
After standardizing each feature individually and applying the Principal Component
Analysis, the difficulty lies in determining the number of factors or components that
account for a large share of the overall variance. To that end, two stopping rules
have been considered to determine when to stop adding factors:
Amount of explained variance: the chosen factors should together explain at least 70%
of the total variance. To understand the meaning of "total variance" in a principal
component analysis, remember that the observed variables are standardized, which
means that each variable has a mean of zero and a variance of one. The "total
variance" in the dataset is simply the sum of the variances of these observed
variables, and each observed variable contributes one unit. This rule is not very
useful for this project, as it is not clear whether each variable makes a useful
contribution to the data.
Kaiser's stopping rule/eigenvalue-one criterion: according to this criterion, the
components with eigenvalues higher than 1 are chosen. As shown in Table 1, only two
newly composed components have an eigenvalue higher than one. However, two components
instinctively seem very few, and the third value of 0.6135 is still substantial.
Therefore, three components are selected.
3.6044  1.4307  0.6135  0.2058  0.1016  0.0416  0.0147  0.0054  0.0002
Table 1. Eigenvalues of spectral features
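The eigenvalues of Table 1 are those of the correlation matrix of the standardized features, and Kaiser's rule then reduces to counting values above 1. A minimal Python/NumPy sketch with random stand-in data (the actual spectral features are not reproduced here):

```python
import numpy as np

def pca_eigenvalues(X: np.ndarray) -> np.ndarray:
    """Eigenvalues of the correlation matrix of X (samples x features),
    sorted in descending order. After standardization each feature
    contributes one unit of variance, so the eigenvalues sum to the
    number of features."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return eigvals[::-1]

# Kaiser's rule: keep components with eigenvalue > 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))    # stand-in for the 9 spectral features
ev = pca_eigenvalues(X)
n_keep = int((ev > 1.0).sum())
print(ev.round(3), n_keep)
```

With the real, correlated features the spectrum is far more concentrated, as Table 1 shows; with independent random data all eigenvalues hover around 1.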
4.2. Classification
4.2.1. Point anomaly
Only one type of anomaly is detected with the classical approach based on spectral features. After classification, each spectrogram is labelled as normal or abnormal. The next step is to search along the temporal dimension for the number of abnormal spectrograms in a certain timeframe. To be able to compare the results to the main approach, a frame of one minute, or 480 spectrograms, is used. Again, no overlap is used, to simulate the minutes of the main approach. If the number of abnormal spectrograms in a certain frame exceeds a certain threshold, that frame or minute is classified as a point anomaly.
Gaussian Mixture Model
The feature vectors now consist of three values and form the input for classification. The distribution of the new features would be Gaussian if all underlying features were normally distributed. As the underlying features are known not to be normally distributed, their histograms are again plotted for a closer view. If they were perfectly normally distributed, a GMM with only three clusters would be needed.
The histograms reveal that the newly composed features are not perfectly normally distributed. For this reason, the number of clusters is determined by the AIC criterion, which results in seven clusters.
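The AIC-based choice of the number of mixture components can be sketched as follows. This is an illustrative scikit-learn re-implementation, not the thesis' Matlab code, and the demo data is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_n_components(X: np.ndarray, max_components: int = 12, seed: int = 0) -> int:
    """Fit GMMs with 1..max_components clusters and return the count
    that minimizes the Akaike Information Criterion (AIC)."""
    aics = []
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        aics.append(gmm.aic(X))
    return int(np.argmin(aics)) + 1

# Three well-separated blobs: the AIC should favour roughly three clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 3)) for c in (0.0, 5.0, 10.0)])
print(best_n_components(X))
```

The same loop, run on the three composed features, yields the seven clusters mentioned above.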
Temporal Smoothing
The basic idea of smoothing is to filter out noise and random anomalies so that only real anomalies remain. Consider the following sequence of spectrograms, where 'n' stands for normal and 'a' for anomalous: n-n-n-n-n-n-n-n-n-n-a-n-n-n-n-n-n-n-n-n-n-... The single anomaly in such a sequence is most likely due to noise and gets labelled as normal. In this research, temporal smoothing is not used. Instead, a simpler approach is used, which counts the number of anomalies per minute. A threshold on the number of anomalies per minute defines whether the minute is anomalous or not.
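The per-minute counting rule can be sketched in a few lines. The threshold value below is a placeholder, since the text leaves its exact value open:

```python
import numpy as np

def anomalous_minutes(labels: np.ndarray, frame: int = 480, threshold: int = 10) -> np.ndarray:
    """Split a stream of per-spectrogram labels (1 = anomalous, 0 = normal)
    into non-overlapping one-minute frames of 480 spectrograms each and
    flag a minute when its anomaly count exceeds the threshold."""
    n_frames = len(labels) // frame
    counts = labels[: n_frames * frame].reshape(n_frames, frame).sum(axis=1)
    return counts > threshold

labels = np.zeros(960, dtype=int)   # two minutes of normal spectrograms
labels[480:500] = 1                 # 20 anomalous spectrograms in minute 2
print(anomalous_minutes(labels))    # only the second minute is flagged
```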
5. Results new approach
In 3.5.4 Anomaly Threshold Definition, it is stated that all feature vectors are ordered and plotted in descending order of how anomalous they are. In practice, it would be impractical, both in computational effort and in clarity of interpretation, to plot all of the feature vectors. Only the 0.05 percentile most anomalous vectors are selected as candidate anomalies. The exact number of anomalies representing that 0.05 percentile is reported below for each anomaly type.
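Selecting the 0.05 percentile can be sketched as follows (a hypothetical Python helper; the thesis works in Matlab and defines the exact ranking criterion in 3.5.4):

```python
import numpy as np

def candidate_anomalies(scores: np.ndarray, percentile: float = 0.05) -> np.ndarray:
    """Indices of the `percentile`-% lowest-density feature vectors,
    ordered most-anomalous first. `scores` are the probability density
    values assigned by the classifier: lower density = more anomalous."""
    cutoff = np.percentile(scores, percentile)
    idx = np.where(scores <= cutoff)[0]
    return idx[np.argsort(scores[idx])]

rng = np.random.default_rng(0)
scores = rng.random(10_000)
cands = candidate_anomalies(scores)
print(len(cands))   # roughly 0.05% of 10,000, i.e. around 5 candidates
```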
5.1. Unusual events
5.1.1. GMM 2D
Matlab graph 11 visualizes the 0.05 percentile anomalous Gaussian components, based on their mean values only. This 2D representation is intuitive: components with a low probability density are seen as anomalous.
Matlab graph 11: mean values of 1257 anomalies defined by 2D GMM
Anomaly types
The underlying spectrogram of each of these anomalous Gaussian components is plotted in its occurring minute, starting with the most anomalous one. The Gaussian components are also plotted, to see whether they give a proper estimation of the spectrogram and an accurate anomaly indication. Among the most anomalous Gaussian components, four types can be distinguished:
Anomaly type 1: High amplitudes at high frequency values, shown in Matlab graph 12. The Gaussian component fits the spectrogram quite accurately, so the anomaly detection is correct.
Matlab graph 12: Anomaly type 1: high values at high frequencies
Anomaly type 2: High amplitude spread at high frequency, shown in Matlab graph 13. These peaks occur mostly at a single frequency band, but sometimes over several adjacent frequency bands. They are described by a dedicated Gaussian component and correctly reported as anomalous.
Matlab graph 13: Anomaly type 2: High amplitude spread at high frequency
Anomaly type 3: High amplitudes at low frequencies, shown in Matlab graph 14. These occur mostly over two frequency bands and get a dedicated Gaussian component assigned.
Matlab graph 14: High amplitudes at low frequencies
Anomaly type 4: Low total amplitude variability, shown in Matlab graph 15. It frequently happens that one Gaussian component covers almost all frequency bands. As visible in the figure, that Gaussian component represents the spectrogram in combination with the other Gaussians, but on its own it does not fit the data points and thus does not represent a particular underlying audio event. These minutes probably need a different number of Gaussian components for accurate data fitting. They are indeed anomalous, but they raise the question of an alternative model for data representation.
Matlab graph 15: Anomaly type 4: Low total amplitude variability
Model shortcomings
Anomaly type 4 exposes an important issue at the very foundation of the model. Each minute is different and might need a different number of Gaussian components to describe it optimally. While 95% of the minutes point to 5 components as the optimal number, the remaining 5% favour a different number. Evidently, these 5% are anomalous in a way, and therefore of interest, but they are described inaccurately. The following figures show some of these anomalies.
Example 1: Event switch during the minute, shown in Matlab graph 16 and Matlab graph 17. By restricting the number of components to five, the model is forced to 'merge' different audio information into one component, reducing its accuracy.
Matlab graph 16: Event switch during minute
Matlab graph 17: Event switch during minute
Example 2: Inaccurate data fitting, shown in Matlab graph 18 and Matlab graph 19. Because the data is not perfectly normally distributed, an overlapping Gaussian component may cover some of that asymmetric skew. However, that Gaussian on its own does not represent any underlying audio event, and that minute might be better described by fewer components.
Matlab graph 18: Inaccurate data fitting
Matlab graph 19: Inaccurate data fitting
Threshold
There is no need to define a threshold for anomalies here, because this technique will not be used in the end. All techniques are first compared mutually in 5.1.4 2D vs. 5D vs. 5D standardized and with the classical approach in 6 Results Classical Approach.
5.1.2. GMM 5D
Both the mean values and the variances of the Gaussian components now count for clustering. In other words, the feature vector consists of five values instead of two. The 2D plot in Matlab graph 20 only represents the mean values of the anomalies and immediately shows that the variance has a significant impact on clustering.
Matlab graph 20: Mean values of 1257 anomalies defined by 5D GMM
Types of anomalies
Anomaly type 1: Microphone failure or resumption, shown in Matlab graph 21. The minute in which a microphone fails or resumes gives both zero values and non-zero values in one single spectrogram. The zero values strongly affect the Gaussian components, especially their variance. Note that this happens due to the hard-coded number of Gaussian components; otherwise the model might fit an additional dedicated Gaussian component with a normally behaving variance, which would not be flagged as anomalous.
Matlab graph 21: Anomaly type 1: Microphone failure or resumption
Anomaly type 2: All other anomalies, with examples in Matlab graph 22, Matlab graph 23 and Matlab graph 24. In essence, all flagged anomalies are spectrograms that cannot be captured accurately in five components. Data points with different distributions have to be squeezed into a single component, resulting in unlikely variances. Matlab graph 22 is an example of a spectrogram that needs more components to be described accurately. Matlab graph 23 would clearly split the skew of the low frequency values and the noise at the high frequency values into different components. Matlab graph 24 is less straightforward; the Gaussian component even seems redundant.
Matlab graph 22: Anomaly type 2
Matlab graph 23: Anomaly type 2
Matlab graph 24: Anomaly type 2
Model shortcomings
Each one of the anomalies represents a bad fit and points to a weakness of the system. The problem is that the anomaly does not really point to specific underlying data. It seems that most of the anomalies need more Gaussian components to describe the minute accurately. When the AIC criterion is applied to all anomalies, an average of nine comes out as the optimal number of components. The question arises whether five Gaussian components, although optimal for 94% of the minutes, is actually the right number to work with. To test this, all minutes are recalculated from the beginning, this time with nine components. For those newly described minutes, the anomalies are calculated in exactly the same way as with five components. Some of the resulting anomalies are shown below.
Matlab graph 25: Anomaly based on 9 Gaussian components
Matlab graph 26: Anomaly based on 9 Gaussian components
Matlab graph 27: Anomaly based on 9 Gaussian components
A new problem arises, namely overfitting. Because nine components are optimal for only a few minutes, all too often there are Gaussian components that take on unusual variances because of the overfitting, not because of an underlying anomaly. The idea of nine Gaussian components for all minutes is abandoned.
Reconstruction of GMM on anomalies
When supervising the anomalies based on five Gaussian components, as noted before, it is obvious that many of those minutes are actually underfitted, i.e. more Gaussian components are needed to describe the minute. This means that many minutes are probably not even anomalous, but only labelled as anomalous due to a misfit. To solve this, the anomalies assigned with five Gaussian components get a second treatment. The GMM on those minutes is reconstructed, without hard-coding the number of Gaussian components per minute. Each minute is thus assigned its optimal number of components. Each of these newly constructed components goes through the classifying process and the remaining anomalous Gaussian components are observed. Of the 1257 anomalies, only 265 remain. Those 265 anomalies are compared to the supervision applied to the 1257 anomalies. Only 28% of the 265 are defined as true positives by supervision, doing worse than the original anomaly selection and suggesting many false negatives if this technique is followed. Consider the following minute:
Matlab graph 28: anomalous Gaussian out of 5
Matlab graph 29: anomalous Gaussian out of 9
The second fit is barely better than the first, yet requires a lot of computational power to obtain. The same happens for almost all minutes. Another example is shown below.
Matlab graph 30: anomalous Gaussian out of 5
Matlab graph 31: anomalous Gaussian out of 9
The second round, with a customized number of Gaussians applied to each anomalous candidate, is abandoned as it does not give better results.
Threshold
An initial threshold is set at a probability density function (pdf) value of 7.77e-13. All 1257 results are reviewed by the study group and around 70% of them are true positives, leaving 30% false positives. Interestingly, the number of true positives does not decrease towards the end of the list, suggesting that there are many false negatives and the threshold must be broadened. After the threshold has been changed a couple of times, a regression function can be fitted and an optimal threshold can be set. From then on, all false positives join a new population, on which a similar 5D GMM classifier is applied. Every newly incoming candidate anomaly now goes through this second classifier, and only when it is still suggested to be a true positive is it supervised.
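The two-stage screening described above can be sketched as follows. This is an illustrative Python/scikit-learn sketch, not the thesis implementation: the helper names, the component count and the 10th-percentile cutoff are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

PDF_THRESHOLD = 7.77e-13   # the initial hard threshold quoted in the text

def stage1_candidates(pdf_values):
    """Stage 1: flag every feature vector whose probability density
    falls below the hard threshold."""
    return np.where(pdf_values < PDF_THRESHOLD)[0]

def fit_false_positive_model(fp_vectors, n_components=5):
    """Stage 2: fit a GMM on the supervised false positives. The cutoff
    is the 10th percentile of their log-likelihoods, so most known
    false positives would be suppressed by stage2_filter below."""
    model = GaussianMixture(n_components=n_components, random_state=0).fit(fp_vectors)
    cutoff = np.percentile(model.score_samples(fp_vectors), 10)
    return model, cutoff

def stage2_filter(model, cutoff, candidates):
    """Keep only candidates the false-positive model does NOT explain
    well, i.e. the ones that still look like genuine anomalies."""
    return candidates[model.score_samples(candidates) <= cutoff]

# Demo: a candidate inside the false-positive cloud is suppressed,
# while a remote one survives.
rng = np.random.default_rng(0)
fp = rng.normal(0.0, 1.0, (300, 5))
model, cutoff = fit_false_positive_model(fp)
cands = np.vstack([np.zeros((1, 5)), np.full((1, 5), 10.0)])
print(stage2_filter(model, cutoff, cands).shape[0])
```

As in the text, the second stage only reduces the supervisor's workload; every surviving candidate still goes to supervision.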
5.1.3. GMM 5D standardized
Matlab graph 32: Mean values of 1257 anomalies defined by 5D standardized GMM clustering
At first sight, the anomalies seem quite similar to those created by non-standardized 5D clustering. Therefore, the anomalies of the three techniques are compared not only with each other, but also with a totally different, more classical approach based on spectral features. Both the mutual comparison of techniques in 5.1.4 and the comparison with the classical approach in 6 point to non-standardized 5D clustering as the optimal technique. The standardized results are therefore not discussed further.
5.1.4. 2D vs. 5D vs. 5D standardized
The anomalies assigned by each of the three different cluster dimensions are compared below. The order is not taken into account, only the presence of the same minute among the detected anomalous minutes of each technique.
                 2D      5D      5D standardized
2D               100%    34.4%   35.9%
5D               34.4%   100%    37.4%
5D standardized  35.9%   37.4%   100%
Table 2: Comparison 2D, 5D and 5D standardized clustering
The table confirms the expectation that 5D clustering without standardization and 5D clustering with standardization overlap the most, while 2D clustering overlaps more with standardized 5D, because the standardization process enlarges the impact of the mean relative to the variance. Unfortunately, this information does not reveal which of the techniques is best. Therefore, the decision is based on visual supervision of the spectrograms and, additionally, all techniques are compared with a totally different approach: the classical approach based on spectral features. Based on supervision, the non-standardized method is preferred by professor Botteldooren, because the measures of both mean and variance are expressed in the same units, and standardizing them independently might distort their relationship. For the comparison with the classical approach, see Chapter 6. There too, non-standardized 5D clustering comes out as the best approach.
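The overlap percentages of Table 2 amount to set intersections over minute identifiers; a minimal sketch (Python, with made-up minute identifiers):

```python
def overlap_matrix(sets: dict) -> dict:
    """Pairwise overlap of anomalous-minute sets, as in Table 2: the
    fraction of minutes flagged by one technique that is also flagged
    by another. Order is ignored; only membership counts."""
    names = list(sets)
    return {
        (a, b): len(sets[a] & sets[b]) / len(sets[a])
        for a in names for b in names
    }

# Toy example: four flagged minutes each, two in common -> 0.5 overlap.
sets = {
    "2D": {1, 2, 3, 4},
    "5D": {3, 4, 5, 6},
}
print(overlap_matrix(sets)[("2D", "5D")])  # -> 0.5
```

Because all three techniques flag exactly 1257 minutes, the resulting matrix is symmetric, as in Table 2.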
5.2. Unusual minutes
5.2.1. Joint Probability
The results of the joint probability still await supervision. The method of detection is described in 3.5.2. 35.7% of the assigned anomalies also occur among the anomalous events.
Matlab graph 33: Unusual minute based on joint probability
5.2.2. Joint Correlation
Unusual minutes based on joint correlation also still await supervision and can be treated with the same supervision/threshold system as anomalous events. None of these anomalies also occur among the unusual events. Compared to the anomalies assigned by joint probability, they seem less anomalous. An example is shown below.
Matlab graph 34: Unusual minute based on joint correlation
5.3. Contextual anomalies
Contextual anomalies are defined based on their timing; for this reason, they appear visually normal. A human supervisor who knows the environment or neighbourhood should review these. It is close to impossible for me as an individual, without further data on the setting of the microphone, to decide whether such a minute is actually anomalous. As visible in the Matlab graphs, the minutes do not appear to be unusual.
Matlab graph 35: unusual minute by their time context
6. Results Classical Approach
The Classical Approach has some shortcomings compared to the new approach. First of all, the program is computationally very heavy to run, too heavy in fact. Sampling is necessary, which decreases the quality of the result. Furthermore, the technique does not capture any temporal relationships, while the new approach, with one GMM per minute, describes both spectral and temporal relationships in a single model. As this technique needs much more research to make it fully valuable, the spectrograms of its anomalies are not studied. This setup is only used to get more insight into the new approach, and of course also for the sake of learning the algorithms that are used.
6.1. Point anomalies
As stated before, only point anomalies are defined by this technique. The same percentile of 0.05% is used, which again results in 1257 anomalous minutes, just as many as in the new approach. It is now essential to compare the dates of the anomalies. Each row [year month day hour minute] of the classical approach anomalies matrix is looked up for existence in the matrix of the new approach. Keep in mind that the two techniques differ totally in their ideology: where the new approach searches for one anomalous Gaussian component per minute (out of 5), the classical approach considers the total minute, not just a fragment of it. For this reason, significant differences are expected. The table below shows the comparison results.
                      2D      5D      5D standardized
Number of anomalies   1257    1257    1257
Matching anomalies    332     435     377
Percentage matching   26.4%   34.6%   29.9%
Table 3: Comparison classical approach with main approach
5D clustering without standardization comes out best, and visually it is also the most promising method. For this reason, 5D clustering is used as the basis for unusual minutes and contextual anomalies.
7. Model Extension
This chapter is an extension of the main model, intended to give a better understanding of its flexibility and general applicability. The concept is to take all nodes (microphones) as input for the feature extraction. The classifier is then applied only to node 275, the subject node of this thesis. The same threshold should now assign far fewer anomalies and reveal the 'general' anomalies instead of only the local ones.
7.1. Feature Extraction
Due to time constraints, not all data of all nodes is processed. The following nodes are included in the extension: 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 264, 265, 266, 268, 273 and 274. Per node, the features of around six months are generated by GMMs, in exactly the same manner as described in 3.4.2: again, each minute is described by five Gaussian components, each consisting of five values, two for the mean and three for the variance.
7.2. Classification
The features of all those nodes are clustered in exactly the same manner as in 3.5. Then, the feature vectors of node 275 are classified by this new model and the results are compared. Unfortunately, the program could not finish running before the due date of the written report. Therefore, the results will be sent to the study group and included in the public defense.
8. Conclusion and Outlook
8.1. Conclusion
Acoustic information is a highly valuable source of information for environmental
context awareness. Based on the human ability to interpret sound signals in an
effortless way, this research aims to develop a system that performs as close as
possible to the performance of a human supervisor. One of the main difficulties of this
thesis is that the transformed data is not reversible to its original audio waves, which
makes acoustic supervision impossible. Another difficulty is that the data is of significant size, calling for computationally efficient techniques and creative thinking. This thesis only uses the incoming data as input, without other assumptions, labelled datasets, or metadata. This unsupervised approach has the advantage that all results originate directly from the input data; no other knowledge can be mistakenly applied. The model that this thesis proposes makes use of Gaussian Mixture Models for feature extraction. More specifically, all the spectrograms of a certain timeframe are modelled by one GMM. This approach not only allows significant data reduction, it also captures both spectral and temporal relations in a single model. The newly created features, which are the parameters of these Gaussian components, now serve to form different types of feature vectors, depending on the type of anomaly that is sought.
Looking at the results for the unusual events, the created model fits the data very accurately, and where it does not, a supervisor helps classify true positives and false positives. The latter are input for another GMM classifier that is gradually updated and, over time, not only replaces the human supervisor but also reduces the total error rate.
Besides those anomalous events, the combination of events within a timeframe can also behave unusually. These still await supervision, but can be treated with a similar approach as the unusual events.
Finally, moments that are totally normal from an acoustic point of view can still be unusual given their timing. A deeper understanding of the environment is necessary to examine these and, again, the false positives can become the input of a new classifier.
Instead of applying a hard threshold to the nomination of anomalies, a more intuitive and defensible technique is applied. The rate of false positives is initially set too high and a human supervisor assigns each anomaly a label: 'false positive' or 'true positive'. The false positives are stored and their characteristics are learned by the system. This self-enhancement, also called machine learning, gradually decreases the rate of false positives and increases the accuracy of the system.
Besides the significant data reduction, the speed of the program and the advantages of unsupervised learning, another advantage of this research is that the developed technique can be applied to any environment. The technique will learn the location's specific features and increase its accuracy over time.
8.2. Outlook
The duration of a thesis project allows only a certain depth of research, so evidently there is room for improvement.
A first possibility is to describe each minute by its optimal number of Gaussian components and then apply the classification. This is, however, a very computationally intensive approach, though it could still operate faster than real time once the system runs on newly incoming data. The training set, however, is too big for that approach. Furthermore, the level of difficulty increases significantly, as some of the feature vectors then have variable length. Although the accuracy of the system would improve, it is doubtful whether this compensates for the increased complexity and computational intensity.
Furthermore, conceptual anomalies are not addressed in this research. The GMMs only capture small-scale temporal relations, between one minute and one day. However, the evolution of the environment over time is also very important and could reveal trends, seasonality, and so on.
Another interesting topic for future work is to build taxonomies for different types of environments. Instead of using a huge training set every time this program is applied to a new environment, knowledge of similar locations could be used to converge faster and improve the level of anomaly accuracy.
A. Appendix
A.1. Features
A.1.1. Spectral Centroid (SC)
Spectral centroid represents the “balancing point”, or the midpoint of the spectral
power distribution of a signal. It is related to the brightness of a sound. The higher the
centroid, the brighter (high frequency) the sound is. A spectral centroid provides a
noise-robust estimate of how the dominant frequency of a signal changes over time.
As such, spectral centroids are an increasingly popular tool in several signal
processing applications, such as speech processing. Spectral centroid is obtained by
evaluating the “center of gravity” using the Fourier transform’s frequency and
magnitude information. The individual centroid of a spectral frame is defined as the
average frequency weighted by amplitudes, divided by the sum of the amplitudes. The
following equation shows how to compute the spectral centroid, SC_i, of the i-th audio frame:

SC_i = Σ_{k=0}^{K−1} k · |X_i(k)|² / Σ_{k=0}^{K−1} |X_i(k)|²
Here, X_i(k) is the amplitude corresponding to bin k (in the DFT spectrum of the signal) of the i-th audio frame and K is the size of the frame. The result of the spectral centroid is a bin index within the range 0 < SC < K − 1. It can be converted either to Hz (using the following equation) or to a parameter between zero and one by dividing it by the frame size K. The frequency of bin index k can be computed from the block (frame) length K and sample rate f_s by:

f(k) = (f_s / K) · k
Low results indicate significant low frequency components and insignificant high
frequency components (low brightness) and vice versa.
A.1.2. Spectral Spread (SS)
The spectral spread is the second central moment of the spectrum. It is a measure that
signifies if the power spectrum is concentrated around the centroid or spread out over
the spectrum. In order to compute it, one has to take the deviation of the spectrum
from the spectral centroid, according to the following equation:

SS_i = sqrt( Σ_{k=0}^{K−1} (k − SC_i)² · |X_i(k)|² / Σ_{k=0}^{K−1} |X_i(k)|² )
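The centroid and spread equations translate directly into code. A Python/NumPy sketch: the appendix sums over the full DFT, while here only the positive-frequency half (rfft) is used, which is equivalent for real-valued signals:

```python
import numpy as np

def spectral_centroid_spread(frame: np.ndarray):
    """Spectral centroid and spread of one audio frame, following the
    two equations above. Both are returned as (fractional) bin indices."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    k = np.arange(len(power))
    centroid = np.sum(k * power) / np.sum(power)
    spread = np.sqrt(np.sum((k - centroid) ** 2 * power) / np.sum(power))
    return centroid, spread

# A pure tone concentrates the power in one bin: the centroid sits at
# that bin and the spread is close to zero.
fs, K = 8000, 1024
t = np.arange(K) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)
c, s = spectral_centroid_spread(tone)
print(c * fs / K)   # centroid converted to Hz, close to 1000
```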
A.1.3. Spectral Roll-off Point (SRP)
The spectral rolloff point is the N% percentile of the power spectral distribution,
where N is usually 85% or 95%. The spectral rolloff point is the frequency below
which N% of the magnitude distribution is concentrated. It increases with the
bandwidth of a signal. Spectral rolloff is extensively used in music information
retrieval and speech/music segmentation. The spectral rolloff point is calculated as
follows:
SRP = f(N) = (f_s / K) · N

where N is the largest bin that fulfills:

Σ_{k=0}^{N} |X(k)|² ≤ TH · Σ_{k=0}^{K−1} |X(k)|²
Here X(k) are the magnitude components, k the frequency index and f(N) the (frequency) spectral roll-off point containing (100·TH)% of the energy. TH is a threshold between 0 and 1; commonly used values are 0.85 and 0.95. This measure is useful in distinguishing voiced speech from unvoiced: unvoiced audio has a high proportion of energy in the high-frequency range of the spectrum, whereas most of the energy for voiced speech and music is contained in the lower bands.
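The roll-off inequality reduces to a cumulative sum over the power spectrum. A Python/NumPy sketch, again using the positive-frequency half of the spectrum:

```python
import numpy as np

def spectral_rolloff(frame: np.ndarray, th: float = 0.85) -> int:
    """Largest bin N whose cumulative spectral energy is still at most
    TH times the total energy, following the inequality above."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    cumulative = np.cumsum(power)
    below = np.where(cumulative <= th * cumulative[-1])[0]
    return int(below[-1]) if below.size else 0

# For white noise the spectrum is roughly flat, so the roll-off point
# sits near TH times the number of bins.
rng = np.random.default_rng(0)
noise = rng.normal(size=1024)
print(spectral_rolloff(noise))
```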
A.1.4. Spectral Entropy (SE)
Spectral entropy is computed in a similar manner to the entropy of energy, although this time the computation takes place in the frequency domain. More specifically, we first divide the spectrum of the short-term frame into L sub-bands (bins). The energy E_f of the f-th sub-band, f = 0, ..., L−1, is then normalized by the total spectral energy, that is:

n_f = E_f / Σ_{f=0}^{L−1} E_f
The entropy of the normalized spectral energies n_f is finally computed according to the equation:

H = − Σ_{f=0}^{L−1} n_f · log2(n_f)
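The two equations above combine into a short function. A Python/NumPy sketch (the number of sub-bands L is a free parameter; 8 is an illustrative choice):

```python
import numpy as np

def spectral_entropy(frame: np.ndarray, n_subbands: int = 8) -> float:
    """Spectral entropy per the two equations above: sub-band energies
    are normalized into a distribution n_f, whose Shannon entropy is
    returned (in bits)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    usable = len(power) - len(power) % n_subbands   # drop remainder bins
    sub = power[:usable].reshape(n_subbands, -1).sum(axis=1)
    n_f = sub / sub.sum()
    n_f = n_f[n_f > 0]                              # avoid log2(0)
    return float(-np.sum(n_f * np.log2(n_f)))

# A pure tone concentrates energy in one sub-band (entropy near 0),
# while white noise spreads it evenly (entropy near log2(8) = 3 bits).
fs = 8000
t = np.arange(1024) / fs
print(spectral_entropy(np.sin(2 * np.pi * 1000.0 * t)))
```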
A.1.5. Spectral Kurtosis or flatness
The kurtosis gives a measure of the flatness of a distribution around its mean value. It is computed from the 4th-order moment. The kurtosis K indicates the peakedness or flatness of the distribution: K = 3 corresponds to a normal distribution, K < 3 to a flatter distribution and K > 3 to a more peaked distribution.
A.1.6. Mel Frequency Cepstral Coefficients (MFCC)
MFCCs originate from automatic speech recognition but evolved into one of the standard techniques in most domains of audio recognition, such as environmental sound classification. They represent the timbral information (spectral envelope) of a signal. Computation of MFCCs includes conversion of the Fourier coefficients to the Mel scale. After conversion, the obtained vectors are log-transformed and decorrelated by the discrete cosine transform (DCT) in order to remove redundant information. Figure 9 shows the process of MFCC feature extraction.
Figure 9. MFCC extraction process
The first step, pre-processing, consists of pre-emphasizing, frame blocking and windowing of the signal. The aim of this step is to model small (typically 20 ms) sections of the signal (frames) that are statistically stationary. The window function, typically a Hamming window, removes edge effects. The next step takes the Discrete Fourier Transform (DFT) of each frame. Only the logarithm of the amplitude spectrum is retained: phase information is discarded because perceptual studies have shown that the amplitude of the spectrum is much more important than the phase, and the logarithm of the amplitude is taken because the perceived loudness of a signal has been found to be approximately logarithmic. After the discrete Fourier transform, the power spectrum is transformed to the Mel-frequency scale. This step smooths the spectrum and emphasizes perceptually meaningful frequencies. The Mel-frequency scale is based on the mapping between actual frequency and the pitch perceived by the human auditory system. The mapping is approximately linear below 1 kHz and logarithmic above. This is done using a filter bank consisting of triangular filters, spaced uniformly on the Mel scale. An approximate conversion between a frequency value in Hertz (f) and in mel is given by:
mel(f) = 2595 · log10(1 + f/700)
Finally, the cepstral coefficients are calculated from the mel-spectrum by taking the discrete cosine transform (DCT) of the logarithm of the mel-spectrum. This calculation is given by:

c_i = Σ_{k=0}^{K−1} log(S_k) · cos( (iπ/K) · (k − 1/2) )
where c_i is the i-th MFCC, S_k is the output of the k-th filter bank channel (i.e. the weighted sum of the power spectrum bins on that channel) and K is the number of coefficients (the number of Mel filter banks). The value used for K is usually between 20 and 40, mostly 23.
The components of MFCCs are the first few DCT coefficients that describe the coarse spectral shape. The first DCT coefficient represents the average power (energy) in the spectrum. The second coefficient approximates the broad shape of the spectrum and is related to the spectral centroid. The higher-order coefficients represent finer spectral details (e.g., pitch). In practice, the first 8-13 MFCC coefficients are used to represent the shape of the spectrum. The higher-order coefficients are ignored since they provide mostly redundant information. However, some applications require more higher-order coefficients to capture pitch and tone information. The Mel spectrum is particularly useful in machine learning tasks, because it is stable to deformation under a Euclidean norm, unlike the spectrogram. However, the averaging used to create the Mel spectrum causes a significant loss of high-frequency information unless the window size is kept small.
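The Hertz-to-mel mapping above, together with its inverse (needed to place the triangular filter edges), can be written directly; an illustrative Python sketch:

```python
import math

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when converting uniformly spaced
    mel points back to filter-edge frequencies in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The two functions are exact inverses, and 700 Hz maps to 2595 · log10(2) ≈ 781 mel.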
A.1.7. Bark bands
Although the Mel bands are used for the Mel Frequency Cepstral Coefficients, the Bark bands are a better approximation of the human auditory system. The latter will be used for the calculation of the loudness, specific loudness, sharpness and spread. Conversion from Hz to the Bark scale:
B = 13 · arctan(f/1315.8) + 3.5 · arctan(f/7518)
where B is the frequency expressed in Bark, and f in Hertz. The linear frequency axis is converted into the Bark scale. The Bark-scale axis is then divided into 24 equally spaced bands. The energies of the bins k of the FFT corresponding to each Bark band z are then summed up to form the contribution to band z.
ampl_band_v(z) = Σ_{k=begin(z)}^{end(z)} A_k²

where A_k is the amplitude of bin k of the FFT.
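The conversion and the per-band energy summation can be sketched as follows (Python for illustration; the uniform division of the Bark axis into 24 bands follows the description above, and the helper names are assumptions):

```python
import math

def hz_to_bark(f):
    """B = 13*arctan(f/1315.8) + 3.5*arctan(f/7518)."""
    return 13.0 * math.atan(f / 1315.8) + 3.5 * math.atan(f / 7518.0)

def bark_band_energies(magnitudes, sample_rate, n_bands=24):
    """Sum squared FFT-bin amplitudes into n_bands equally spaced
    Bark bands covering 0 .. sample_rate/2."""
    nyquist = sample_rate / 2.0
    top = hz_to_bark(nyquist)
    bands = [0.0] * n_bands
    n = len(magnitudes)
    for k, a in enumerate(magnitudes):
        freq = k * nyquist / max(n - 1, 1)
        z = min(int(hz_to_bark(freq) / top * n_bands), n_bands - 1)
        bands[z] += a * a
    return bands
```

Since every bin is assigned to exactly one band, the total spectral energy is preserved by the summation.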
A.1.8. Zero Crossing Rate (ZCR)
The Zero Crossing Rate is the most common type of zero-crossing-based audio feature. It is defined as the number of time-domain zero crossings within a processing frame. It indicates the frequency at which the signal amplitude changes sign. The ZCR allows for a rough estimation of the dominant frequency and the spectral centroid. We used the following equation to compute the average zero-crossing rate:
ZCR = (1/2N) · Σ_{n=1}^{N} |sgn(x(n)) − sgn(x(n−1))|
where x is the time-domain signal, sgn is the signum function, and N is the size of the processing frame. The signum function implementation can be defined as:

sgn(x) = 1 if x ≥ 0, −1 if x < 0
One of the most attractive properties of the ZCR is that it is very fast to calculate. As it is a time-domain feature, there is no need to calculate spectra. Furthermore, a system which uses only ZCR-based features would not even need analog-to-digital conversion, but only the information on whenever the sign of the signal changes. However, the ZCR can be sensitive to noise. Though using a threshold value (level) near zero can significantly reduce the sensitivity to noise, determining an appropriate threshold level is not easy.
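The definition maps directly to code; a minimal Python sketch using the sgn convention above:

```python
def zero_crossing_rate(x):
    """ZCR = (1/2N) * sum |sgn(x[n]) - sgn(x[n-1])|,
    with sgn(v) = 1 for v >= 0 and -1 for v < 0."""
    sgn = lambda v: 1 if v >= 0 else -1
    n = len(x)
    return sum(abs(sgn(x[i]) - sgn(x[i - 1]))
               for i in range(1, n)) / (2.0 * n)
```

A frame that flips sign at every sample, e.g. [1, −1, 1, −1], gives ZCR = 0.75, while a constant frame gives 0.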
A.1.9. Spectral Flux (SF)
The Spectral Flux is a norm of the frame-to-frame spectral amplitude difference vector (here, the sum of the absolute magnitude differences). It quantifies the amount of frame-to-frame fluctuation in time, i.e., it measures the change in the shape of the power spectrum. It is computed via the energy difference between consecutive frames as follows:
SF_f = Σ_{k=0}^{K−1} | |X_f(k)| − |X_{f−1}(k)| |
where f is the index of the frame and K is the frame length. Spectral flux is an efficient feature for speech/music discrimination, since in speech the frame-to-frame spectra fluctuate more than in music, particularly in unvoiced speech.
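Following the summation above (absolute magnitude differences between consecutive frames), a minimal sketch in Python:

```python
def spectral_flux(mag_curr, mag_prev):
    """SF = sum over bins of | |X_f(k)| - |X_{f-1}(k)| |."""
    return sum(abs(a - b) for a, b in zip(mag_curr, mag_prev))
```

Identical consecutive spectra give a flux of zero; any spectral change increases it.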
A.1.10. Short Time Energy (STE)
The short-time energy is one of the energy-based audio features. Li and Zhang used it to classify audio signals. It is easy to calculate and provides a convenient representation of the amplitude variation over time. It indicates the loudness of an audio signal. The STE is a reliable indicator for silence detection. It is defined as the normalized sum of the squared time-domain samples of audio data, as shown in the following equation:
STE = (1/N) · Σ_{n=1}^{N} x(n)²
where x(n) is the value of the sample (in the time domain) and N is the total number of samples in the processing window (frame size). The STE of an audio signal may be affected by the gain value of the recording device. Usually we normalize the value of the STE to reduce this effect.
ZCR and STE are widely used in speech and music recognition applications. Speech, for example, has a high variance in ZCR and STE values, while in music these values are normally much more constant. ZCR and STE have also been used in ESR applications due to their simplicity and low computational complexity.
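The STE of a frame is a one-liner; an illustrative Python sketch matching the normalized sum above:

```python
def short_time_energy(frame):
    """STE = (1/N) * sum(x[n]^2) over the processing frame."""
    return sum(x * x for x in frame) / len(frame)
```

A silent frame yields 0, which is what makes the STE a convenient silence detector.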
A.1.11. Temporal Centroid (TC)
The Temporal Centroid is the time average over the envelope of a signal, in seconds. It is the point in time where, on average, most of the energy of the signal is located.
TC = ( Σ_{n=1}^{N} n · |x(n)|² ) / ( Σ_{n=1}^{N} |x(n)|² )
Note that the computation of temporal centroid is equivalent to that of spectral
centroid in the frequency domain.
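As an energy-weighted average of the sample index (Python sketch for illustration; converting the returned index to seconds would require dividing by the sampling rate):

```python
def temporal_centroid(x):
    """TC = sum(n * |x[n]|^2) / sum(|x[n]|^2), with n starting at 1."""
    num = sum((n + 1) * v * v for n, v in enumerate(x))
    den = sum(v * v for v in x)
    return num / den
```

For a single impulse, the centroid sits exactly at the impulse position.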
A.1.12. Energy Entropy (EE)
The short-term entropy of energy can be interpreted as a measure of abrupt changes in the energy level of an audio signal. In order to compute it, we first divide each short-term frame into K sub-frames of fixed duration. Then, for each sub-frame j, we compute its energy as for the STE and divide it by the total energy, E_shortframe, of the short-term frame.
e_j = E_shortframe_j / E_shortframe_i

where

E_shortframe_i = Σ_{k=1}^{K} E_shortframe_k
At a final step, the entropy H(i) of the sequence e_j is computed according to the equation:

H(i) = − Σ_{j=1}^{K} e_j · log2(e_j)
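Putting the three steps together (sub-frame energies, normalization, entropy), a minimal Python sketch with K sub-frames of equal length:

```python
import math

def energy_entropy(frame, k=4):
    """Entropy of the normalized sub-frame energies e_j."""
    step = len(frame) // k
    energies = [sum(v * v for v in frame[i * step:(i + 1) * step])
                for i in range(k)]
    total = sum(energies)
    h = 0.0
    for e in energies:
        if e > 0:
            p = e / total
            h -= p * math.log2(p)
    return h
```

A constant-level frame gives the maximum entropy log2(K); a frame whose energy is concentrated in one sub-frame (an abrupt burst) gives an entropy near zero.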
A.1.13. Autocorrelation (AC)
The autocorrelation domain represents the correlation of a signal with a time-shifted version of the same signal, for different time lags. It reveals repeating patterns and their periodicities in a signal and can be employed, for example, for the estimation of the fundamental frequency of a signal. This allows distinguishing between sounds that have a harmonic spectrum and a non-harmonic spectrum, e.g., between musical sounds and noise. The autocorrelation of a signal is calculated as follows:
AC = f_xx[τ] = x[τ] ∗ x[−τ] = Σ_{n=0}^{N} x(n) · x(n + τ)
where τ is the lag (discrete delay index), f_xx[τ] is the corresponding autocorrelation value, N is the length of the frame and n the sample index; when τ = 0, f_xx[τ] becomes the signal's power. Similar to the way the RMS is computed, the autocorrelation also steps through windowed portions of a signal, where each window frame's samples are multiplied with each other and then summed according to the above equation. This is repeated, where one frame is kept constant while the other, x(n + τ), is updated by shifting the input x(n) via τ.
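A direct (O(N) per lag) Python sketch of the summation above:

```python
def autocorrelation(x, lag):
    """f_xx[lag] = sum_n x[n] * x[n + lag] over one frame.
    At lag 0 this equals the frame's power (sum of squares)."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))
```

For a signal with a period of 2 samples, the autocorrelation peaks again at lag 2, which is how the fundamental period can be read off.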
A.1.14. Root Mean Square (RMS)
As with the STE, the RMS value is a measure of the energy in a signal. The RMS value is, however, defined as the square root of the average of the squared signal, as seen in the following equation:

RMS = √( (1/N) · Σ_{n=1}^{N} x(n)² )
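The RMS is the square root of the STE; an illustrative Python sketch:

```python
import math

def rms(frame):
    """RMS = sqrt((1/N) * sum(x[n]^2))."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))
```

Unlike the STE, the RMS has the same unit as the signal itself, which makes it easier to interpret as an amplitude.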
A.2. Matlab Code
A.2.1. Workspace_Generator
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% workspace generator node 275 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear; clc; rand('seed',1)
%start of data: 16-10-2014 at 08:00
year=2014;
month=10;
day=16;
hour=08;
zero=num2str(0);
streepje=('-');
maand=1;
%open all files of node 275 into the cell array filesmaand
for aantalmaanden=1:13
filesmaand=cell(1,2);
countfile=0;
while maand==aantalmaanden
countfile=countfile+1;
if day<10
day_string=num2str(day);
day_string=strcat(zero, day_string);
else
day_string=num2str(day);
end
if month<10
month_string=num2str(month);
month_string=strcat(zero, month_string);
else
month_string=num2str(month);
end
if hour<10
hour_string=num2str(hour);
hour_string=strcat(zero,hour_string);
else
hour_string=num2str(hour);
end
year_string=num2str(year);
date=strcat(year_string,streepje,month_string,streepje,day_string,str
eepje,hour_string);
init=('node_275_utc_');
format1=('.txt.gz');
format2=('.txt');
filename_zip=strcat(init,date,format1);
filename_unzip=strcat(init,date,format2);
if exist(filename_unzip,'file')
tmp=importdata(filename_unzip);
filesmaand(countfile,1)={tmp.data};
filesmaand(countfile,2)={[year month day hour]};
elseif exist(filename_zip,'file')
gunzip(filename_zip);
tmp=importdata(filename_unzip);
filesmaand(countfile,1)={tmp.data};
filesmaand(countfile,2)={[year month day hour]};
else
filesmaand(countfile,1)={zeros(28800,31)};
filesmaand(countfile,2)={[year month day hour]};
end
if hour==23
hour=0;
day=day+1;
%roll over to the next month when the day exceeds the number of
%days in the current month (eomday handles month lengths and leap years)
if day>eomday(year,month)
day=1;
month=month+1;
maand=maand+1;
if month==13
month=1;
year=year+1;
end
end
else
hour=hour+1;
end
clearvars ans2;
end
puntmat=('.mat');
filenaam=strcat(year_string,streepje,month_string,puntmat);
save(filenaam,'filesmaand','-v7.3');
end
A.2.2. Mid-level_GMM_generator
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Mid-level GMM generator: 5 components %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear;
clc;
rand('seed',1)
N_7_k=5;
N_7_maxiter=100;
timespan=480;
N_7_xx=linspace(1,31,31)';
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
%load data per month
for maand=10:22
month=maand;
N_7_featurevectorcell=cell(1,2);
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
filename=strcat(year_string,streepje,month_string);
files=matfile(filename);
filesmaand=files.filesmaand;
[filesmaand_rijen, filesmaand_kolommen]=size(filesmaand);
starttijd=1;
for h=1:filesmaand_rijen
hour_to_open=filesmaand{h,1};
[hour_to_open_rij, hour_to_open_kolom]=size(hour_to_open);
if mod((hour_to_open_rij-starttijd),timespan)==0
binarynumber=0;
elseif h==filesmaand_rijen
binarynumber=-1;
else
binarynumber=1;
end
rowsfeaturevectormatrix=N_7_k*...
(floor((hour_to_open_rij-...
(starttijd-1))/timespan)+binarynumber);
columfeaturevectormatrix=N_7_k;
N_7_featurevectormatrix=zeros(rowsfeaturevectormatrix,...
columfeaturevectormatrix);
N_7_featurevectormatrixcell=cell...
(rowsfeaturevectormatrix/N_7_k,1);
forloopaantal=floor((hour_to_open_rij-...
(starttijd-1))/timespan);
parfor r=1:forloopaantal
N_7_matrix=zeros(timespan*31,2);
for i=1:timespan
spectra=hour_to_open...
(starttijd+((r-1)*timespan)+i-1,:);
additionmatrix=[N_7_xx spectra'];
N_7_matrix((i-1)*31+1:...
((i-1)*31)+31,1:2)=additionmatrix;
end
N_7_GMModel=fitgmdist(N_7_matrix,N_7_k,'MaxIter'...
,N_7_maxiter,'RegularizationValue',0.1,'Start',...
'randSample','CovarianceType','full');
N_7_featurevectormatrixje=zeros(N_7_k,5);
for c=1:N_7_k
component_mu=N_7_GMModel.mu(c,:);
component_sigma=N_7_GMModel.Sigma(:,:,c);
N_7_featurevectormatrixje(c,:)=...
[component_mu(1,1) component_mu(1,2)...
component_sigma(1,1) component_sigma(1,2)...
component_sigma(2,2)];
end
N_7_featurevectormatrixcell(r,1)=...
{N_7_featurevectormatrixje};
end
starttijd=starttijd+forloopaantal*timespan;
%from here on, take the remainder of the matrix together with the next file
N_7_matrix=zeros(timespan*31,2);
counter2=0;
%aantalvanvorigecel=filetoopenrij-starttijd;
if binarynumber==1
for o=starttijd:hour_to_open_rij
spectra=hour_to_open(o,:);
additionmatrix=[N_7_xx spectra'];
N_7_matrix(counter2*31+1:(counter2*31)+31,1:2)...
=additionmatrix;
counter2=counter2+1;
end
aantalvannieuwecel=timespan-...
(hour_to_open_rij-starttijd+1);
starttijd=1;
hour_to_open=filesmaand{h+1,1};
for u=starttijd:aantalvannieuwecel
spectra=hour_to_open(u,:);
additionmatrix=[N_7_xx spectra'];
N_7_matrix(counter2*31+1:(counter2*31)...
+31,1:2)=additionmatrix;
counter2=counter2+1;
end
N_7_GMModel=fitgmdist(N_7_matrix,N_7_k,...
'MaxIter',N_7_maxiter,'RegularizationValue',0.1,'Start'...
,'randSample','CovarianceType','full');
N_7_featurevectormatrixje=zeros(N_7_k,5);
for c=1:N_7_k
%counter1=counter1+1;
component_mu=N_7_GMModel.mu(c,:);
component_sigma=N_7_GMModel.Sigma(:,:,c);
N_7_featurevectormatrixje(c,:)=...
[component_mu(1,1) component_mu(1,2) ...
component_sigma(1,1) component_sigma(1,2) ...
component_sigma(2,2)];
end
N_7_featurevectormatrixcell((rowsfeaturevectormatrix...
/5),1)={N_7_featurevectormatrixje};
end
for b=0:forloopaantal+binarynumber-1
N_7_featurevectormatrix((b*N_7_k)+1:(b*N_7_k)+5,:)...
=N_7_featurevectormatrixcell{b+1,1};
end
N_7_featurevectorcell(h,1)={N_7_featurevectormatrix};
N_7_featurevectorcell(h,2)={filesmaand(h,2)};
starttijd=aantalvannieuwecel+1;
end
features_string=('features');
filenaamfeature=strcat(features_string,streepje,year_string,...
streepje,month_string,puntmat);
save(filenaamfeature,'N_7_featurevectorcell','-v7.3');
end
sprintf('done')
A.2.3. Cluster Gaussian Components 5D
%%%%%%%%%%%%%%%%%%%%%%%%%
% Gaussian cluster 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%
clear
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=10:22
month=maand;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]...
=size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
cel=N_7_featurevectorcell{k,2};
[rijblabla kolomblabla]=size(N_7_featurevectorcell{k,1});
matrixblabla=[];
for h=1:rijblabla
matrixblabla=[matrixblabla;cel{1,1}];
end
gaussiancel{counter1,2}=matrixblabla;
end
end
gaussianmat=cell2mat(gaussiancel);
N_10_matrix=gaussianmat(:,1:5);
%now fit a Gaussian mixture model to all Gaussians
N_10_kmax=60;
N_10_maxiter=100;
N_10_replicates=10;
%Create and select the best GMM with AIC and determine k
N_10_AIC=zeros(1,N_10_kmax);
N_10_GMModels=cell(1,N_10_kmax);
N_10_options=statset('MaxIter',N_10_maxiter);
for k=1:N_10_kmax
fprintf('fitting GMM with %d components\n',k);
N_10_GMModels{k}=fitgmdist(N_10_matrix,k,...
'MaxIter',N_10_maxiter,'RegularizationValue',0.1,...
'Start','randSample','CovarianceType','full');
N_10_AIC(k)=N_10_GMModels{k}.AIC;
save('N_10_GMModels.mat','N_10_GMModels','-v7.3');
end
[minAIC,numComponents]=min(N_10_AIC);
N_10_bestModel=N_10_GMModels{numComponents};
for c=1:numComponents
component_mu=N_10_bestModel.mu(c,:);
component_sigma=N_10_bestModel.Sigma(:,:,c);
N_10_featurevectormatrix(c,:)=[component_mu(1,1)...
component_mu(1,2) component_sigma(1,1)...
component_sigma(1,2) component_sigma(2,2)];
end
%plot the 3d gaussians to see if it fits
figure;
hold on
for u=1:numComponents
mu=N_10_featurevectormatrix(u,1:2);
SIGMA=[N_10_featurevectormatrix(u,3)...
N_10_featurevectormatrix(u,4);...
N_10_featurevectormatrix(u,4) N_10_featurevectormatrix(u,5)];
X = mvnrnd(mu,SIGMA,100);
gmm=fitgmdist(X,1,'MaxIter',N_10_maxiter,...
'RegularizationValue',0.1,'Start','randSample',...
'CovarianceType','full');
ezsurf(@(x1,x2)pdf(gmm,[x1 x2]),[0 31],[0 80],100);
end
hold off;
%scatter(N_10_matrix(:,1),N_10_matrix(:,2));
%figure;
%ezcontour(@(x1,x2)pdf(N_10_bestModel,[x1 x2]),[0 31],[0 80],100);
%figure;
%ezsurf(@(x1,x2)pdf(N_10_bestModel,[x1 x2]),[0 31],[0 80],100);
sprintf('done')
A.2.4. Define anomalies based on clustering
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Gaussian anomaly 5d based on pdf %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load N_10_bestModel in gaussian5d
%save gaussiananomaly5d
%save gaussiananomalies5d
%save gaussiananomalydata5d
%save gaussiananomalydatums5d
%clear
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=6:18
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,streepje,...
month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen,...
featurevectorcell_kolommen]=size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
cel=N_7_featurevectorcell{k,2};
[rijblabla kolomblabla]=size(N_7_featurevectorcell{k,1});
matrixblabla=[];
for h=1:rijblabla
matrixblabla=[matrixblabla;cel{1,1}];
end
gaussiancel{counter1,2}=matrixblabla;
end
end
gaussianmat=cell2mat(gaussiancel);
N_10_matrix=gaussianmat(:,1:5);
y=pdf(N_10_bestModel,N_10_matrix);
percentielthreshold= prctile(y,0.05);
minderdan=sum(y<percentielthreshold);
%give data for each anomaly
[rijen kolommen]=size(gaussianmat);
gaussiananomaly5d=gaussianmat(:,1);
gaussiananomalydata5d=gaussianmat(:,6:9);
gaussiananomalies5d=[];
gaussiananomalydatums5d=[];
pdfen=[];
for u=1:rijen
if y(u)<percentielthreshold
gaussiananomaly5d(u)=1;
gaussiananomalies5d=[gaussiananomalies5d;...
N_10_matrix(u,:) y(u)];
gaussiananomalydatums5d=[gaussiananomalydatums5d;...
gaussiananomalydata5d(u,:) y(u) u];
pdfen=[pdfen; y(u)];
else
gaussiananomaly5d(u)=0;
gaussiananomalydata5d(u,:)=0;
end
end
pdfen=sortrows(pdfen,1);
gaussiananomalies5d=sortrows(gaussiananomalies5d,6);
gaussiananomalies5d=gaussiananomalies5d(:,1:5);
gaussiananomalydatums5d=sortrows(gaussiananomalydatums5d,5);
[pointanomalies kolomblabla]=size(gaussiananomalydatums5d);
indexgaussian5d=cluster(N_10_bestModel,N_10_matrix);
featurevectorgaussian5d=zeros(rijen,5);
for h=1:rijen
index=indexgaussian5d(h);
featurevectorgaussian5d(h,1:5)=N_10_bestModel.mu(index,1:5);
end
hold on;
scatter(gaussiananomalies5d(:,1),gaussiananomalies5d(:,2),...
[],'filled');
hold off;
A.2.5. Plot minutes of anomalous Gaussian components
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% spectrogram abnormal gaussian 5D %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% load gaussiananomaly5d.mat
% load gaussiananomalydata5d.mat
% load gaussiananomalydatums5d.mat
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=10:22
month=maand;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=...
size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
cel=N_7_featurevectorcell{k,2};
[rijblabla kolomblabla]=size(N_7_featurevectorcell{k,1});
matrixblabla=[];
for h=1:rijblabla
matrixblabla=[matrixblabla;cel{1,1}];
end
gaussiancel{counter1,2}=matrixblabla;
end
end
gaussianmat=cell2mat(gaussiancel);
N_10_matrix=gaussianmat(:,6:9);
[rijen, kolommen]=size(gaussiananomaly5d);
%in which minute does the anomaly occur?
anomalycounter=0;
[rijendata kolomendata]=size(gaussiananomalydatums5d);
welkeminuut=zeros(rijendata,5);
for n=1:rijendata
anomalycounter=anomalycounter+1;
welkeminuut(anomalycounter,1)=gaussiananomalydatums5d(n,1);
welkeminuut(anomalycounter,2)=gaussiananomalydatums5d(n,2);
welkeminuut(anomalycounter,3)=gaussiananomalydatums5d(n,3);
welkeminuut(anomalycounter,4)=gaussiananomalydatums5d(n,4);
uur=gaussiananomalydatums5d(n,4);
hour=uur;
nogeencounter=0;
idx2=gaussiananomalydatums5d(n,6);
while hour==uur
nogeencounter=nogeencounter+1;
blabla=idx2-nogeencounter;
if blabla==0
hour=38;
else
hour=N_10_matrix(blabla,4);
end
end
welkeminuut(anomalycounter,5)=ceil(nogeencounter/5);
end
[rijenanomalies,kolommenanomalies]=size(welkeminuut);
for z=1:rijenanomalies
figure;
jaar=welkeminuut(z,1);
maand=welkeminuut(z,2);
dag=welkeminuut(z,3);
uur=welkeminuut(z,4);
minuut=welkeminuut(z,5);
jaar_string=num2str(jaar);
maand_string=num2str(maand);
if maand<10
maand_string=strcat(zero,maand_string);
end
file=strcat(jaar_string,streepje,maand_string,formatmat);
file=matfile(file);
filesmaand=file.filesmaand;
[rijenfilesmaand, kolommenfilesmaand]=size(filesmaand);
vector=[jaar maand dag uur];
countertje=1;
h=1;
blabla1=filesmaand{h,2};
while isequal(blabla1,vector)==0
h=h+1;
blabla1=filesmaand{h,2};
countertje=countertje+1;
end
uurmatrix=filesmaand{countertje,1};
xx=linspace(1,31,31);
h=figure;
[uurrij, uurkolom]=size(uurmatrix);
c = linspace(0,1,480);
if minuut==60
if uurrij<28800
counter8=0;
for g=(minuut*480)-479:uurrij
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
uurmatrix2=filesmaand{countertje+1,1};
for b=1:28800-uurrij
counter8=counter8+1;
scatter(xx,uurmatrix2(b,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
end
else
if uurrij<28799
counter8=0;
for g=(minuut*480)-479:min(uurrij,(minuut*480))
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
end
end
rijtezoeken=[jaar maand dag uur];
[ja, positieanomaly]=ismember(rijtezoeken,N_10_matrix,'rows');
positieanomaly=positieanomaly+(minuut-1)*5;
for s=1:5
mu = [gaussianmat(positieanomaly+s-1,1)...
gaussianmat(positieanomaly+s-1,2)];
Sigma = [gaussianmat(positieanomaly+s-1,3)...
gaussianmat(positieanomaly+s-1,4); gaussianmat...
(positieanomaly+s-1,4) gaussianmat(positieanomaly+s-1,5)];
x1 = 0:1:31; x2 = 0:1:80;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
mvncdf([0 0],[1 1],mu,Sigma);
contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
xlabel('frequency'); ylabel('amplitude');
line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
hold on;
end
positieanomaly=gaussiananomalydatums5d(z,6);
scatter(gaussianmat(positieanomaly,1),...
gaussianmat(positieanomaly,2),200,[0 .6 .2],'d','LineWidth',3);
caption=sprintf('anomalous gaussian, datum: %d-%d-%d %dh, minuut: %d',...
jaar,maand,dag,uur,minuut);
title(caption, 'FontSize', 15);
hold off
saveas(h,sprintf('gaussian5d_anomalies_FIG_%d.fig',z));
close all;
%if you prefer to print only the anomalous gaussian contour
% figure;
%
% mu = [gaussianmat(positieanomaly,1) gaussianmat(positieanomaly,2)];
% Sigma = [gaussianmat(positieanomaly,3) gaussianmat(positieanomaly,4);
% gaussianmat(positieanomaly,4) gaussianmat(positieanomaly,5)];
% x1 = 0:1:31; x2 = 0:1:80;
% [X1,X2] = meshgrid(x1,x2);
% F = mvnpdf([X1(:) X2(:)],mu,Sigma);
% F = reshape(F,length(x2),length(x1));
% mvncdf([0 0],[1 1],mu,Sigma);
% contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
% xlabel('frequency'); ylabel('amplitude');
% line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
% hold on;
% scatter(gaussianmat(positieanomaly,1),...
% gaussianmat(positieanomaly,2),200,[0 .6 .2],'d','LineWidth',3);
%
% caption=sprintf('anomalous gaussian, datum: %d-%d-%d %dh, minuut: %d',...
% jaar,maand,dag,uur,minuut);
% title(caption, 'FontSize', 15);
%
% hold off
% saveas(h,sprintf('onegaussian5d_anomalies_FIG_%d.fig',z));
% close all;
end
A.2.6. Cluster based on 5D iKmeans
%%%%%%%%%%%%%%%%%%%%%%%
% cluster k-means 5d %
%%%%%%%%%%%%%%%%%%%%%%%
%first put everything into one matrix
clear
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=6:18
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=size...
(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
N_10_matrix=gaussianmat(:,1:5);
%kmeans does not work here because it would require far too large matrices
%use iK-means instead
[Centroids QtdEntitiesInCluster] = i_Kmeans...
(N_10_matrix, 1, false, 60);
%cluster vertically only
%[indexnumber Centrevector]=kmeans(N_7_featurevectormatrix(:,1),5);
% x1 = min(N_10_matrix(:,1)):0.01:max(N_10_matrix(:,1));
% x2 = min(N_10_matrix(:,2)):0.01:max(N_10_matrix(:,2));
% [x1G,x2G] = meshgrid(x1,x2);
% XGrid = [x1G(:),x2G(:)]; % Defines a fine grid on the plot
%
% [clusternumber Centrevector] = kmeans(N_10_matrix,58);
%
%
% idx2Region = kmeans(XGrid,58,'MaxIter',1,'Start',Centrevector(:,1:2));
%
% figure;
% gscatter(XGrid(:,1),XGrid(:,2),idx2Region,...
% [0,0.75,0.75;0.75,0,0.75;0.75,0.75,0;...
% 0.5,0,0;0,0.5,0;0,0,0.5;...
% 0.2,0,0;0,0.2,0;0,0,0.2;...
% 0,0.8,0.8;0.8,0,0.8;0,0,0.8;...
% 0,0.35,0.35;0.35,0,0.35;0.35,0.35,0;...
% 0,0.75,0.75;0.75,0,0.75;0.75,0.75,0;...
% 0.5,0,0;0,0.5,0;0,0,0.5;...
% 0.2,0,0;0,0.2,0;0,0,0.2;...
% 0,0.8,0.8;0.8,0,0.8;0,0,0.8;...
% 0,0.35,0.35;0.35,0,0.35;0.35,0.35,0;...
% 0,0.75,0.75;0.75,0,0.75;0.75,0.75,0;...
% 0.5,0,0;0,0.5,0;0,0,0.5;...
% 0.2,0,0;0,0.2,0;0,0,0.2;...
% 0,0.8,0.8;0.8,0,0.8;0,0,0.8;...
% 0,0.35,0.35;0.35,0,0.35;0.35,0.35,0;...
% 0,0.75,0.75;0.75,0,0.75;0.75,0.75,0;...
% 0.5,0,0;0,0.5,0;0,0,0.5;...
% 0.2,0,0;0,0.2,0;0,0,0.2;...
% ],'..');
% hold on;
% plot(N_10_matrix(:,1),N_10_matrix(:,2));
% title 'Mean values of gaussian components';
% xlabel 'Frequency band mean';
% ylabel 'Amplitude mean';
% %legend('Region 1','Region 2','Region 3','Region 4','Region 5','Data','Location','Best');
% hold off;
A.2.7. Anomalies based on 5d iKmeans
%%%%%%%%%%%%%%%%%%%%%%%%%
% posterior kmeans 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%
%load Centroids5d in kmeans5d
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=6:18
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,...
year_string,streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=size...
(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
N_10_matrix=gaussianmat(:,1:5);
[rijen kolommen]=size(gaussianmat);
distancematrix=zeros(rijen,19);
for y=1:rijen
for o=1:19
D = pdist2(Centroids(o,:),N_10_matrix(y,1:5));
distancematrix(y,o)=D;
end
end
%mindistance=zeros(rijen,1);
distancetranspose=transpose(distancematrix);
[mindistance, indexkmeans5d]=min(distancetranspose);
mindistance=transpose(mindistance);
indexkmeans5d=transpose(indexkmeans5d);
% for t=1:rijen
% mindistance(t)=min(distancematrix(t,:));
% end
percentielthreshold= prctile(mindistance,99.99);
meerdan=sum(mindistance>percentielthreshold);
kmeansanomaly5d=gaussianmat(:,1);
kmeansanomalydata5d=gaussianmat(:,6:9);
kmeansanomalies5d=[];
kmeansanomalydatums5d=[];
for u=1:rijen
if mindistance(u)>percentielthreshold
kmeansanomaly5d(u)=1;
kmeansanomalies5d=[kmeansanomalies5d;N_10_matrix(u,:)];
kmeansanomalydatums5d=[kmeansanomalydatums5d;...
kmeansanomalydata5d(u,:)];
else
kmeansanomaly5d(u)=0;
kmeansanomalydata5d(u,:)=0;
end
end
[pointanomalies kolomblabla]=size(kmeansanomalydatums5d);
featurevectorkmeans5d=zeros(rijen,5);
for h=1:rijen
index=indexkmeans5d(h);
featurevectorkmeans5d(h,:)=Centroids(index,:);
end
%plot the gmm and the anomaly points
%one colour per event (consecutive anomalies form one event)
kleurlijn=[1];
kleurcounter=1;
for k=2:pointanomalies
if isequal(kmeansanomalydatums5d(k,:),...
kmeansanomalydatums5d(k-1,:));
else
kleurcounter=kleurcounter+1;
end
kleurlijn=[kleurlijn kleurcounter];
end
figure;
% for t=1:19
% mu = N_10_bestModel.mu(t);
% Sigma = N_10_bestModel.Sigma(:,:,t);
% x1 = 0:1:31; x2 = 0:1:80;
% [X1,X2] = meshgrid(x1,x2);
% F = mvnpdf([X1(:) X2(:)],mu,Sigma);
% F = reshape(F,length(x2),length(x1));
%
% mvncdf([0 0],[1 1],mu,Sigma);
% contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
% xlabel('x'); ylabel('y');
% line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
% hold on;
% end
% hold off;
%ezcontour(@(x1,x2)pdf(N_10_bestModel,[x1 x2]),[0 31],[0 100],100);
%hold on;
scatter(kmeansanomalies5d(:,1),kmeansanomalies5d(:,2)...
,[],kleurlijn,'filled');
%hold off;
A.2.8. Clustering defined by spectral features
%%%%%%%%%%%%%%%%%%%%%%%%%
% posterior kmeans 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%
%load Centroids5d in kmeans5d
24
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=6:18
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string...
,streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=size...
(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
N_10_matrix=gaussianmat(:,1:5);
[rijen kolommen]=size(gaussianmat);
distancematrix=zeros(rijen,19);
for y=1:rijen
for o=1:19
D = pdist2(Centroids(o,:),N_10_matrix(y,1:5));
distancematrix(y,o)=D;
end
end
%mindistance=zeros(rijen,1);
distancetranspose=transpose(distancematrix);
[mindistance, indexkmeans5d]=min(distancetranspose);
mindistance=transpose(mindistance);
indexkmeans5d=transpose(indexkmeans5d);
% for t=1:rijen
% mindistance(t)=min(distancematrix(t,:));
% end
percentielthreshold= prctile(mindistance,99.99);
meerdan=sum(mindistance>percentielthreshold);
kmeansanomaly5d=gaussianmat(:,1);
kmeansanomalydata5d=gaussianmat(:,6:9);
kmeansanomalies5d=[];
kmeansanomalydatums5d=[];
for u=1:rijen
if mindistance(u)>percentielthreshold
kmeansanomaly5d(u)=1;
kmeansanomalies5d=[kmeansanomalies5d;N_10_matrix(u,:)];
kmeansanomalydatums5d=[kmeansanomalydatums5d;kmeansanomalydata5d(u,:)];
else
kmeansanomaly5d(u)=0;
kmeansanomalydata5d(u,:)=0;
end
end
[pointanomalies kolomblabla]=size(kmeansanomalydatums5d);
featurevectorkmeans5d=zeros(rijen,5);
for h=1:rijen
index=indexkmeans5d(h);
featurevectorkmeans5d(h,:)=Centroids(index,:);
end
%plot the gmm and the anomaly points
%one color per event (consecutive anomalies form one event)
kleurlijn=[1];
kleurcounter=1;
for k=2:pointanomalies
if ~isequal(kmeansanomalydatums5d(k,:),...
kmeansanomalydatums5d(k-1,:))
kleurcounter=kleurcounter+1;
end
kleurlijn=[kleurlijn kleurcounter];
end
figure;
% for t=1:19
% mu = N_10_bestModel.mu(t);
% Sigma = N_10_bestModel.Sigma(:,:,t);
% x1 = 0:1:31; x2 = 0:1:80;
% [X1,X2] = meshgrid(x1,x2);
% F = mvnpdf([X1(:) X2(:)],mu,Sigma);
% F = reshape(F,length(x2),length(x1));
%
% mvncdf([0 0],[1 1],mu,Sigma);
% contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
% xlabel('x'); ylabel('y');
% line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
% hold on;
% end
% hold off;
%ezcontour(@(x1,x2)pdf(N_10_bestModel,[x1 x2]),[0 31],[0 100],100);
%hold on;
scatter(kmeansanomalies5d(:,1),kmeansanomalies5d(:,2)...
,[],kleurlijn,'filled');
%hold off;
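The listing above flags a frame as anomalous when its distance to the nearest k-means centroid exceeds a high percentile of all such distances. The standalone Python sketch below illustrates that logic on made-up data (the points, the single centroid, and the 75th-percentile cut-off are hypothetical; MATLAB's prctile uses a slightly different interpolation rule):

```python
import math

def nearest_centroid_distances(points, centroids):
    # distance from each point to its closest centroid
    return [min(math.dist(p, c) for c in centroids) for p in points]

def percentile(values, q):
    # linear-interpolation percentile over the sorted values
    s = sorted(values)
    pos = (q / 100) * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (0.2, 0.1)]
centroids = [(0.0, 0.0)]
dists = nearest_centroid_distances(points, centroids)
thr = percentile(dists, 75)
anomalies = [i for i, d in enumerate(dists) if d > thr]  # indices above threshold
```

In the thesis code the same role is played by pdist2, the minimum over the 19 centroids, and prctile(mindistance,99.99).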
A.2.9. Anomalies based on spectral feature clustering
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% posterior GMM spectral features %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
PCAspectralfeatures=('PCAspectralfeatures');
streepje=('-');
formatmat=('.mat');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
year=2014;
zero=('0');
for maand=6:18
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(PCAspectralfeatures,streepje,...
year_string,streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_6_featurevectorcellpca=feature_file.N_6_featurevectorcellpca;
[featurevectorcell_rijen, featurevectorcell_kolommen]=size...
(N_6_featurevectorcellpca);
for k=1:featurevectorcell_rijen
hour_to_open=N_6_featurevectorcellpca{k,1};
data_to_open=N_6_featurevectorcellpca{k,2};
[rij kolom]=size(hour_to_open);
nieuwekolom=zeros(rij,2);
data_to_open=[data_to_open nieuwekolom];
anomalies=0;
for h=1:rij
counter1=counter1+1;
minuut=floor(h/480)+1;
data_to_open(h,5)=minuut;
anomalies=anomalies+gaussiananomaly(counter1,1);
if mod((h),480)==0
data_to_open(h,6)=anomalies;
for c=1:479
data_to_open((h-c),6)=anomalies;
end
anomalies=0;
elseif h==rij
for w=0:mod(rij,480)-1
data_to_open(h-w,6)=anomalies;
end
anomalies=0;
end
end
N_6_featurevectorcellpca{k,2}=data_to_open;
% gaussiancel{counter1,1}=N_6_featurevectorcell{k,1};
% gaussiancel{counter1,2}=N_6_featurevectorcell{k,2};
end
PCAspectralfeatures_anomalies=('PCAspectralfeatures_anomalies');
feature_file=strcat(PCAspectralfeatures_anomalies,streepje...
,year_string,streepje,month_string,formatmat);
save(feature_file,'N_6_featurevectorcellpca','-v7.3');
end
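The nested loop above aggregates the per-frame anomaly flags into per-minute counts: each minute spans 480 frames, the flags in a block are summed, and the sum is written back onto every frame of that block, with a shorter final block handled separately. A pure-Python sketch of the same bookkeeping (the block size of 4 and the flag vector are invented for illustration):

```python
def per_block_counts(flags, block=480):
    # sum flags per block and broadcast the sum to every row of the block;
    # the last block may be shorter, as at the end of an hour
    out = [0] * len(flags)
    for start in range(0, len(flags), block):
        chunk = flags[start:start + block]
        total = sum(chunk)
        for i in range(start, start + len(chunk)):
            out[i] = total
    return out

flags = [0, 1, 0, 1, 1, 0, 0]              # 7 frames, hypothetical anomaly flags
counts = per_block_counts(flags, block=4)  # full block holds 2, partial block 1
```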
A.2.10. Unusual minutes Joint Probability
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% unusual minute joint probability 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load indexgaussian5d.mat from gaussian 5d
%save unusualminutesjointprob
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=10:22
month=maand;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string...
,streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=...
size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
gaussianmat=gaussianmat(2401:end,:);
%build a new per-minute matrix of cluster indices
[aantalrijenx5 aantalkolommen]=size(indexgaussian5d);
aantalrijen=aantalrijenx5/5;
minuutmatrix=zeros(aantalrijen,10);
counterminuut=0;
for b=1:aantalrijenx5
rij=ceil(b/5);
if mod(b,5)==0
kolom=5;
else
kolom=mod(b,5);
end
minuutmatrix(rij,kolom)=indexgaussian5d(b);
if mod(b,5)==1
minuutmatrix(rij,6:9)=gaussianmat(b,6:9);
end
if b>1
if gaussianmat(b,9)==gaussianmat(b-1,9);
counterminuut=counterminuut+1;
minuutmatrix(rij,10)=ceil(counterminuut/5);
else
counterminuut=1;
minuutmatrix(rij,10)=counterminuut;
end
else
counterminuut=1;
minuutmatrix(rij,10)=counterminuut;
end
end
string=('string');
A=zeros(max(indexgaussian5d),1);
for s=1:max(indexgaussian5d)
A(s)=(sum(indexgaussian5d(:)==s)/aantalrijenx5);
end
jointprobmatrix=zeros(aantalrijen,1);
for c=1:aantalrijen
jointprob=1;
for w=1:max(indexgaussian5d)
if sum(minuutmatrix(c,:)==w)*A(w)>0
jointprob=jointprob*sum(minuutmatrix(c,:)==w)*A(w);
end
end
jointprobmatrix(c,1)=jointprob;
end
percentielthreshold= prctile(jointprobmatrix,0.1);
minderdan=sum(jointprobmatrix<percentielthreshold);
unusualminutesjointprob=zeros(minderdan,11);
anomalycountertje=0;
gaussiananomaly5d=zeros(aantalrijen,1);
for g=1:aantalrijen
if jointprobmatrix(g,:)<percentielthreshold
anomalycountertje=anomalycountertje+1;
unusualminutesjointprob(anomalycountertje,1:10)=...
minuutmatrix(g,:);
unusualminutesjointprob(anomalycountertje,11)...
=jointprobmatrix(g,:);
gaussiananomaly5d(g,1)=1; %flag the low-probability minute
else
gaussiananomaly5d(g,1)=0;
end
end
unusualminutesjointprob=sortrows(unusualminutesjointprob,11);
unusualminutesjointprob=unusualminutesjointprob(:,1:10);
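Each minute is summarised by five Gaussian-component indices; the script estimates every component's relative frequency over the whole dataset and scores a minute by multiplying, for each component it contains, the number of occurrences within that minute by the component's overall probability. Minutes assembled from rare components therefore receive a low joint score. A Python sketch with invented labels:

```python
from collections import Counter

def cluster_frequencies(labels):
    # relative frequency of each cluster index over the whole dataset
    n = len(labels)
    return {c: k / n for c, k in Counter(labels).items()}

def minute_joint_probability(minute_labels, freq):
    # product over distinct clusters of (count in minute) * P(cluster)
    score = 1.0
    for c, k in Counter(minute_labels).items():
        score *= k * freq.get(c, 0.0)
    return score

labels = [1, 1, 1, 1, 2, 1, 1, 2, 2, 3]  # two minutes of five labels each
freq = cluster_frequencies(labels)
scores = [minute_joint_probability(labels[i:i + 5], freq) for i in (0, 5)]
```

The second minute mixes rarer components and scores lower; the thesis then keeps the scores below the 0.1th percentile as unusual minutes.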
A.2.11. Unusual minutes Joint Correlation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% unusual minute correlation 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load indexgaussian5d.mat from gaussian 5d
%save unusualminutesjointcorr
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=10:22
month=maand;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=...
size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
gaussianmat=gaussianmat(2401:end,:);
%build a new per-minute matrix of cluster indices
[aantalrijenx5 aantalkolommen]=size(indexgaussian5d);
aantalrijen=aantalrijenx5/5;
minuutmatrix=zeros(aantalrijen,10);
counterminuut=0;
for b=1:aantalrijenx5
rij=ceil(b/5);
if mod(b,5)==0
kolom=5;
else
kolom=mod(b,5);
end
minuutmatrix(rij,kolom)=indexgaussian5d(b);
if mod(b,5)==1
minuutmatrix(rij,6:9)=gaussianmat(b,6:9);
end
if b>1
if gaussianmat(b,9)==gaussianmat(b-1,9);
counterminuut=counterminuut+1;
minuutmatrix(rij,10)=ceil(counterminuut/5);
else
counterminuut=1;
minuutmatrix(rij,10)=counterminuut;
end
else
counterminuut=1;
minuutmatrix(rij,10)=counterminuut;
end
end
%convert the per-minute matrix to binary cluster indicators
binarymatrix=zeros(aantalrijen,max(indexgaussian5d));
for a=1:aantalrijen
for d=1:max(indexgaussian5d)
som=(sum(minuutmatrix(a,:)==d));
if som>0
binarymatrix(a,d)=1;
else
binarymatrix(a,d)=0;
end
end
end
correlations=corrcoef(binarymatrix);
jointcorrelationmatrix=zeros(aantalrijen,1);
for q=1:aantalrijen
optetellen=0;
for r=1:4
for z=r+1:5
if correlations(minuutmatrix(q,z),minuutmatrix(q,r))==1
else
optetellen=optetellen+correlations(minuutmatrix(q,z)...
,minuutmatrix(q,r));
end
end
end
jointcorrelationmatrix(q,1)=optetellen;
end
%very uncommon minutes
percentielthreshold= prctile(jointcorrelationmatrix,0.1);
minderdan=sum(jointcorrelationmatrix<percentielthreshold);
unusualminutesjointcorr=zeros(minderdan,11);
anomalycountertje=0;
gaussiananomaly5d=zeros(aantalrijen,1);
for g=1:aantalrijen
if jointcorrelationmatrix(g,:)<percentielthreshold
anomalycountertje=anomalycountertje+1;
unusualminutesjointcorr(anomalycountertje,1:10)...
=minuutmatrix(g,:);
unusualminutesjointcorr(anomalycountertje,11)...
=jointcorrelationmatrix(g,:);
gaussiananomaly5d(g,1)=1;
else
gaussiananomaly5d(g,1)=0;
end
end
unusualminutesjointcorr=sortrows(unusualminutesjointcorr,11);
unusualminutesjointcorr=unusualminutesjointcorr(:,1:10);
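Here a minute is scored by summing the pairwise correlations of the components it activates, taken from the correlation matrix of the binary minute-by-component matrix; pairs whose correlation is exactly 1 (a component paired with itself) are skipped, as in the loop above. A Python sketch with a hypothetical correlation table:

```python
from itertools import combinations

def minute_correlation_score(minute_labels, corr):
    # sum correlations over all pairs of the five labels,
    # skipping self-pairs whose correlation is exactly 1
    score = 0.0
    for a, b in combinations(minute_labels, 2):
        r = corr[(min(a, b), max(a, b))]
        if r != 1.0:
            score += r
    return score

# hypothetical symmetric correlation table for three components
corr = {(1, 1): 1.0, (2, 2): 1.0, (3, 3): 1.0,
        (1, 2): 0.8, (1, 3): -0.4, (2, 3): 0.1}
score = minute_correlation_score([1, 2, 3, 1, 2], corr)
```

Minutes mixing components that rarely co-occur collect negative terms and fall below the 0.1th-percentile threshold.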
A.2.12. Plot Unusual minutes based on Joint Probability
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% spectrograms low joint probabilities 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% load unusualminutesjointprob from unusual minutes 5d normal
% load N_10_featurevectormatrix from gaussian5d normal
% load gaussiananomalydata5d from gaussian5d normal
% print the unusualminutes
[rij kolom]=size(unusualminutesjointprob);
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=10:22
month=maand;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=...
size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
gaussianmat=gaussianmat(2401:end,:);
N_10_matrix=gaussianmat(:,6:9);
for z=1:rij
jaar=unusualminutesjointprob(z,6);
maand=unusualminutesjointprob(z,7);
dag=unusualminutesjointprob(z,8);
uur=unusualminutesjointprob(z,9);
minuut=unusualminutesjointprob(z,10);
% second option to define minuut, same result
% hour=uur;
% nogeencounter=0;
% [~, idx2]=ismember(unusualminutesjointprob...
% (z,6:9),N_10_matrix,'rows');
% while hour==uur
% nogeencounter=nogeencounter+1;
% blabla=idx2-nogeencounter;
% hour=N_10_matrix(blabla,4);
% end
% minuut=ceil(nogeencounter/5);
jaar_string=num2str(jaar);
maand_string=num2str(maand);
if maand<10
maand_string=strcat(zero,maand_string);
end
file=strcat(jaar_string,streepje,maand_string,formatmat);
file=matfile(file);
filesmaand=file.filesmaand;
[rijenfilesmaand, kolommenfilesmaand]=size(filesmaand);
vector=[jaar maand dag uur];
countertje=1;
h=1;
blabla1=filesmaand{h,2};
while isequal(blabla1,vector)==0
h=h+1;
blabla1=filesmaand{h,2};
countertje=countertje+1;
end
uurmatrix=filesmaand{countertje,1};
xx=linspace(1,31,31);
h=figure;
[uurrij, uurkolom]=size(uurmatrix);
c = linspace(0,1,480);
if minuut==60
if uurrij<28800
counter8=0;
for g=(minuut*480)-479:uurrij
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
uurmatrix2=filesmaand{countertje+1,1};
for b=1:(28800-uurrij) %continue with the first rows of the next hour
counter8=counter8+1;
scatter(xx,uurmatrix2(b,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
end
else
if uurrij<28799
counter8=0;
for g=(minuut*480)-479:min(uurrij,(minuut*480))
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
end
%[~,indx]=ismember(unusualminutesjointprob(z,6:9),...
%gaussiananomalydata5d,'rows');
%positieanomaly=indx+minuut-1;
%positieanomaly=positiematrix(z);
hold on;
for w=1:5
rijtezoeken=[jaar maand dag uur];
[ja, positieanomaly]=ismember(rijtezoeken,N_10_matrix,...
'rows');
positieanomaly=positieanomaly+(minuut-1)*5;
idx=indexgaussian5d(positieanomaly+w-1);
mu=[N_10_featurevectormatrix(idx,1)...
N_10_featurevectormatrix(idx,2)];
Sigma=[N_10_featurevectormatrix(idx,3)...
N_10_featurevectormatrix(idx,4);...
N_10_featurevectormatrix(idx,4)...
N_10_featurevectormatrix(idx,5)];
%mu = [N_10_featurevectormatrix(unusualminutesjointprob(z,w),1)...
% N_10_featurevectormatrix(unusualminutesjointprob(z,w),2)];
%Sigma = [N_10_featurevectormatrix(unusualminutesjointprob(z,w),3)...
% N_10_featurevectormatrix(unusualminutesjointprob(z,w),4);...
% N_10_featurevectormatrix(unusualminutesjointprob(z,w),4)...
% N_10_featurevectormatrix(unusualminutesjointprob(z,w),5)];
x1 = 0:1:31; x2 = 0:1:80;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
mvncdf([0 0],[1 1],mu,Sigma);
contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
xlabel('frequency'); ylabel('amplitude');
line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
hold on;
scatter(mu(1), mu(2),200,[0 .6 .2],'d','LineWidth',3);
end
% alternative method to find the minute
% for h=0:4
%
% rijtezoeken=[jaar maand dag uur];
% [ja, positieanomaly]...
% =ismember(rijtezoeken,N_10_matrix,'rows');
% positieanomaly=positieanomaly+(minuut-1)*5;
% mu = [gaussianmat(positieanomaly+h,1)...
% gaussianmat(positieanomaly+h,2)];
% Sigma = [gaussianmat(positieanomaly+h,3)...
% gaussianmat(positieanomaly+h,4);...
% gaussianmat(positieanomaly+h,4)...
% gaussianmat(positieanomaly+h,5)];
% x1 = 0:1:31; x2 = 0:1:80;
% [X1,X2] = meshgrid(x1,x2);
% F = mvnpdf([X1(:) X2(:)],mu,Sigma);
% F = reshape(F,length(x2),length(x1));
%
% mvncdf([0 0],[1 1],mu,Sigma);
% contour(x1,x2,F,[.0001 .001 .01 .05:.1:...
% .95 .99 .999 .9999]);
% xlabel('frequency'); ylabel('amplitude');
% line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
% hold on;
% scatter(mu(1), mu(2),200,[0 .6 .2],'d','LineWidth',3);
%
% end
hold off
caption=sprintf('low joint-probability minute, datum: %d-%d-%d %dh, minuut:%d',jaar,maand,dag,uur,minuut);
title(caption, 'FontSize', 15);
saveas(h,sprintf('unusualminute_jointprob_5d_FIG_%d.fig',z));
close all;
end
% minute plotted with its proper gaussian components instead of cluster centroids
% %for z=1:rij
% for z=1:1
% jaar=unusualminutesjointprob(z,6);
% maand=unusualminutesjointprob(z,7);
% dag=unusualminutesjointprob(z,8);
% uur=unusualminutesjointprob(z,9);
% minuut=unusualminutesjointprob(z,10);
% % hour=uur;
% % nogeencounter=0;
% % [~, idx2]=ismember...
% % (unusualminutesjointprob(z,6:9),N_10_matrix,'rows');
% % while hour==uur
% % nogeencounter=nogeencounter+1;
% % blabla=idx2-nogeencounter;
% % hour=N_10_matrix(blabla,4);
% % end
% % minuut=ceil(nogeencounter/5);
% jaar_string=num2str(jaar);
% maand_string=num2str(maand);
% if maand<10
% maand_string=strcat(zero,maand_string);
% end
%
%
%
% file=strcat(jaar_string,streepje,maand_string,formatmat);
% file=matfile(file);
% filesmaand=file.filesmaand;
%
% [rijenfilesmaand, kolommenfilesmaand]=size(filesmaand);
% vector=[jaar maand dag uur];
% countertje=1;
% h=1;
% blabla1=filesmaand{h,2};
% while isequal(blabla1,vector)==0
% h=h+1;
% blabla1=filesmaand{h,2};
% countertje=countertje+1;
% end
% uurmatrix=filesmaand{countertje,1};
% xx=linspace(1,31,31);
% h=figure;
%
% [uurrij, uurkolom]=size(uurmatrix);
% c = linspace(0,1,480);
% if minuut==60
% if uurrij<28800
% counter8=0;
% for g=(minuut*480)-479:uurrij
% counter8=counter8+1;
% scatter(xx,uurmatrix(g,:),[],...
% [(1-c(counter8)) 0 c(counter8)]);
% hold on
% end
% uurmatrix2=filesmaand{countertje+1,1};
% for b=28800-uurrij
% counter8=counter8+1;
% scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
% 0 c(counter8)]);
% hold on
% end
% else
%
% counter8=0;
% for g=(minuut*480)-479:(minuut*480)
% counter8=counter8+1;
% scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
% 0 c(counter8)]);
% hold on
% end
% end
% else
% if uurrij<28799
% counter8=0;
% for g=(minuut*480)-479:min(uurrij,(minuut*480))
% counter8=counter8+1;
% scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
% 0 c(counter8)]);
% hold on
% end
% else
% counter8=0;
% for g=(minuut*480)-479:(minuut*480)
% counter8=counter8+1;
% scatter(xx,uurmatrix(g,:),[],[(1-c(counter8)) ...
% 0 c(counter8)]);
% hold on
% end
% end
% end
%
% %[~,indx]=ismember(unusualminutesjointprob(z,6:9),...
% gaussiananomalydata5d,'rows');
% %positieanomaly=indx+minuut-1;
% %positieanomaly=positiematrix(z);
% hold on;
%
%
% for s=1:5
%
% rijtezoeken=[jaar maand dag uur];
% [ja, positieanomaly]=ismember(rijtezoeken,N_10_matrix,...
% 'rows');
% positieanomaly=positieanomaly+(minuut-1)*5;
% mu = [gaussianmat(positieanomaly+s-1,1)...
% gaussianmat(positieanomaly+s-1,2)];
% Sigma = [gaussianmat(positieanomaly+s-1,3)...
% gaussianmat(positieanomaly+s-1,4);...
% gaussianmat(positieanomaly+s-1,4)...
% gaussianmat(positieanomaly+s-1,5)];
% x1 = 0:1:31; x2 = 0:1:80;
% [X1,X2] = meshgrid(x1,x2);
% F = mvnpdf([X1(:) X2(:)],mu,Sigma);
% F = reshape(F,length(x2),length(x1));
%
% mvncdf([0 0],[1 1],mu,Sigma);
% contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
% xlabel('frequency'); ylabel('amplitude');
% line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
% hold on;
% scatter(mu(1), mu(2),200,[0 .6 .2],'d','LineWidth',3);
%
% end
%
% hold off
% caption=sprintf('low joint-probability minute, datum: %d-%d-%d %dh, minuut:%d',jaar,maand,dag,uur,minuut);
% title(caption, 'FontSize', 15);
%
%
% saveas(h,sprintf('unusualminuteowngaussian_jointprob_5d_FIG_%d.fig',z));
% close all;
% end
A.2.13. Plot Unusual minutes based on Joint Correlation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% spectrograms low joint correlations 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% load unusualminutesjointcorr from unusual minutes 5d normal
% load N_10_featurevectormatrix from gaussian5d normal
% load gaussiananomalydata5d from gaussian5d normal
[rij kolom]=size(unusualminutesjointcorr);
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
%print the unusualminutes
year=2014;
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=10:22
month=maand;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]...
=size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
gaussianmat=gaussianmat(2401:end,:);
N_10_matrix=gaussianmat(:,6:9);
for z=1:rij
jaar=unusualminutesjointcorr(z,6);
maand=unusualminutesjointcorr(z,7);
dag=unusualminutesjointcorr(z,8);
uur=unusualminutesjointcorr(z,9);
%minuut=unusualminutesjointcorr(z,10);
hour=uur;
nogeencounter=0;
[~, idx2]=...
ismember(unusualminutesjointcorr(z,6:9),N_10_matrix,'rows');
while hour==uur
nogeencounter=nogeencounter+1;
blabla=idx2-nogeencounter;
hour=N_10_matrix(blabla,4);
end
minuut=ceil(nogeencounter/5);
jaar_string=num2str(jaar);
maand_string=num2str(maand);
if maand<10
maand_string=strcat(zero,maand_string);
end
file=strcat(jaar_string,streepje,maand_string,formatmat);
file=matfile(file);
filesmaand=file.filesmaand;
[rijenfilesmaand, kolommenfilesmaand]=size(filesmaand);
vector=[jaar maand dag uur];
countertje=1;
h=1;
blabla1=filesmaand{h,2};
while isequal(blabla1,vector)==0
h=h+1;
blabla1=filesmaand{h,2};
countertje=countertje+1;
end
uurmatrix=filesmaand{countertje,1};
xx=linspace(1,31,31);
h=figure;
[uurrij, uurkolom]=size(uurmatrix);
c = linspace(0,1,480);
if minuut==60
if uurrij<28800
counter8=0;
for g=(minuut*480)-479:uurrij
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
uurmatrix2=filesmaand{countertje+1,1};
for b=1:(28800-uurrij) %continue with the first rows of the next hour
counter8=counter8+1;
scatter(xx,uurmatrix2(b,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
else
if uurrij<28799
counter8=0;
for g=(minuut*480)-479:min(uurrij,(minuut*480))
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
end
%[~,indx]=ismember(unusualminutesjointprob(z,6:9)...
%,gaussiananomalydata5d,'rows');
%positieanomaly=indx+minuut-1;
%positieanomaly=positiematrix(z);
hold on;
for w=1:5
mu = [N_10_featurevectormatrix...
(unusualminutesjointcorr(z,w),1) ...
N_10_featurevectormatrix(unusualminutesjointcorr(z,w),2)];
Sigma = [N_10_featurevectormatrix...
(unusualminutesjointcorr(z,w),3) N_10_featurevectormatrix...
(unusualminutesjointcorr(z,w),4); N_10_featurevectormatrix...
(unusualminutesjointcorr(z,w),4) N_10_featurevectormatrix...
(unusualminutesjointcorr(z,w),5)];
x1 = 0:1:31; x2 = 0:1:80;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
mvncdf([0 0],[1 1],mu,Sigma);
contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
xlabel('frequency'); ylabel('amplitude');
line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
hold on;
scatter(mu(1), mu(2),200,[0 .6 .2],'d','LineWidth',3);
end
caption=sprintf('low joint-correlation minute, datum: %d-%d-%d %dh, minuut:%d',jaar,maand,dag,uur,minuut);
title(caption, 'FontSize', 15);
hold off
saveas(h,sprintf('unusualminutes_jointcorr_5d_FIG_%d.fig',z));
close all;
end
%most usual minutes
percentielthreshold= prctile(jointcorrelationmatrix,99.99);
meerdan=sum(jointcorrelationmatrix>percentielthreshold);
unusualminutesjointcorr=zeros(meerdan,10);
anomalycountertje=0;
for g=1:aantalrijen
if jointcorrelationmatrix(g,:)>percentielthreshold
anomalycountertje=anomalycountertje+1;
unusualminutesjointcorr(anomalycountertje,:)...
=minuutmatrix(g,:);
end
end
%print the usualminutes
for z=1:meerdan
jaar=unusualminutesjointcorr(z,6);
maand=unusualminutesjointcorr(z,7);
dag=unusualminutesjointcorr(z,8);
uur=unusualminutesjointcorr(z,9);
minuut=unusualminutesjointcorr(z,10);
jaar_string=num2str(jaar);
maand_string=num2str(maand);
if maand<10
maand_string=strcat(zero,maand_string);
end
file=strcat(jaar_string,streepje,maand_string,formatmat);
file=matfile(file);
filesmaand=file.filesmaand;
[rijenfilesmaand, kolommenfilesmaand]=size(filesmaand);
vector=[jaar maand dag uur];
countertje=1;
h=1;
blabla1=filesmaand{h,2};
while isequal(blabla1,vector)==0
h=h+1;
blabla1=filesmaand{h,2};
countertje=countertje+1;
end
uurmatrix=filesmaand{countertje,1};
xx=linspace(1,31,31);
h=figure;
[uurrij, uurkolom]=size(uurmatrix);
c = linspace(0,1,480);
if minuut==60
if uurrij<28800
counter8=0;
for g=(minuut*480)-479:uurrij
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],[(1-c(counter8))...
0 c(counter8)]);
hold on
end
uurmatrix2=filesmaand{countertje+1,1};
for b=1:(28800-uurrij) %continue with the first rows of the next hour
counter8=counter8+1;
scatter(xx,uurmatrix2(b,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
else
if uurrij<28799
counter8=0;
for g=(minuut*480)-479:min(uurrij,(minuut*480))
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
end
%positieanomaly=positiematrix(z);
%mu = [gaussianmat(positieanomaly,1)...
% gaussianmat(positieanomaly,2)];
%Sigma = [gaussianmat(positieanomaly,3)...
% gaussianmat(positieanomaly,4);...
% gaussianmat(positieanomaly,4) gaussianmat(positieanomaly,5)];
%x1 = 0:1:31; x2 = 0:1:80;
%[X1,X2] = meshgrid(x1,x2);
%F = mvnpdf([X1(:) X2(:)],mu,Sigma);
%F = reshape(F,length(x2),length(x1));
%mvncdf([0 0],[1 1],mu,Sigma);
%contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
%xlabel('frequency'); ylabel('amplitude');
%line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
%scatter(gaussianmat(positieanomaly,1),gaussianmat(positieanomaly,2),200,[0 .6 .2],'d','LineWidth',3);
caption=sprintf('high minute correlation, datum: %d-%d-%d %dh, minuut:%d',jaar,maand,dag,uur,minuut);
title(caption, 'FontSize', 15);
hold off
saveas(h,sprintf('highcorrelation_2d_FIG_%d.fig',z));
close all;
end
A.2.14. Cluster Contextual feature vectors
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% contextual anomalies based on 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load indexgaussian5d.mat from gaussian 5d normal
%load N_10_featurevectormatrix
%save N_12_bestModel
%save matrix_to_cluster_standardized
%save matrix_to_cluster
%save minuutmatrix
%save minuutmatrix3
%save N_12_featurevectormatrix
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=6:18
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat...
(features,streepje,year_string,...
streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=...
size(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
gaussianmat=gaussianmat(2401:end,:);
%build a new per-minute feature vector matrix (12 values)
[aantalrijenx5 aantalkolommen]=size(indexgaussian5d);
aantalrijen=aantalrijenx5/5;
minuutmatrix=zeros(aantalrijen,17);
counterminuut=0;
for b=1:aantalrijenx5
rij=ceil(b/5);
if mod(b,5)==0
kolom=5;
else
kolom=mod(b,5);
end
minuutmatrix(rij,kolom)=indexgaussian5d(b);
if mod(b,5)==1
minuutmatrix(rij,13:16)=gaussianmat(b,6:9);
end
if b>1
if gaussianmat(b,9)==gaussianmat(b-1,9);
counterminuut=counterminuut+1;
minuutmatrix(rij,17)=ceil(counterminuut/5);
else
counterminuut=1;
minuutmatrix(rij,17)=counterminuut;
end
else
counterminuut=1;
minuutmatrix(rij,17)=counterminuut;
end
end
minuutmatrix2=minuutmatrix;
minuutmatrix3=zeros(aantalrijen,25);
for w=1:aantalrijen
for a=1:5
minuutmatrix2(w,(a*2)-...
1)=N_10_featurevectormatrix(minuutmatrix(w,a),1);
minuutmatrix2(w,(a*2))=N_10_featurevectormatrix...
(minuutmatrix(w,a),2);
minuutmatrix3(w,(a*5)-4)=N_10_featurevectormatrix...
(minuutmatrix(w,a),1);
minuutmatrix3(w,(a*5)-3)=N_10_featurevectormatrix...
(minuutmatrix(w,a),2);
minuutmatrix3(w,(a*5)-2)=N_10_featurevectormatrix...
(minuutmatrix(w,a),3);
minuutmatrix3(w,(a*5)-1)=N_10_featurevectormatrix...
(minuutmatrix(w,a),4);
minuutmatrix3(w,(a*5))=N_10_featurevectormatrix...
(minuutmatrix(w,a),5);
end
hour=minuutmatrix(w,16);
minuut=minuutmatrix(w,17);
hour_float=hour+(minuut/6)/10;
timeangle=(360/24)*hour_float;
minuutmatrix2(w,11)=cosd(timeangle);
minuutmatrix2(w,12)=sind(timeangle);
end
minuutmatrix=minuutmatrix2;
matrix_to_cluster=minuutmatrix2(:,1:12);
matrix_standardize1=matrix_to_cluster(:,1:10);
matrix_standardize2=matrix_to_cluster(:,11:12);
gem1=mean2(matrix_standardize1);
stdev1=std2(matrix_standardize1);
gem2=mean2(matrix_standardize2);
stdev2=std2(matrix_standardize2);
matrix_standardize1=(matrix_standardize1-gem1)/stdev1;
matrix_standardize2=(matrix_standardize2-gem2)/stdev2;
matrix_to_cluster(:,1:10)=matrix_standardize1;
matrix_to_cluster(:,11:12)=matrix_standardize2;
matrix_to_cluster_standardized=matrix_to_cluster;
% norm1 = matrix_normalize1 - min(matrix_normalize1(:));
% norm1 = norm1 ./ max(norm1(:));
% norm2 = matrix_normalize2 - min(matrix_normalize2(:));
% norm2 = norm2 ./ max(norm2(:));
% matrix_to_cluster_normalized=zeros(502672,12);
% matrix_to_cluster_normalized(:,1:10)=norm1;
% matrix_to_cluster_normalized(:,11:12)=norm2;
%now cluster this matrix with a GMM
N_12_kmax=60;
N_12_maxiter=100;
N_12_replicates=10;
%Create and select the best GMM with AIC and determine k
N_12_AIC=zeros(1,N_12_kmax);
N_12_GMModels=cell(1,N_12_kmax);
N_12_options=statset('MaxIter',N_12_maxiter);
%'Replicates',N_10_replicates,
for k=1:N_12_kmax
N_12_GMModels{k}=fitgmdist(matrix_to_cluster,k...
,'MaxIter',N_12_maxiter,'RegularizationValue'...
,0.1,'Start','randSample','CovarianceType','full');
N_12_AIC(k)=N_12_GMModels{k}.AIC;
save('N_12_GMModels.mat','N_12_GMModels','-v7.3');
end
[minAIC,numComponents]=min(N_12_AIC);
N_12_bestModel=N_12_GMModels{numComponents};
N_12_featurevectormatrix=zeros(numComponents,12);
for c=1:numComponents
component_mu=N_12_bestModel.mu(c,:);
component_sigma=N_12_bestModel.Sigma(:,:,c);
for t=1:12
N_12_featurevectormatrix(c,t)=component_mu(1,t) ;
end
end
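The contextual feature vector encodes the time of day as cosd/sind of an angle, so 23:59 and 00:00 end up as neighbours on the unit circle rather than 24 hours apart. A quick Python check of that encoding (radians here, degrees in the MATLAB code):

```python
import math

def time_to_circle(hour, minute):
    # map time of day onto the unit circle; one full turn = 24 hours
    hour_float = hour + minute / 60.0
    angle = 2 * math.pi * hour_float / 24.0
    return math.cos(angle), math.sin(angle)

p_midnight = time_to_circle(0, 0)    # (1.0, 0.0)
p_late = time_to_circle(23, 59)
gap = math.dist(p_midnight, p_late)  # small: the two times are adjacent
```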
A.2.15. Define contextual anomalies based on clustering
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Posterior contextual anomaly 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%open N_12_bestModel from contextual 5d
%open matrix_to_cluster_standardized
%load minuutmatrix
%save contextualanomaly5d
%save contextualanomalies5d
%save contextualanomalydata5d
%save contextualanomalydatums5d
%save indexgaussian5d
%save featurevectorgaussian
[rijen kolommen]=size(matrix_to_cluster_standardized);
y=pdf(N_12_bestModel,matrix_to_cluster_standardized);
percentielthreshold= prctile(y,0.05);
minderdan=sum(y<percentielthreshold);
contextualanomaly5d=matrix_to_cluster_standardized(:,1);
contextualanomalydata5d=minuutmatrix(:,13:16);
contextualanomalies5d=[];
contextualanomalydatums5d=[];
for u=1:rijen
if y(u)<percentielthreshold
contextualanomaly5d(u)=1;
contextualanomalies5d=[contextualanomalies5d...
;minuutmatrix(u,1:12)];
contextualanomalydatums5d=[contextualanomalydatums5d...
;contextualanomalydata5d(u,:) y(u) u];
else
contextualanomaly5d(u)=0;
contextualanomalydata5d(u,:)=0;
end
end
contextualanomalydatums5d=sortrows(contextualanomalydatums5d,5);
[pointanomalies kolomblabla]=size(contextualanomalydatums5d);
indexgaussian5d=cluster(N_12_bestModel,matrix_to_cluster_standardized);
featurevectorgaussian5d=zeros(rijen,12);
for h=1:rijen
index=indexgaussian5d(h);
featurevectorgaussian5d(h,:)=N_12_bestModel.mu(index,:);
end
%plot the gmm and the anomaly points
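A contextual anomaly is a minute whose density under the fitted GMM falls below the 0.05th percentile. The thesis evaluates pdf(N_12_bestModel, ...) in 12 dimensions; the one-dimensional standard-library stand-in below keeps the idea visible (the weights, components, and samples are invented):

```python
from statistics import NormalDist

def gmm_pdf(x, weights, components):
    # density of a one-dimensional Gaussian mixture
    return sum(w * NormalDist(mu, sd).pdf(x)
               for w, (mu, sd) in zip(weights, components))

weights = [0.5, 0.5]
components = [(0.0, 1.0), (10.0, 1.0)]  # two well-separated modes
samples = [0.1, 9.8, 5.0]               # the last sits between the modes
dens = [gmm_pdf(x, weights, components) for x in samples]
threshold = sorted(dens)[0]             # flag only the lowest-density sample
flagged = [i for i, d in enumerate(dens) if d <= threshold]
```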
A.2.16. Plot spectrograms of contextual anomalies
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% spectrograms contextual anomalies 5d %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%load contextualanomaly5d
%load contextualanomalies5d
%load contextualanomalydata5d
%load contextualanomalydatums5d
%load minuutmatrix3
clc
year=2014;
zero=num2str(0);
streepje=('-');
formatmat=('.mat');
puntmat=('.mat');
features=('features');
gaussiancel=cell(1,2);
counter1=0;
counter2=0;
for maand=6:18 %loop over months 10/2014 through 10/2015
month=maand+4;
if month>=13
year=2015;
month=month-12;
end
month_string=num2str(month);
if month<10
month_string=strcat(zero,month_string);
end
year_string=num2str(year);
feature_file=strcat(features,streepje,...
year_string,streepje,month_string,formatmat);
feature_file=matfile(feature_file);
N_7_featurevectorcell=feature_file.N_7_featurevectorcell;
[featurevectorcell_rijen, featurevectorcell_kolommen]=size...
(N_7_featurevectorcell);
for k=1:featurevectorcell_rijen
counter1=counter1+1;
gaussiancel{counter1,1}=N_7_featurevectorcell{k,1};
gaussiancel{counter1,2}=N_7_featurevectorcell{k,2};
end
end
gaussianmat=cell2mat(gaussiancel);
gaussianmat=gaussianmat(2401:end,:);
N_10_matrix=gaussianmat(:,6:9);
[rijen, kolommen]=size(contextualanomaly5d);
aantalanomaliesvorig=0;
aantalanomalies=0;
eind=(rijen)-1;
%if there is an anomaly, which minute of that hour is it?
anomalycounter=0;
[rijendata kolomendata]=size(contextualanomalydatums5d);
welkeminuut=zeros(rijendata,5);
for n=1:rijendata
anomalycounter=anomalycounter+1;
welkeminuut(anomalycounter,1)=contextualanomalydatums5d(n,1);
welkeminuut(anomalycounter,2)=contextualanomalydatums5d(n,2);
welkeminuut(anomalycounter,3)=contextualanomalydatums5d(n,3);
welkeminuut(anomalycounter,4)=contextualanomalydatums5d(n,4);
uur=contextualanomalydatums5d(n,4);
hour=uur;
nogeencounter=0;
idx2=contextualanomalydatums5d(n,6);
while hour==uur
nogeencounter=nogeencounter+1;
blabla=idx2-nogeencounter;
hour=N_10_matrix(blabla,4);
end
welkeminuut(anomalycounter,5)=ceil(nogeencounter/5);
end
[rijenanomalies,kolommenanomalies]=size(welkeminuut);
%for z=1:3
for z=1:rijenanomalies
jaar=welkeminuut(z,1);
maand=welkeminuut(z,2);
dag=welkeminuut(z,3);
uur=welkeminuut(z,4);
minuut=welkeminuut(z,5);
jaar_string=num2str(jaar);
maand_string=num2str(maand);
if maand<10
maand_string=strcat(zero,maand_string);
end
file=strcat(jaar_string,streepje,maand_string,formatmat);
file=matfile(file);
filesmaand=file.filesmaand;
[rijenfilesmaand, kolommenfilesmaand]=size(filesmaand);
vector=[jaar maand dag uur];
countertje=1;
h=1;
blabla1=filesmaand{h,2};
while isequal(blabla1,vector)==0
h=h+1;
blabla1=filesmaand{h,2};
countertje=countertje+1;
end
uurmatrix=filesmaand{countertje,1};
xx=linspace(1,31,31);
h=figure;
[uurrij, uurkolom]=size(uurmatrix);
c = linspace(0,1,480);
if minuut==60
if uurrij<28800
counter8=0;
for g=(minuut*480)-479:uurrij
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
uurmatrix2=filesmaand{countertje+1,1};
%continue this minute with the first rows of the next hour's matrix
for b=1:(28800-uurrij)
counter8=counter8+1;
scatter(xx,uurmatrix2(b,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
else
if uurrij<28799
counter8=0;
for g=(minuut*480)-479:min(uurrij,(minuut*480))
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
else
counter8=0;
for g=(minuut*480)-479:(minuut*480)
counter8=counter8+1;
scatter(xx,uurmatrix(g,:),[],...
[(1-c(counter8)) 0 c(counter8)]);
hold on
end
end
end
for t=1:5
hold on;
mu = [minuutmatrix3(z,(t*5)-4) minuutmatrix3(z,(t*5)-3)];
Sigma = [minuutmatrix3(z,(t*5)-2) ...
minuutmatrix3(z,(t*5)-1); minuutmatrix3(z,(t*5)-1)...
minuutmatrix3(z,(t*5))];
x1 = 0:1:31; x2 = 0:1:80;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
contour(x1,x2,F,[.0001 .001 .01 .05:.1:.95 .99 .999 .9999]);
xlabel('frequency'); ylabel('amplitude');
line([0 0 1 1 0],[1 0 0 1 1],'linestyle','--','color','k');
hold on;
scatter(mu(1), mu(2),200,[0 .6 .2],'d','LineWidth',3);
end
caption=sprintf('contextual anomaly, date: %d-%d-%d %dh, minute: %d',...
jaar,maand,dag,uur,minuut);
title(caption, 'FontSize', 15);
hold off
saveas(h,sprintf('contextual_anomalies_FIG_%d.fig',z));
close all;
end
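The contour overlay in the loop above evaluates a bivariate normal density (mvnpdf) over a frequency–amplitude grid for each of the five fitted Gaussians. A self-contained sketch of that density evaluation, with an illustrative mean and covariance rather than the fitted minuutmatrix3 parameters:

```python
import math

def mvn_pdf_2d(x, y, mu, sigma):
    """Bivariate normal density at (x, y); sigma is a symmetric 2x2 covariance."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21          # determinant of the covariance
    dx, dy = x - mu[0], y - mu[1]
    # squared Mahalanobis distance, using the closed-form 2x2 inverse
    q = (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# illustrative parameters: frequency bin 15, amplitude 40, mild correlation
mu = (15.0, 40.0)
sigma = ((4.0, 1.0), (1.0, 9.0))
peak = mvn_pdf_2d(mu[0], mu[1], mu, sigma)  # maximum: 1 / (2*pi*sqrt(det))
```

Evaluating this function on a meshgrid and drawing level sets reproduces the contour/ellipse plot that MATLAB builds with meshgrid, mvnpdf, and contour.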