Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co...
Transcript of Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co...
![Page 1: Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co 1) 2) 3) us fr an w of ca ba m F g. 1). That is mplexity im mples. A criticism](https://reader035.fdocuments.in/reader035/viewer/2022070912/5fb45404fde0482e5b6b7815/html5/thumbnails/1.jpg)
dathsurepofedaFuOanpeacTheran
su
atsccoCmdeditethabHan
clplHhadiWtwpr
woCoLt
DeCeLo
To
Col.g
Abstract—Thata would makheir atypicalityupport the depresentation ootential uncomatures on a atasets, feedinunction (RBF)ur range of p
nomalous classerforming feactivations, resihis improves rror magnitudnomaly detecti
Index Terms—upport vector m
Anomalous dtypical relativcenario can onsidered inonsider, for e
mass transportetect people wisease controemperature pahreats, or we mbnormality as
Here we are cnomaly detect
Many of tlassification alaced on th
However, theseas normal datairectly learnt
Within the fielwo distinct rrobabilistic
Manuscript receork was supporteouncil (EPSRC) utd.
J. T. A. Andrepartment of Statentre, Universityondon, UK (e-ma
E. J. Morton orrance, CA 9050
L. D. Griffin isollege London, [email protected].
D
he most generake no assump
y. For such a sdetection, is pof auto-encode
mmitted solutiorange of pro
ng the featur) ν -Support Vproblems varyses. Assessed aature to be idual error veupon the use
de, which hasion.
—Anomaly demachines.
I. INTR
data are looseve to a normainfluence wh
n making aexample, a thetation-screeninwith abnormaol, or we matterns across may wish to a way to dire
concerned wition. the successesare feature de engineerine approaches a. Instead, wefrom the datd of unsupervroutes one mgraphical m
eived October 26ed by the Enginunder CASE Awa
rews is with thtistical Science, a
y College Londoail: jerone.andrewis with Rapisca
03 USA (e-mail: es with the Depart66-72 Gower Str.uk).
Detectin
J
al mode of detions regardinsystem, choosiproblematic. er artificial neon to this. We oblems derivedres into one-cVector Machiny in diversity across the ranga late fusion
ectors and theof auto-encod
s previously b
etection, auto-
RODUCTION ely defined as al class. Howhich aspects an evaluationermal imagingng context. W
al body tempemay wish to
the body inddetect peopleect visual inspith this third
s in patterndependent, thung of repres
cannot be use propose to uta without thvised feature may follow:
models, and
6, 2015; revised Jeering and Physard Grant 157760
he Department oand Security Scieon, 66-72 Gowe
[email protected]). an Systems Ltdemorton@rapiscatment of Compu
reet, WC1E 6BT
ng Ano
Jerone T. A.
etecting anomng them othering features, toThe hidden eural networkhave assessed
d from two iclass Radial e (SVM) classof the normage, we find then of hidden raw input sigder residual vbeen proposed
-encoders, one
data items thawever, the spe
of the datan of atypicg system usedWe may wis
erature patterndetect abno
dicating conce with any kinpection efficie
type of ‘ne
n recognition us much effosentative featsed when onese features tha
he need for lalearning, therthe first liethe second
January 12, 2016ical Sciences Re
0 and Rapiscan Sy
of Computer Scence Doctoral Tr
er Street, WC1E ., 2805 Columbansystems.com).uter Science, Univ, London, UK (e
omalous
Andrews, Ed
malous r than o best layer
ks is a these
image Basis ifiers.
al and e best layer
gnals. vector d for
e-class
at are ecific a are cality. d in a sh to ns for ormal cealed nd of ently. utral’
and ort is tures. only at are abels. re are es in d in
6. This esearch ystems
cience, raining
E 6BT,
bia St,
versity e-mail:
compprobdistrreaso
Cuemplneurasses(the featuencocardiinpujustifdetecsampsignaresidthe lfromencoredu‘deepfor Furthnetwhypepossconfnumit isCIFAsinglprovrecogcoun
Fig. N
withthat comppossimag
s Data U
dward J. Mo
putational mobabilistic appribution, whilon, we considurrent solutiloy an unsupe
ral network ssments baseddifference bet
ure to be compoder is compoinality, and a
uts such that fication for ction, is that ples should fal of an anodual magnitudlatent charact
m that of normoders with a suce the compper’ networks
pre-training hermore, ther
work architerparameter oible configur
figurations grmber of hidden
s possible toAR-10 and Nle-layered net
vided in [12] wgnition rates
nterparts.
1. High complex
ow, imagine ahout more qua
the image is posed of imible that the a
ges that are an
Using A
orton, and Le
odels, namelyproach requirle the neural er the neural nons, fitting
ervised auto-enapproach [2
d on the magtween the netwpared to a threosed of inputhidden layerthe outputs
this use of an auto-encofind it difficuomalous sampde to such conteristics of an
mal samples. Insingle hidden putational exps tend to be pr
of each re is the issuetecture seloptimisation rations, wherrows exponenn layers. Moreo attain stateNORB vision tworks. In adwhich shows s
that are co
xity images (left),
an image withlification this normal. If for
mages with hauto-encoder wnomalous in b
Auto-E
ewis D. Griff
y neural netwres a prior network doe
network approthis descrip
ncoder feed-f2]-[9] and
gnitude of thework's input aeshold. A singt and output r that attemptss resemble thauto-encoderder trained soult to reconsple, hence yfigurations, si
nomalous samn our work, wlayer. We dopense of ourrohibitive [10]
hidden laye of the non-tlection whover a size
re the numbntially as w
eover, it is shoe-of-the-art p
tasks, using ddition, empirshallower netwmpetitive wi
and low complex
h a small residmay not be e
r example, ouhigh complexwill find it easbeing of low
Encoder
fin
works [1]. Ton the da
es not; for thoach. ption, typicaforward artificmake anoma residual vectand output), agle-layered autlayers of equ
s to recreate the inputs. Ts, for anomaolely on normstruct the inp
yielding a larince one expe
mples to deviawe only use aut
this in order r system, sin] due to the neyer separatetrivial choice hich requireable space ber of possib
we increase town in [11] thperformance,
only ‘shalloical evidenceworks achievith their deep
xity images (righ
dual magnitudenough evidenur training setxity, then it sy to reconstrucomplexity (s
rs
The ata his
ally cial aly tor s a to-ual the
The aly
mal put rge cts ate to-to
nce eed ely.
of res of
ble the hat in
w’ is ing per
ht).
de, nce t is
is uct see
International Journal of Machine Learning and Computing, Vol. 6, No. 1, February 2016
21doi: 10.18178/ijmlc.2016.6.1.565
![Page 2: Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co 1) 2) 3) us fr an w of ca ba m F g. 1). That is mplexity im mples. A criticism](https://reader035.fdocuments.in/reader035/viewer/2022070912/5fb45404fde0482e5b6b7815/html5/thumbnails/2.jpg)
Ficosa
onreThdimfemhicoleinwvaveco1)
2)
3)
usfranwofcabam
F
ig. 1). That isomplexity imamples.
A criticism on the magnituesiduals of eachis may be isregarding w
magnitude of eatures may b
more, current aidden represeompetent featearning first-onstead employ
we do not needalue. Moreovector, hidden ombinations. O) Using only
under-utilifull vector
) The hiddeneffective ‘n
) Some anrepresentat
To test our hsing two diffeeight containnd MNIST han
we constructedf tight and dases, we consasis of the n
minimal use
Fig. 2. Some typic
s, the magnitumages may be
of making anoude of the resch feature are an impruden
which feature the residual
be more anomapproaches of entation, whicture vector inorder (low-levy a classifier ind to transform ver, this allow
representationOur hypothesey the magnitsing it for thewill give imp
n layer represeneutral’ featurnomalies mations but normhypotheses, werent datasets: ers that may ndwritten dig
d a range of tediverse, normatructed the an
normal class of an an
cal samples from
ude of the resie within the
omaly assessmsidual vector i
summed to gnt thing to d
dimensions vector, and
malous than ithis type fail ch has the pn itself sincevel) features. n place of a ththe residual v
ws us to usen, full residuaes are thus: tude of the e task of anomproved performentations of ares for anomalay have a
mal residuals. we formed sev
low-resolutiobe empty or
its [13]. For eest problems al and anomanomaly detectonly. Our s
nomaly set
the two datasets:freight
idual vector orange for no
ments based sis the fact tha
give a single vdo, since wecontribute to
residuals in in others. Whto make use opotential to e it is capab
Rather, we chreshold, suchvector into a se the input sal vector, and
residual vectmaly detectionmance. an auto-encodely detection. abnormal hi
veral test probon X-ray imagr containing ceach of the datwith combinaaly classes. Ition system oystem does during clas
X-ray transmissicontainers (centr
f low ormal
solely at the value. e are o the some hat is of the be a
ble of could h that single signal
their
tor is n: the
er are
idden
blems ges of cargo, tasets ations In all
on the make
ssifier
conshyperemofromfeatu
Ouemplrecoghiddsignawithconsclassconcclassrepretheiranominpuvectoclassclassattaincomp
WedeteccontaMNIa randiverconsnorm
ion images of nonre), and MNIST h
struction: we erparameters. ove in future
m using an anoure space bounur results loying the fgnition rate
den representaals and residu
h the residuals stituents alonsifier fusion tcatenation. Insifier modelsesentations, anr label outputmaly score. Wut signals, hiors to form sifier model sifier fusion ning higher pared to pre-c
e used two daction system:ainers that mIST handwrittnge of test prse, normal
structed the anmal class only.
n-empty freight chandwritten digits
use it to cThis is a w
work, but weomaly set to cndary. empirically
full residual over using t
ation has benuals; and the h
and input signne. In addititechnique andn the formers trained on nd full residuts, for each s
Whereas in theidden represe
new featureconstruction. technique to
recognitionclassifier featu
II. DAT
atasets (see F: low-resolutimay be emptten digits. Forproblems withand anomaly
nomaly detecti.
containers (left), Xs (right).
choose a smweakness that note that it i
choose feature
support ouvector gives
the residual nefits above uhidden represenal is superiorion, we comd pre-classifier, we built
the input ual vectors, ansample, in ore latter, we centations, ane vector repr
The results be a worka
n rates, parure vector con
TASETS Fig. 2) to asseion X-ray imty or containr each dataseth combinationy classes. Inion system on
X-ray transmissio
mall number t we intend s quite differe
es or a two-cla
ur hypothess an improvmagnitude; tusing the inp
entation togethr to any of tho
mpared a poer feature vect
three separasignals, hidd
nd then combirder to give
concatenated td full residuresentations fshow the po
able alternativrticularly whncatenation.
ess our anomamages of freigning cargo; at we constructns of tight a
n all cases wn the basis of t
on images of emp
of to
ent ass
es: ved the put her ose ost-tor ate
den ine an
the ual for
ost-ve, hen
aly ght and ted and we the
pty
International Journal of Machine Learning and Computing, Vol. 6, No. 1, February 2016
22
A
frramThcocotowsapi
B
A. X-Ray TraThis dataset
eight containail car scanne
moving at spehe dataset coontaining cargontainers conto small differe
while cargo imampled the iixels.
B. MNIST HThis widely
ansmission Imt consists of ers, obtained er. The scanneeds of up toonsists of 256go (non-empttaining no cargences in freighmages also vmages, for e
Handwritten Dused dataset c
mages of FreigX-ray transmfrom a Rapi
ner images ino 60km h wi60 images of ty) and 2560go (empty). Aht containers avary in the cease of expe
Digits consists of ha
ght Containersmission imagiscan Eagle ®ndividual railith 5.6mm p
f freight conta images of fr
All images varyand their furncargo. We deriment, to 3
andwritten dig
s es of ®R60 l cars ixels. ainers reight y due
niture, down-32 9×
gits 0
throufull 14 1×
C.Ou
two 1)
2)
3)
4)
ugh 9 , centredataset of 7
14 pixels.
Anomaly Deur six anomaldatasets wereFtd. Normalanomaly classFdt. Norma(diverse); an(tight). Mtt. Normal anomaly classMdt. Normal7, 9} (diverse
ed and scaled70000 sample
etection Test Ply detection te: l class: empts: non-empty
al class: nonnomaly class
class: hands: handwrittenl class: handwe); anomaly cl
d to similar sies. Images w
Problems est problems d
ty freight confreight contain-empty frei: empty fre
dwritten digitsn digits of {2}written digits olass: handwritt
ize. We use twere resized
derived from t
ntainers (tighners (diverse)ight containe
eight containe
s of {5} (tigh (tight). of odds {1, 3,ten digits of {
the to
the
ht); ). ers ers
ht);
, 5, {2}
![Page 3: Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co 1) 2) 3) us fr an w of ca ba m F g. 1). That is mplexity im mples. A criticism](https://reader035.fdocuments.in/reader035/viewer/2022070912/5fb45404fde0482e5b6b7815/html5/thumbnails/3.jpg)
(tight). 5) Mtd. Normal class: handwritten digits of {5} (tight);
anomaly class: handwritten digits of evens {0, 2, 4, 6, 8} (diverse).
6) Mdd. Normal class: handwritten digits of odds {1, 3, 5, 7, 9} (diverse); anomaly class: handwritten digits of evens {0, 2, 4, 6, 8} (diverse).
For each of these test problems, we sampled without replacement the following sets: Training set. 2048 images from the normal class. Validation set. 2048 images from the anomaly class. Testing set. 512 images from the normal class and
512 images from the anomaly class.
III. ANOMALY DETECTION FRAMEWORK In this section, we will describe our anomaly detection
framework. Although our approach is quite general we will focus on components suitable for images.
Our system performs the following steps given a set of p
training images x1,...,x p :
1) Learn a feature encoding, of the training images, using an unsupervised sparse feed-forward neural network auto-encoder.
2) Extract features for each image using the trained sparse auto-encoder.
3) Train a one-class non-linear Radial Basis Function ν -Support Vector Machine (RBF SVM) [14], [15] classifier to predict the label, normal or anomalous, given the computed features.
Next, we will go on to describing the steps of our system in finer detail.
A single-layered auto-encoder is a type of feed-forward artificial neural network with one hidden layer. An auto-encoder is trained to reconstruct its input signal by finding useful features from the input space. The auto-encoder learns a map from input to representation, where the representation consists of the activations of the m hidden layer units. Concretely, given an input x ∈ Rn , the auto-encoder computes an output y ∈ Rn , via a hidden layer
representation f ∈ Rm . The hidden layer activations are computed from the input according to 1 1( ) ( )f g= +x Wx b , and the output layer from the hidden layer according to y = g(W2 f (x) + b2 ) . Where W1 ∈ Rm×n and W2 ∈ Rn×m are
weight matrices, b1 ∈ Rm and b2 ∈ Rn are bias vectors, and g(z) =1/ (1+ exp(−z)) is our chosen activation function applied to the vector z component-wise.
Auto-encoders apply back-propagation, for training W1,W2 ,b1 , and b2 , by means of gradient descent, in an attempt to achieve y ≈ x for the training data. If the hidden layer has dimensionality not less than the input and output layers ( m ≥ n ) then it is trivial for the auto-encoder to succeed in exactly reproducing its inputs, and nothing of use is learnt. However, if the hidden layer activations are
encouraged to be sparse then its units are forced to learn significant structures within the data. Imposing sparsity, we have a minimisation problem of the form:
minimise xi − yi2 + β KL ρ ρ̂ j( )
j=1
m
∑i=1
p
∑⎧⎨⎪
⎩⎪
⎫⎬⎪
⎭⎪, (1)
where KL ρ ρ̂ j( ) = ρ ln ρ / ρ̂ j( ) + 1− ρ( ) ln 1− ρ( ) / 1− ρ̂ j( )( ) is
our sparsity penalty term, ρ̂ j is the average activation over
the whole training set for hidden unit j , ρ is our (desired)
sparsity level parameter which we constrain ρ̂ j to
approximate, and β is used to control the weight of the sparsity penalty term.
TABLE I: SPARSE AUTO-ENCODER HYPERPARAMETER VALUES
Ftd Fdt Mtt Mdt Mtd Mdd m 576 144 392 392 98 98 ρ 0.1 0.5 0.5 0.5 0.25 0.25 β 4 16 4 4 4 4
λ 10−1 10−3 10−5 10−5 10−5 10−5
Evidently, there are several hyperparameters associated
with the simple single-layered sparse auto-encoders we employ: m (number of units in the hidden layer, excluding the bias unit), λ (weight decay) used during back-propagation, ρ (sparsity level), and β (weight of sparsity penalty term), which were all set to reasonable values based on preliminary results for each test problem. For reproducibility our chosen set of hyperparameter values,
m,ρ,β ,λ{ }, are listed in Table I for each of the six anomaly
detection test problems.
We use our sparse auto-encoder to compute several types of features for use in anomaly detection:
Input (INP): x , (2)
Hidden Representation (HR): f (x) , (3)
Scalar Residual Magnitude (SRM): x − y1, (4)
Signed Residual (R): x − y , (5)
Absolute Residual (AR): x − y , (6)
Squared Residual (SR): x − y( )2, (7)
Normalised Signed Residual (NR): x − y( ) σ , and (8)
Normalised Squared Residual (NSR): x − y( )2σ
2
, (9)
where σ = (1/ p) xk − yk( )2
k=1
p∑ is the vector of root-
mean-square residuals between input and output across the training set. All of the features set out above are vectors of
International Journal of Machine Learning and Computing, Vol. 6, No. 1, February 2016
23
A. Unsupervised Sparse Auto-Encoder Feature Learning
B. Features
![Page 4: Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co 1) 2) 3) us fr an w of ca ba m F g. 1). That is mplexity im mples. A criticism](https://reader035.fdocuments.in/reader035/viewer/2022070912/5fb45404fde0482e5b6b7815/html5/thumbnails/4.jpg)
dimension m , except for the scalar residual magnitude feature.
Consider our set of training samples x1,...,x p and
suppose that these samples are drawn from a probability distribution P , in the feature space. We apply a non-linear one-class classification algorithm, namely a Radial Basis Function ν -Support Vector Machine, [14], [15] in an attempt to estimate the support of this distribution.
This one-class formulation, of the standard two-class SVM procedure, first transforms the feature vector via a non-linear RBF kernel, where the origin is viewed as the sole member of the unknown second class. The one-class SVM gives a function h that outputs +1 in a region that encompasses most of the training samples, and outputs −1 everywhere else.
The objective function to separate the training samples from the origin of our one-class RBF SVM classifier is the following quadratic programming minimisation task:
minω,ξi ,q12
ω 2 + 1ν p
ξi − qi=1
p
∑ (10)
subject to: ω ⋅φ xi( )( ) ≥ q −ξi,∀i =1,..., p, (11)
ξi ≥ 0,∀i =1,..., p, (12)
where φ is a kernel mapping of x i into a dot product space F , q is a bias term, and ν ∈ (0,1) is an upper bound on the fraction of the training samples that are considered to be out-of-class and a lower bound on the fraction of training samples used as Support Vectors (SVs).
Our decision rule, having solved the quadratic programming minimisation problem using Lagrange multipliers, is:
h x( ) = sgn αi ⋅ K x, xi( )i=1
p
∑ − q⎛
⎝⎜
⎞
⎠⎟ , (13)
where the αi are the Lagrange multipliers, and
K xi, x j( ) = exp −γ ⋅ xi − x j2( ),γ > 0, (14)
is the RBF kernel with width parameter γ .
There is no clear means in which to differentiate between alternative one-class RBF SVM models, in the case of classification accuracy ties, when testing on the validation set. Therefore, we employ an ensemble composed of the winning models and take the majority output as the class label. Formally, our procedure for selecting hyperparameter values for ν and γ is as follows: 1) Sample without replacement 1024 normal samples
from the training set. 2) Sample 512 normal samples from the training set and
512 anomalous samples from the validation set.
3) Perform a grid-search over the hyperparameters, and then select all (in the case of ties) hyperparameter tuples that give rise to the highest classification accuracy, for the samples in Step 2.
4) Go back to Step 1 and repeat this process twice more. We now have a set of hyperparameter tuples ν ,γ{ } . We
will train an ensemble of one-class RBF SVMs (using the set of tuples ν ,γ{ } ), on the entire training set of 2048
samples, such that they are now ready to predict the labels of unseen test samples. We combine the label outputs of each one-class RBF SVM, in the ensemble, by taking the majority output as the predicted label of an unseen test sample.
To be clear, we have made use of an anomaly set in order to select hyperparameter values for our one-class RBF SVMs. We emphasise that this is a deficiency within our system, however we do not use the anomaly set to choose features or a two-class feature space boundary. We aim to remove the need for anomalous samples, for classifier hyperparameter optimisation, in future work.
For this study, we utilised the readily available LIBSVM toolbox (Version 3.20) [16], which is an integrated piece of software with built-in distribution estimation (one-class SVM).
IV. EXPERIMENTS AND ANALYSIS Using our anomaly detection framework outlined in
Section III, the experiments that are reported in this paper are as follows: 1) We began by performing a comparison of the features,
(2)-(9), across all six test problems specified in Section III.A. From this it will be possible to evaluate which of the features are better suited as representations for the concept of normality. Furthermore, we may assess whether anomalous samples, do in some cases, give rise to abnormal auto-encoder features but normal residuals, whilst others have abnormal residuals but normal auto-encoder features.
2) Lastly, we performed a comparison of two different ways of combining the best performing feature vectors, namely:
• Pre-classifier feature vector concatenation where we combined the best feature vectors into a single new feature vector.
• Post-classifier fusion whereby a different sparse auto-encoder is trained on each of the chosen feature vectors separately and then we combined the results of the various one-class RBF SVMs to determine whether a sample is anomalous or not.
Our experiments first considered the usefulness of the features: (2)-(9). Table II shows the classification accuracy of each ensemble of one-class RBF SVMs, across the six test problems, on the unseen testing sets described in Section II.C. There are several notable observations that can be taken from Table II: 1) The hidden representation is on average the best
International Journal of Machine Learning and Computing, Vol. 6, No. 1, February 2016
24
C. One-Class RBF SVM Classification
A. Comparison of Features
![Page 5: Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co 1) 2) 3) us fr an w of ca ba m F g. 1). That is mplexity im mples. A criticism](https://reader035.fdocuments.in/reader035/viewer/2022070912/5fb45404fde0482e5b6b7815/html5/thumbnails/5.jpg)
performing feature vector across all six test problems, and most notably it outperforms the raw input signal.
2) The best performing residual is the normalised signed residual vector, most importantly showing itself to be better than the scalar residual magnitude feature.
3) The normalised signed residual vector has superiority over the hidden representation when the normal class is tight or the anomaly class is diverse, exemplified in the comparison of their scores for Ftd with Fdt. For Ftd the hidden representation scores 94.43% while the normalised signed residual vector scores 98.24% ; for Fdt the scores are 92.99% and 73.34% , respectively.
for comparison of features
Having established that input features (2), hidden layer features (3), and reconstruction errors—encoded as normalised signed residuals (8)—are all sometimes effective, but each on different problems, we then considered the most effective means of combining them. We compared pre-classifier feature vector concatenation and post-classifier fusion using a combination of these three features.
1) Pre-Classifier Feature Vector Concatenation (PRE) 1) Create combined feature vectors for the training data
comprised of the features derived from (2), (3), and (8)
by concatenation. 2) Scale feature dimensions so that each has zero-mean
and unit-variance across the training set. 3) Retrieve the predicted class labels for the testing set
given by the ensemble of one-class RBF SVMs trained on normal data, with the combined feature vectors, and take the mode of the predictions, for each sample, as the label.
2) Post-Classifier Fusion (POST): 1) Retrieve the predicted class labels for the testing set
given by the ensemble of one-class RBF SVMs trained on normal data with feature vectors of the form (2). Then compute the mean label, for each sample, given the predictions from the ensemble.
2) Repeat Step 1, but this time for feature vectors of the form (3).
3) Repeat Step 1, but this time for feature vectors of the form (8).
4) Compute the mean of the labels given by Steps 1--3, then label a sample as normal if the output is positive, otherwise label as anomalous. The results in Table III show post-classifier fusion to be
superior to using any of the stand-alone feature vectors: (2), (3), or (8). In contrast, pre-classifier feature vector concatenation is on average the worst performing approach, receiving the lowest classification accuracy.
TABLE II: TEST CLASSIFICATION ACCURACY (%) ON THE TEST PROBLEMS (BOLD INDICATES THE HIGHEST ACCURACY WITHIN A ROW)
Test Problem INP HR SRM R AR SR NR NSR Ftd 98.24 94.43 95.02 98.24 98.93 98.54 98.24 99.22 Fdt 80.37 92.99 81.15 55.08 48.14 50.78 73.34 48.44 Mtt 91.11 91.21 84.96 86.91 88.77 83.59 91.80 90.43 Mdt 86.23 85.06 70.70 79.69 81.35 78.71 83.40 80.08 Mtd 85.74 88.77 87.60 50.00 50.00 86.91 86.33 83.01 Mdd 78.52 75.29 80.66 50.00 50.00 70.12 75.59 74.80
Average 86.70 87.96 83.35 69.99 69.53 78.11 84.78 79.33
TABLE III: TEST CLASSIFICATION ACCURACY (%) ON THE TEST PROBLEMS (BOLD INDICATES THE HIGHEST CLASSIFICATION ACCURACY WITHIN A ROW)
FOR FEATURE VECTOR COMBINATION
Test Problem INP HR SRM NR PRE POST Ftd 98.24 94.43 95.02 98.24 98.24 97.66 Fdt 80.37 92.99 81.15 73.34 51.86 89.55 Mtt 91.11 91.21 84.96 91.80 92.19 92.29 Mdt 86.23 85.06 70.70 83.40 82.91 87.01 Mtd 85.74 88.77 87.60 86.33 86.82 89.75 Mdd 78.52 75.29 80.66 75.59 72.95 78.22
Average 86.70 87.96 83.35 84.78 80.83 89.08
V. SUMMARY In this empirical study, on the detection of anomalous
data using auto-encoders, we have conducted several experiments on a range of anomaly detection test problems. Our experiments compared a selection of different feature vectors derived from a sparse auto-encoder feed-forward neural network. The empirical results appear to support our hypothesis that there is indeed a better way to use residual errors than simply computing the magnitude, and this is most apparent when the normalised signed residual is employed. Furthermore, the results suggest that the hidden layer representation, as a stand-alone feature vector, is more than capable of characterising the fundamental attributes of
normality. Its competence is shown through it having the highest average recognition rate amongst the stand-alone feature vectors. The robustness of the hidden layer representation is best illustrated in situations where the auto-encoder does not struggle to reconstruct anomalous images, and so gives rise to low residuals. For instance, in Fdt, where we have a diverse normal class of high-complexity non-empty cargo containers and an anomaly class of empty cargo containers. In this test problem the auto-encoder is able to reconstruct the low-complexity empty cargo containers, despite having never been trained to do so. However, the units of the hidden representation are activated in an abnormal fashion, and as such we are able to identify a greater number of anomalous images, as opposed to using
International Journal of Machine Learning and Computing, Vol. 6, No. 1, February 2016
25
B. Feature Vector Combination
![Page 6: Detecting An omalous Data Using Auto-Encoders · Fi co sa on re Th di m fe m hi co le in w va ve co 1) 2) 3) us fr an w of ca ba m F g. 1). That is mplexity im mples. A criticism](https://reader035.fdocuments.in/reader035/viewer/2022070912/5fb45404fde0482e5b6b7815/html5/thumbnails/6.jpg)
thcoswaldicocoapreefusbethth
anclwhya reun
in
[1
[2
[3
[4
[5
he residuals toonverse test prwapped. We sll the featureifficult to enontainer imagontainers, wppearance. Noesidual, the hidffectively whesing a post-claenefit from thhat is, some ahose constituen
Our aim in nomaly detectlass. We note
we make use yperparametervalidation se
emove in futnsupervised sy
J. T. A. Anvaluable supp
] Y. Bengio, Areview and nMachine Inte
] N. Japkowicapproach to con Artificial I
] J. M. Ko, J. “Studies of vibridges in HEngineering a
] L. Manevitz neural netwo2007.
] H. Sohn, K. changing env
o make a prroblem, Ftd, w
see in this case vectors, sinncode a moge having b
which are onetheless, thdden represenen all three arassifier fusion
heir individualanomalies mants. this work w
tion without that we devia of an anomrs for one-clas
et is not idealture work soystem.
ACKNOW
Andrews thankport and assist
REFE
A. Courville, and new perspectives,lligence, vol. 35,
cz, C. Myers, aclassification,” inIntelligence, 1995M., Y. Q. Ni, J. ibration-based daHong Kong,” inand Technologicaand M. Yousef,
orks,” Neurocomp
Worden, and Cvironmental con
ediction. Thiswhere the clase a performannce the autoore-complex been trained
relatively hhe use of the ntation and there used in tandn. By doing thl anomaly detay only be a
was to considethe knowledgated from thismaly validatss RBF SVMsl, it is a com
o as to move
WLEDGMENT ks Emma F.tance in review
ERENCES P. Vincent, “Rep” IEEE Trans. ono. 8, pp. 1798-1
and M. Gluck, n Proc. Internati5, pp. 518-523.
Y. Wang, Z. G.amage detection on Proc. Internaal Sciences, Chin“One-class docu
puting, vol. 70,
C. R. Farrar, “Nnditions,” in Pr
s differs fromss labels havence increase ao-encoder fin
non-empty con empty c
homogeneousnormalised si
e input signal dem and combhis, we are abection capabi
apparent in on
er the problege of the anos aim at one sion set to ss. Whilst the u
mponent we aie towards a
. Shapiro forwing this work
presentation learnn Pattern Analys1828, March 2013“A novelty det
ional Joint Conf
. Sun, and X. T. of three cable-supational Conferena, 2000, pp. 105-
ument classificatino. 7, pp. 1466
ovelty detection roc. SPIE 8th A
m the been
across nds it cargo cargo s in igned work bined ble to lities, ne of
em of omaly stage: select use of im to truly
r her k.
ning: A sis and 3. tection ference
Zhou, pported nce on -112. ion via 6-1481,
under Annual
JemMLoapUCotovi
erone T. A. Amathematics from MRes in security
ondon, UK. He ispplied mathematK, jointly supomputer Science
opics of interest ision, similarity le
Andrews has anKing’s College L
y science from s currently workintics at Universitypervised by thee and Statistical are anomaly detearning, and dens
n MSci (Hons) London, UK andUniversity Collng towards a PhDy College Londe Departments Science. His m
tection in compusity estimation.
in d an ege
D in don,
of main uter
International Journal of Machine Learning and Computing, Vol. 6, No. 1, February 2016
26
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
International Sympp. 108-118. R. J. Streifel, R. “Detection of shrotors using novTransactions on1996. C. Surace, K. Wapproach to diagSPIE, vol. 3089,C. Surace and Kdamage in structuEighth Internatio1998, vol. 4, pp. K. Worden, “StJournal of SoundA. Saxe, P. W. K“On random wei28th Internationa1096. A. Coates, A. Ynetworks in unsConference on A223. J. Ba and R. CAdvances in Neu2014. � Y. LeCun, C. Cdatabase of http://yann.lecunB. Schölkopf, R.C. Platt, “Suppo12, pp. 582-588,B. Schölkopf, J.Williamson, “Edistribution,” Ne2001. C. C. Chang, anmachines,” ACTechnology, vol.
mposium on Sma
J. Marks II, M. Ahorted-turns in thvelty detectors-d
n Energy Conver
Worden, and G.gnose damage inpp. 947-953, 199
K. Worden, “A noures: an applicational Conference 64–70.
tructural fault ded and Vibration, vKoh, Z. Chen, Mghts and unsuperal Conference on
Y. Ng, and H. supervised featurArtificial Intellig
Caruana, “Do deural Information
Cortes, and C. Jhandwritten
n.com/exdb/mnist. C. Williamson, ort vector method
1999. . C. Platt, J. ShaEstimating the eural Computatio
nd C. J. Lin, “LIBCM Transaction
2, no. 3, pp. 27,
art Structures an
A. El-Sharkawi, ahe field winding odevelopment andrsion, vol. 11, n
. Tomlinson, “An a cracked beam97. ovelty detection ion to an offshoreof Off-shore and
etection using avol. 201, no. 1, pp
M. Bhand, B. Surrvised feature lean Machine Learni
Lee, “An analyre learning,” in
gence and Statist
eep nets really Processing Syste
J. C. Burges. (1digits. [On
t/ A. J. Smola, J. Sd for novelty det
awe-Taylor, A. Jsupport of a
on, vol. 13, no.
BSVM: A libraryns on Intellig2011.
nd Materials, 20
and I. Kerszenbauof turbine-generad field test,” IEno. 2, pp. 312-3
A novelty detectm,” Proceedings
method to diagne platform,” in Prd Polar Engineeri
novelty measurp. 85-101, 1997. resh, and A. Y. Nrning,” in Proc. ing, 2011, pp. 10
ysis of single-laProc. Internatio
tics, 2011, pp. 2
need to be deepems, pp. 2654-26
1998). The MNInline]. Availab
Shawe-Taylor, andtection,” NIPS, v
J. Smola, and R.a high-dimensio
7, pp. 1443-14
y for support vecgent Systems a
001,
um, ator
EEE 317,
tion s of
nose roc. ing,
re,”
Ng, the 89-
ayer onal 15-
p?” 662,
IST ble:
d J. vol.
C. onal 471,
ctor and