Detecting Anomalous Data Using Auto-Encoders

Jerone T. A. Andrews, Edward J. Morton, and Lewis D. Griffin

Abstract—The most general mode of detecting anomalous data would make no assumptions regarding them other than their atypicality. For such a system, choosing features, to best support the detection, is problematic. The hidden layer representation of auto-encoder artificial neural networks is a potential uncommitted solution to this. We have assessed these features on a range of problems derived from two image datasets, feeding the features into one-class Radial Basis Function (RBF) ν-Support Vector Machine (SVM) classifiers. Our range of problems varies in diversity of the normal and anomalous classes. Assessed across the range, we find the best performing feature to be a late fusion of hidden layer activations, residual error vectors and the raw input signals. This improves upon the use of auto-encoder residual vector error magnitude, which has previously been proposed for anomaly detection.

Index Terms—Anomaly detection, auto-encoders, one-class support vector machines.

I. INTRODUCTION

Anomalous data are loosely defined as data items that are atypical relative to a normal class. However, the specific scenario can influence which aspects of the data are considered in making an evaluation of atypicality. Consider, for example, a thermal imaging system used in a mass transportation-screening context. We may wish to detect people with abnormal body temperature patterns for disease control, or we may wish to detect abnormal temperature patterns across the body indicating concealed threats, or we may wish to detect people with any kind of abnormality as a way to direct visual inspection efficiently. Here we are concerned with this third type of 'neutral' anomaly detection.

Many of the successes in pattern recognition and classification are feature dependent, thus much effort is placed on the engineering of representative features. However, these approaches cannot be used when one only has normal data. Instead, we propose to use features that are directly learnt from the data without the need for labels. Within the field of unsupervised feature learning, there are two distinct routes one may follow: the first lies in probabilistic graphical models, and the second in computational models, namely neural networks [1]. The probabilistic approach requires a prior on the data distribution, while the neural network does not; for this reason, we consider the neural network approach.

Current solutions, fitting this description, typically employ an unsupervised auto-encoder feed-forward artificial neural network approach [2]-[9] and make anomaly assessments based on the magnitude of the residual vector (the difference between the network's input and output), as a feature to be compared to a threshold. A single-layered auto-encoder is composed of input and output layers of equal cardinality, and a hidden layer that attempts to recreate the inputs such that the outputs resemble the inputs. The justification for this use of auto-encoders, for anomaly detection, is that an auto-encoder trained solely on normal samples should find it difficult to reconstruct the input signal of an anomalous sample, hence yielding a large residual magnitude for such configurations, since one expects the latent characteristics of anomalous samples to deviate from those of normal samples.

In our work, we only use auto-encoders with a single hidden layer. We do this in order to reduce the computational expense of our system, since 'deeper' networks tend to be prohibitive [10] due to the need for pre-training of each hidden layer separately. Furthermore, there is the issue of the non-trivial choice of network architecture selection, which requires hyperparameter optimisation over a sizeable space of possible configurations, where the number of possible configurations grows exponentially as we increase the number of hidden layers. Moreover, it is shown in [11] that it is possible to attain state-of-the-art performance, in CIFAR-10 and NORB vision tasks, using only 'shallow' single-layered networks. In addition, empirical evidence is provided in [12] which shows shallower networks achieving recognition rates that are competitive with their deeper counterparts.

Fig. 1. High complexity images (left), and low complexity images (right).

Now, imagine an image with a small residual magnitude; without more qualification this may not be enough evidence that the image is normal. If, for example, our training set is composed of images with high complexity, then it is possible that the auto-encoder will find it easy to reconstruct images that are anomalous in being of low complexity (see Fig. 1). That is, the magnitude of the residual vector of low complexity images may be within the range for normal samples.

Manuscript received October 26, 2015; revised January 12, 2016. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) under CASE Award Grant 157760 and Rapiscan Systems Ltd.
J. T. A. Andrews is with the Departments of Computer Science and Statistical Science, and the Security Science Doctoral Training Centre, University College London, 66-72 Gower Street, WC1E 6BT, London, UK (e-mail: jerone.andrews@cs.ucl.ac.uk).
E. J. Morton is with Rapiscan Systems Ltd., 2805 Columbia St, Torrance, CA 90503 USA (e-mail: emorton@rapiscansystems.com).
L. D. Griffin is with the Department of Computer Science, University College London, 66-72 Gower Street, WC1E 6BT, London, UK (e-mail: l.griffin@cs.ucl.ac.uk).


A criticism of making anomaly assessments based solely on the magnitude of the residual vector is the fact that the residuals of each feature are summed to give a single value. This may be an imprudent thing to do, since we are disregarding which feature dimensions contribute to the magnitude of the residual vector, and residuals in some features may be more anomalous than in others. What is more, current approaches of this type fail to make use of the hidden representation, which has the potential to be a competent feature vector in itself since it is capable of learning first-order (low-level) features. Rather, we could instead employ a classifier in place of a threshold, such that we do not need to transform the residual vector into a single value. Moreover, this allows us to use the input signal representation, full residual vector, and hidden representation, and their combinations. Our hypotheses are thus:
1) Using only the magnitude of the residual vector is under-utilising it for the task of anomaly detection: the full residual vector will give improved performance.
2) The hidden layer representations of an auto-encoder are effective 'neutral' features for anomaly detection.
3) Some anomalies may have abnormal hidden representations but normal residuals.

To test our hypotheses, we formed several test problems using two different datasets: low-resolution X-ray images of freight containers that may be empty or containing cargo, and MNIST handwritten digits [13]. For each of the datasets we constructed a range of test problems with combinations of tight and diverse, normal and anomaly classes. In all cases, we constructed the anomaly detection system on the basis of the normal class only. Our system does make minimal use of an anomaly set during classifier construction: we use it to choose a small number of hyperparameters. This is a weakness that we intend to remove in future work, but we note that it is quite different from using an anomaly set to choose features or a two-class feature space boundary.

Our results empirically support our hypotheses: employing the full residual vector gives an improved recognition rate over using the residual magnitude; the hidden representation has benefits above using the input signals and residuals; and the hidden representation together with the residuals and input signal is superior to any of those constituents alone. In addition, we compared a post-classifier fusion technique and pre-classifier feature vector concatenation. In the former, we built three separate classifier models trained on the input signals, hidden representations, and full residual vectors, and then combined their label outputs, for each sample, in order to give an anomaly score. Whereas in the latter, we concatenated the input signals, hidden representations, and full residual vectors to form new feature vector representations for classifier model construction. The results show the post-classifier fusion technique to be a workable alternative, attaining higher recognition rates, particularly when compared to pre-classifier feature vector concatenation.

II. DATASETS

We used two datasets (see Fig. 2) to assess our anomaly detection system: low-resolution X-ray images of freight containers that may be empty or containing cargo; and MNIST handwritten digits. For each dataset we constructed a range of test problems with combinations of tight and diverse, normal and anomaly classes. In all cases we constructed the anomaly detection system on the basis of the normal class only.

Fig. 2. Some typical samples from the two datasets: X-ray transmission images of non-empty freight containers (left), X-ray transmission images of empty containers (centre), and MNIST handwritten digits (right).

A. X-Ray Transmission Images of Freight Containers

This dataset consists of X-ray transmission images of freight containers, obtained from a Rapiscan Eagle® R60 rail car scanner. The scanner images individual rail cars moving at speeds of up to 60 km/h with 5.6 mm pixels. The dataset consists of 2560 images of freight containers containing cargo (non-empty) and 2560 images of freight containers containing no cargo (empty). All images vary due to small differences in freight containers and their furniture, while cargo images also vary in the cargo. We down-sampled the images, for ease of experiment, to 32×9 pixels.

B. MNIST Handwritten Digits

This widely used dataset consists of handwritten digits 0 through 9, centred and scaled to similar size. We use the full dataset of 70000 samples. Images were resized to 14×14 pixels.

C. Anomaly Detection Test Problems

Our six anomaly detection test problems derived from the two datasets were:
1) Ftd. Normal class: empty freight containers (tight); anomaly class: non-empty freight containers (diverse).
2) Fdt. Normal class: non-empty freight containers (diverse); anomaly class: empty freight containers (tight).
3) Mtt. Normal class: handwritten digits of {5} (tight); anomaly class: handwritten digits of {2} (tight).
4) Mdt. Normal class: handwritten digits of odds {1, 3, 5, 7, 9} (diverse); anomaly class: handwritten digits of {2} (tight).

5) Mtd. Normal class: handwritten digits of {5} (tight); anomaly class: handwritten digits of evens {0, 2, 4, 6, 8} (diverse).

6) Mdd. Normal class: handwritten digits of odds {1, 3, 5, 7, 9} (diverse); anomaly class: handwritten digits of evens {0, 2, 4, 6, 8} (diverse).

For each of these test problems, we sampled without replacement the following sets:
Training set. 2048 images from the normal class.
Validation set. 2048 images from the anomaly class.
Testing set. 512 images from the normal class and 512 images from the anomaly class.
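As a concrete illustration of this sampling protocol, the following NumPy sketch draws the three sets without replacement; the arrays normal_images and anomaly_images are hypothetical stand-ins for the per-problem normal and anomaly classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_splits(normal_images, anomaly_images):
    """Draw the training/validation/testing sets without replacement.

    normal_images and anomaly_images are arrays of shape (N, n_pixels).
    """
    norm_idx = rng.permutation(len(normal_images))
    anom_idx = rng.permutation(len(anomaly_images))
    train = normal_images[norm_idx[:2048]]              # 2048 normal
    validation = anomaly_images[anom_idx[:2048]]        # 2048 anomalous
    test_normal = normal_images[norm_idx[2048:2560]]    # 512 normal
    test_anomaly = anomaly_images[anom_idx[2048:2560]]  # 512 anomalous
    return train, validation, test_normal, test_anomaly
```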

III. ANOMALY DETECTION FRAMEWORK

In this section, we will describe our anomaly detection framework. Although our approach is quite general, we will focus on components suitable for images.

Our system performs the following steps given a set of p training images x_1, ..., x_p:

1) Learn a feature encoding of the training images using an unsupervised sparse feed-forward neural network auto-encoder.
2) Extract features for each image using the trained sparse auto-encoder.
3) Train a one-class non-linear Radial Basis Function ν-Support Vector Machine (RBF SVM) [14], [15] classifier to predict the label, normal or anomalous, given the computed features.

Next, we describe the steps of our system in finer detail.

A. Unsupervised Sparse Auto-Encoder Feature Learning

A single-layered auto-encoder is a type of feed-forward artificial neural network with one hidden layer. An auto-encoder is trained to reconstruct its input signal by finding useful features from the input space. The auto-encoder learns a map from input to representation, where the representation consists of the activations of the m hidden layer units. Concretely, given an input x ∈ R^n, the auto-encoder computes an output y ∈ R^n via a hidden layer representation f ∈ R^m. The hidden layer activations are computed from the input according to f(x) = g(W_1 x + b_1), and the output layer from the hidden layer according to y = g(W_2 f(x) + b_2), where W_1 ∈ R^{m×n} and W_2 ∈ R^{n×m} are weight matrices, b_1 ∈ R^m and b_2 ∈ R^n are bias vectors, and g(z) = 1/(1 + exp(−z)) is our chosen activation function applied to the vector z component-wise.

Auto-encoders apply back-propagation, for training W_1, W_2, b_1, and b_2, by means of gradient descent, in an attempt to achieve y ≈ x for the training data. If the hidden layer has dimensionality not less than the input and output layers (m ≥ n) then it is trivial for the auto-encoder to succeed in exactly reproducing its inputs, and nothing of use is learnt. However, if the hidden layer activations are

encouraged to be sparse then its units are forced to learn significant structures within the data. Imposing sparsity, we have a minimisation problem of the form:

minimise { Σ_{i=1}^{p} ‖x_i − y_i‖² + β Σ_{j=1}^{m} KL(ρ ‖ ρ̂_j) },  (1)

where KL(ρ ‖ ρ̂_j) = ρ ln(ρ/ρ̂_j) + (1 − ρ) ln((1 − ρ)/(1 − ρ̂_j)) is our sparsity penalty term, ρ̂_j is the average activation over the whole training set for hidden unit j, ρ is our (desired) sparsity level parameter which we constrain ρ̂_j to approximate, and β is used to control the weight of the sparsity penalty term.
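For concreteness, the following is a minimal NumPy sketch of training such a sparse auto-encoder by gradient descent on objective (1); the optimiser details (full-batch updates, a fixed learning rate, the initialisation scale) are our own illustrative choices rather than the paper's exact training code, and the weight-decay term λ introduced below is added directly to the weight gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, m, rho, beta, lam, lr=1e-3, n_iter=5000, seed=0):
    """Single-hidden-layer sparse auto-encoder, full-batch gradient descent
    on objective (1) plus an L2 weight-decay term (lam/2)*(|W1|^2 + |W2|^2)."""
    p, n = X.shape
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.01, (m, n))
    b1 = np.zeros(m)
    W2 = rng.normal(0.0, 0.01, (n, m))
    b2 = np.zeros(n)
    for _ in range(n_iter):
        # Forward pass: f = g(W1 x + b1), y = g(W2 f(x) + b2).
        F = sigmoid(X @ W1.T + b1)          # (p, m) hidden activations
        Y = sigmoid(F @ W2.T + b2)          # (p, n) reconstructions
        rho_hat = F.mean(axis=0)            # average activation of each unit

        # Back-propagate the squared reconstruction error term of (1).
        dY = 2.0 * (Y - X) * Y * (1.0 - Y)
        gW2 = dY.T @ F + lam * W2
        gb2 = dY.sum(axis=0)

        # KL sparsity penalty: dKL/drho_hat_j = -rho/rho_hat_j + (1-rho)/(1-rho_hat_j),
        # and drho_hat_j/dF_ij = 1/p, broadcast over all samples i.
        dKL = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat)) / p
        dF = dY @ W2 + dKL
        dH = dF * F * (1.0 - F)
        gW1 = dH.T @ X + lam * W1
        gb1 = dH.sum(axis=0)

        for param, grad in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
            param -= lr * grad              # in-place gradient step
    return W1, b1, W2, b2
```

The per-problem values of m, ρ, β, and λ are those listed in Table I.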

TABLE I: SPARSE AUTO-ENCODER HYPERPARAMETER VALUES

      Ftd     Fdt     Mtt     Mdt     Mtd     Mdd
m     576     144     392     392     98      98
ρ     0.1     0.5     0.5     0.5     0.25    0.25
β     4       16      4       4       4       4
λ     10⁻¹    10⁻³    10⁻⁵    10⁻⁵    10⁻⁵    10⁻⁵

Evidently, there are several hyperparameters associated with the simple single-layered sparse auto-encoders we employ: m (number of units in the hidden layer, excluding the bias unit), λ (weight decay) used during back-propagation, ρ (sparsity level), and β (weight of sparsity penalty term), which were all set to reasonable values based on preliminary results for each test problem. For reproducibility, our chosen set of hyperparameter values, {m, ρ, β, λ}, are listed in Table I for each of the six anomaly detection test problems.

B. Features

We use our sparse auto-encoder to compute several types of features for use in anomaly detection:

Input (INP): x,  (2)
Hidden Representation (HR): f(x),  (3)
Scalar Residual Magnitude (SRM): ‖x − y‖₁,  (4)
Signed Residual (R): x − y,  (5)
Absolute Residual (AR): |x − y|,  (6)
Squared Residual (SR): (x − y)²,  (7)
Normalised Signed Residual (NR): (x − y)/σ,  (8)
Normalised Squared Residual (NSR): (x − y)²/σ²,  (9)

where σ = √((1/p) Σ_{k=1}^{p} (x_k − y_k)²) is the vector of root-mean-square residuals between input and output across the training set, with the squaring, division, and square root applied component-wise. All of the features set out above are vectors (of dimension n, or m in the case of the hidden representation), except for the scalar residual magnitude feature, which is a scalar.
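The features then follow directly from these definitions. The sketch below assumes the sigmoid network weights from the previous sketch and computes all of (2)-(9) for a single input, using the ℓ1 norm for the scalar residual magnitude as in (4).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rms_residual(X, W1, b1, W2, b2):
    """Component-wise root-mean-square residual sigma over the training set."""
    F = sigmoid(X @ W1.T + b1)
    Y = sigmoid(F @ W2.T + b2)
    return np.sqrt(np.mean((X - Y) ** 2, axis=0))

def extract_features(x, W1, b1, W2, b2, sigma):
    """Compute the feature vectors (2)-(9) for a single input x."""
    f = sigmoid(W1 @ x + b1)       # hidden representation
    y = sigmoid(W2 @ f + b2)       # reconstruction
    r = x - y                      # signed residual
    return {
        "INP": x,                  # (2)
        "HR": f,                   # (3)
        "SRM": np.abs(r).sum(),    # (4) scalar residual magnitude (L1 norm)
        "R": r,                    # (5)
        "AR": np.abs(r),           # (6)
        "SR": r ** 2,              # (7)
        "NR": r / sigma,           # (8)
        "NSR": r ** 2 / sigma**2,  # (9)
    }
```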

C. One-Class RBF SVM Classification

Consider our set of training samples x_1, ..., x_p and suppose that these samples are drawn from a probability distribution P in the feature space. We apply a non-linear one-class classification algorithm, namely a Radial Basis Function ν-Support Vector Machine [14], [15], in an attempt to estimate the support of this distribution.

This one-class formulation, of the standard two-class SVM procedure, first transforms the feature vector via a non-linear RBF kernel, where the origin is viewed as the sole member of the unknown second class. The one-class SVM gives a function h that outputs +1 in a region that encompasses most of the training samples, and outputs −1 everywhere else.

The objective function to separate the training samples from the origin of our one-class RBF SVM classifier is the following quadratic programming minimisation task:

min_{ω, ξ_i, q}  (1/2)‖ω‖² + (1/(νp)) Σ_{i=1}^{p} ξ_i − q  (10)

subject to: (ω · φ(x_i)) ≥ q − ξ_i, ∀i = 1, ..., p,  (11)

ξ_i ≥ 0, ∀i = 1, ..., p,  (12)

where φ is a kernel mapping of x_i into a dot product space F, q is a bias term, and ν ∈ (0,1) is an upper bound on the fraction of the training samples that are considered to be out-of-class and a lower bound on the fraction of training samples used as Support Vectors (SVs).

Our decision rule, having solved the quadratic programming minimisation problem using Lagrange multipliers, is:

h(x) = sgn( Σ_{i=1}^{p} α_i K(x, x_i) − q ),  (13)

where the α_i are the Lagrange multipliers, and

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0,  (14)

is the RBF kernel with width parameter γ.
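As a sanity check that (13) and (14) line up with a library implementation, the sketch below fits scikit-learn's OneClassSVM (a wrapper around LIBSVM) on placeholder data and reproduces its predictions by hand; dual_coef_ holds the Lagrange multipliers α_i of the support vectors and intercept_ plays the role of −q.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))   # placeholder normal-class feature vectors

nu, gamma = 0.1, 0.05
clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)

# Reproduce (13) by hand: h(x) = sgn( sum_i alpha_i K(x, x_i) - q ).
X_test = rng.normal(size=(5, 10))
sq_dists = ((X_test[:, None, :] - clf.support_vectors_[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)                       # RBF kernel (14)
scores = K @ clf.dual_coef_[0] + clf.intercept_[0]  # intercept_ stores -q
manual = np.where(scores > 0, 1, -1)

assert np.array_equal(manual, clf.predict(X_test))
```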

There is no clear means by which to differentiate between alternative one-class RBF SVM models, in the case of classification accuracy ties, when testing on the validation set. Therefore, we employ an ensemble composed of the winning models and take the majority output as the class label. Formally, our procedure for selecting hyperparameter values for ν and γ is as follows:
1) Sample without replacement 1024 normal samples from the training set.
2) Sample 512 normal samples from the training set and 512 anomalous samples from the validation set.
3) Perform a grid-search over the hyperparameters, and then select all (in the case of ties) hyperparameter tuples that give rise to the highest classification accuracy, for the samples in Step 2.

4) Go back to Step 1 and repeat this process twice more.
We now have a set of hyperparameter tuples {ν, γ}. We will train an ensemble of one-class RBF SVMs (using the set of tuples {ν, γ}) on the entire training set of 2048 samples, such that they are now ready to predict the labels of unseen test samples. We combine the label outputs of each one-class RBF SVM, in the ensemble, by taking the majority output as the predicted label of an unseen test sample, as sketched below.
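The sketch below is one reading of this selection-and-ensemble procedure, again using scikit-learn's LIBSVM-backed OneClassSVM; the candidate grids for ν and γ are left as arguments, since the paper does not list the grid searched.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

def select_hyperparameters(train, validation, nu_grid, gamma_grid):
    """Steps 1)-4): three repeats of a grid-search, keeping all tied winners."""
    winners = set()
    for _ in range(3):
        fit_set = train[rng.choice(len(train), size=1024, replace=False)]
        eval_norm = train[rng.choice(len(train), size=512, replace=False)]
        eval_anom = validation[rng.choice(len(validation), size=512, replace=False)]
        best_acc, best = -1.0, []
        for nu in nu_grid:
            for gamma in gamma_grid:
                clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(fit_set)
                correct = ((clf.predict(eval_norm) == 1).sum()
                           + (clf.predict(eval_anom) == -1).sum())
                acc = correct / (len(eval_norm) + len(eval_anom))
                if acc > best_acc:
                    best_acc, best = acc, [(nu, gamma)]
                elif acc == best_acc:
                    best.append((nu, gamma))
        winners.update(best)
    return winners

def train_ensemble(train, winners):
    """Fit one one-class RBF SVM per winning (nu, gamma) on all 2048 samples."""
    return [OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(train)
            for nu, gamma in winners]

def predict_majority(ensemble, X):
    """Majority vote over the ensemble's +1/-1 outputs (ties broken as normal)."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return np.where(votes.sum(axis=0) >= 0, 1, -1)
```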

To be clear, we have made use of an anomaly set in order to select hyperparameter values for our one-class RBF SVMs. We emphasise that this is a deficiency within our system; however, we do not use the anomaly set to choose features or a two-class feature space boundary. We aim to remove the need for anomalous samples, for classifier hyperparameter optimisation, in future work.

For this study, we utilised the readily available LIBSVM toolbox (Version 3.20) [16], which is an integrated piece of software with built-in distribution estimation (one-class SVM).

IV. EXPERIMENTS AND ANALYSIS

Using our anomaly detection framework outlined in Section III, the experiments that are reported in this paper are as follows:
1) We began by performing a comparison of the features, (2)-(9), across all six test problems specified in Section II.C. From this it will be possible to evaluate which of the features are better suited as representations for the concept of normality. Furthermore, we may assess whether anomalous samples do, in some cases, give rise to abnormal auto-encoder features but normal residuals, whilst others have abnormal residuals but normal auto-encoder features.

2) Lastly, we performed a comparison of two different ways of combining the best performing feature vectors, namely:
• Pre-classifier feature vector concatenation, where we combined the best feature vectors into a single new feature vector.
• Post-classifier fusion, whereby a different classifier model is trained on each of the chosen feature vectors separately and then we combined the results of the various one-class RBF SVMs to determine whether a sample is anomalous or not.

A. Comparison of Features

Our experiments first considered the usefulness of the features (2)-(9). Table II shows the classification accuracy of each ensemble of one-class RBF SVMs, across the six test problems, on the unseen testing sets described in Section II.C. There are several notable observations that can be taken from Table II:
1) The hidden representation is on average the best performing feature vector across all six test problems, and most notably it outperforms the raw input signal.

2) The best performing residual is the normalised signed residual vector, most importantly showing itself to be better than the scalar residual magnitude feature.

3) The normalised signed residual vector has superiority over the hidden representation when the normal class is tight or the anomaly class is diverse, exemplified in the comparison of their scores for Ftd with Fdt. For Ftd the hidden representation scores 94.43% while the normalised signed residual vector scores 98.24%; for Fdt the scores are 92.99% and 73.34%, respectively.

B. Feature Vector Combination

Having established that input features (2), hidden layer features (3), and reconstruction errors, encoded as normalised signed residuals (8), are all sometimes effective, but each on different problems, we then considered the most effective means of combining them. We compared pre-classifier feature vector concatenation and post-classifier fusion using a combination of these three features.

1) Pre-Classifier Feature Vector Concatenation (PRE):
1) Create combined feature vectors for the training data, comprised of the features derived from (2), (3), and (8), by concatenation.
2) Scale feature dimensions so that each has zero-mean and unit-variance across the training set.
3) Retrieve the predicted class labels for the testing set given by the ensemble of one-class RBF SVMs trained on normal data, with the combined feature vectors, and take the mode of the predictions, for each sample, as the label (see the sketch after this list).
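A minimal sketch of PRE follows, reusing the extract_features helper and the train_ensemble and predict_majority helpers from the earlier sketches; note that the scaling statistics are estimated on the training set only and then applied unchanged to the test features.

```python
import numpy as np

def concat_features(X, W1, b1, W2, b2, sigma):
    """Concatenate INP (2), HR (3) and NR (8) for each row of X."""
    rows = []
    for x in X:
        f = extract_features(x, W1, b1, W2, b2, sigma)  # Section III.B sketch
        rows.append(np.concatenate([f["INP"], f["HR"], f["NR"]]))
    return np.array(rows)

def fit_scaler(train_feats):
    """Zero-mean, unit-variance scaling statistics from the training set only."""
    mu, sd = train_feats.mean(axis=0), train_feats.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant feature dimensions
    return mu, sd

# Usage: scale, train the ensemble of Section III.C, take the mode of its votes.
# mu, sd = fit_scaler(train_feats)
# ensemble = train_ensemble((train_feats - mu) / sd, winners)
# labels = predict_majority(ensemble, (test_feats - mu) / sd)
```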

2) Post-Classifier Fusion (POST):
1) Retrieve the predicted class labels for the testing set given by the ensemble of one-class RBF SVMs trained on normal data with feature vectors of the form (2). Then compute the mean label, for each sample, given the predictions from the ensemble.
2) Repeat Step 1, but this time for feature vectors of the form (3).
3) Repeat Step 1, but this time for feature vectors of the form (8).
4) Compute the mean of the labels given by Steps 1-3, then label a sample as normal if the output is positive, otherwise label as anomalous (see the sketch after this list).
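A corresponding sketch of POST is given below; ensembles_by_feature and features_by_type are hypothetical containers holding, for each of the three feature types, the trained ensemble from Section III.C and the row-aligned feature matrices.

```python
import numpy as np

def post_classifier_fusion(ensembles_by_feature, features_by_type):
    """Fuse the +1/-1 ensemble outputs for the INP (2), HR (3) and NR (8) features.

    ensembles_by_feature: {"INP": [clf, ...], "HR": [...], "NR": [...]}
    features_by_type:     {"INP": X_inp, "HR": X_hr, "NR": X_nr}, row-aligned.
    """
    mean_labels = []
    for name in ("INP", "HR", "NR"):                  # Steps 1)-3)
        votes = np.stack([clf.predict(features_by_type[name])
                          for clf in ensembles_by_feature[name]])
        mean_labels.append(votes.mean(axis=0))        # mean label per sample
    fused = np.mean(mean_labels, axis=0)              # Step 4)
    return np.where(fused > 0, 1, -1)                 # +1 normal, -1 anomalous
```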

The results in Table III show post-classifier fusion to be superior to using any of the stand-alone feature vectors: (2), (3), or (8). In contrast, pre-classifier feature vector concatenation is on average the worst performing approach, receiving the lowest classification accuracy.

TABLE II: TEST CLASSIFICATION ACCURACY (%) ON THE TEST PROBLEMS FOR COMPARISON OF FEATURES (* INDICATES THE HIGHEST ACCURACY WITHIN A ROW)

Test Problem   INP     HR      SRM     R       AR      SR      NR      NSR
Ftd            98.24   94.43   95.02   98.24   98.93   98.54   98.24   99.22*
Fdt            80.37   92.99*  81.15   55.08   48.14   50.78   73.34   48.44
Mtt            91.11   91.21   84.96   86.91   88.77   83.59   91.80*  90.43
Mdt            86.23*  85.06   70.70   79.69   81.35   78.71   83.40   80.08
Mtd            85.74   88.77*  87.60   50.00   50.00   86.91   86.33   83.01
Mdd            78.52   75.29   80.66*  50.00   50.00   70.12   75.59   74.80
Average        86.70   87.96*  83.35   69.99   69.53   78.11   84.78   79.33

TABLE III: TEST CLASSIFICATION ACCURACY (%) ON THE TEST PROBLEMS FOR FEATURE VECTOR COMBINATION (* INDICATES THE HIGHEST CLASSIFICATION ACCURACY WITHIN A ROW)

Test Problem   INP     HR      SRM     NR      PRE     POST
Ftd            98.24*  94.43   95.02   98.24*  98.24*  97.66
Fdt            80.37   92.99*  81.15   73.34   51.86   89.55
Mtt            91.11   91.21   84.96   91.80   92.19   92.29*
Mdt            86.23   85.06   70.70   83.40   82.91   87.01*
Mtd            85.74   88.77   87.60   86.33   86.82   89.75*
Mdd            78.52   75.29   80.66*  75.59   72.95   78.22
Average        86.70   87.96   83.35   84.78   80.83   89.08*

V. SUMMARY

In this empirical study, on the detection of anomalous data using auto-encoders, we have conducted several experiments on a range of anomaly detection test problems. Our experiments compared a selection of different feature vectors derived from a sparse auto-encoder feed-forward neural network. The empirical results appear to support our hypothesis that there is indeed a better way to use residual errors than simply computing the magnitude, and this is most apparent when the normalised signed residual is employed. Furthermore, the results suggest that the hidden layer representation, as a stand-alone feature vector, is more than capable of characterising the fundamental attributes of normality. Its competence is shown through it having the highest average recognition rate amongst the stand-alone feature vectors. The robustness of the hidden layer representation is best illustrated in situations where the auto-encoder does not struggle to reconstruct anomalous images, and so gives rise to low residuals. For instance, in Fdt we have a diverse normal class of high-complexity non-empty cargo containers and an anomaly class of empty cargo containers. In this test problem the auto-encoder is able to reconstruct the low-complexity empty cargo containers, despite having never been trained to do so. However, the units of the hidden representation are activated in an abnormal fashion, and as such we are able to identify a greater number of anomalous images, as opposed to using


the residuals to make a prediction. This differs from the converse test problem, Ftd, where the class labels have been swapped. We see in this case a performance increase across all the feature vectors, since the auto-encoder finds it difficult to encode a more-complex non-empty cargo container image, having been trained on empty cargo containers, which are relatively homogeneous in appearance. Nonetheless, the normalised signed residual, the hidden representation and the input signal work effectively when all three are used in tandem and combined using a post-classifier fusion. By doing this, we are able to benefit from their individual anomaly detection capabilities; that is, some anomalies may only be apparent in one of those constituents.

Our aim in this work was to consider the problem of anomaly detection without the knowledge of the anomaly class. We note that we deviated from this aim at one stage: we make use of an anomaly validation set to select hyperparameters for one-class RBF SVMs. Whilst the use of a validation set is not ideal, it is a component we aim to remove in future work so as to move towards a truly unsupervised system.

ACKNOWLEDGMENT

J. T. A. Andrews thanks Emma F. Shapiro for her valuable support and assistance in reviewing this work.

REFERENCES

[1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, March 2013.
[2] N. Japkowicz, C. Myers, and M. Gluck, "A novelty detection approach to classification," in Proc. International Joint Conference on Artificial Intelligence, 1995, pp. 518-523.
[3] J. M. Ko, Y. Q. Ni, J. Y. Wang, Z. G. Sun, and X. T. Zhou, "Studies of vibration-based damage detection of three cable-supported bridges in Hong Kong," in Proc. International Conference on Engineering and Technological Sciences, China, 2000, pp. 105-112.
[4] L. Manevitz and M. Yousef, "One-class document classification via neural networks," Neurocomputing, vol. 70, no. 7, pp. 1466-1481, 2007.
[5] H. Sohn, K. Worden, and C. R. Farrar, "Novelty detection under changing environmental conditions," in Proc. SPIE 8th Annual International Symposium on Smart Structures and Materials, 2001, pp. 108-118.
[6] R. J. Streifel, R. J. Marks II, M. A. El-Sharkawi, and I. Kerszenbaum, "Detection of shorted-turns in the field winding of turbine-generator rotors using novelty detectors-development and field test," IEEE Transactions on Energy Conversion, vol. 11, no. 2, pp. 312-317, 1996.
[7] C. Surace, K. Worden, and G. Tomlinson, "A novelty detection approach to diagnose damage in a cracked beam," Proceedings of SPIE, vol. 3089, pp. 947-953, 1997.
[8] C. Surace and K. Worden, "A novelty detection method to diagnose damage in structures: an application to an offshore platform," in Proc. Eighth International Conference of Off-shore and Polar Engineering, 1998, vol. 4, pp. 64-70.
[9] K. Worden, "Structural fault detection using a novelty measure," Journal of Sound and Vibration, vol. 201, no. 1, pp. 85-101, 1997.
[10] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng, "On random weights and unsupervised feature learning," in Proc. 28th International Conference on Machine Learning, 2011, pp. 1089-1096.
[11] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proc. International Conference on Artificial Intelligence and Statistics, 2011, pp. 215-223.
[12] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Advances in Neural Information Processing Systems, 2014, pp. 2654-2662.
[13] Y. LeCun, C. Cortes, and C. J. C. Burges. (1998). The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[14] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, "Support vector method for novelty detection," NIPS, vol. 12, pp. 582-588, 1999.
[15] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.
[16] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, 2011.

Jerone T. A. Andrews has an MSci (Hons) in mathematics from King's College London, UK and an MRes in security science from University College London, UK. He is currently working towards a PhD in applied mathematics at University College London, UK, jointly supervised by the Departments of Computer Science and Statistical Science. His main topics of interest are anomaly detection in computer vision, similarity learning, and density estimation.