An Evolutionary Method for Training Autoencoders for Deep Learning Networks


Transcript of An Evolutionary Method for Training Autoencoders for Deep Learning Networks

Page 1: Title slide

An Evolutionary Method for Training Autoencoders for Deep Learning Networks
Master's Thesis Defense
Sean Lander, Master's Candidate
Advisor: Yi Shang
University of Missouri, Department of Computer Science
University of Missouri, Informatics Institute

Page 2: Agenda

- Overview
- Background and Related Work
- Methods
- Performance and Testing
- Results
- Conclusion and Future Work

Page 3: Agenda - Overview

Page 4: Overview - Deep Learning classification/reconstruction

- Since 2006, Deep Learning Networks (DLNs) have changed the landscape of classification problems
- Strong ability to create and utilize abstract features
- Easily lends itself to GPU and distributed systems
- Does not require labeled data - VERY IMPORTANT
- Can be used for feature reduction and classification

Page 5: Overview - Problem and proposed solution

- Problems with DLNs:
  - Costly to train with large data sets or high feature spaces
  - Local minima are systemic in Artificial Neural Networks
  - Hyper-parameters must be hand-selected
- Proposed solutions:
  - An evolutionary approach with a local search phase
    - Increased chance of finding the global minimum
    - Optimizes structure based on abstracted features
  - Data partitions based on population size (large data only)
    - Reduced training time
    - Reduced chance of overfitting

Page 6: Agenda - Background and Related Work

Page 7: Background - Perceptrons

- Started with the Perceptron in the late 1950s
- Only capable of linear separability
- Failed on XOR (demonstrated below)
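
A minimal sketch (not from the slides) of that failure: the standard perceptron learning rule cycles forever on XOR because no single line separates the two classes.

```python
# A single perceptron cannot learn XOR: the classes are not linearly
# separable, so the learning rule never converges to zero error.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                    # XOR labels

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(100):                      # perceptron learning rule
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)
        w += lr * (yi - pred) * xi            # update only on mistakes
        b += lr * (yi - pred)

preds = (X @ w + b > 0).astype(int)
print(preds, "accuracy:", (preds == y).mean())  # never reaches 1.0
```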

Page 8: Background - Artificial Neural Networks (ANNs)

- ANNs went out of favor until the Multilayer Perceptron (MLP) was introduced
  - Pro: non-linear classification
  - Con: time-consuming to train
- Advance in training: backpropagation
  - Increased training speeds
  - Limited to shallow networks
  - Error propagation diminishes as the number of layers increases

Page 9: Background - Backpropagation using gradient descent

- Proposed in 1986, based on classification error
- Given m training samples, each sample's error is calculated individually
- The total error is then computed over all m training samples (standard forms reconstructed below)
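
The equations on this slide were images and did not survive transcription. A standard squared-error formulation consistent with the surrounding text (a reconstruction, not recovered from the slide) is:

```latex
% Per-sample error for training sample (x^{(i)}, y^{(i)}):
J\bigl(W,b;\,x^{(i)},y^{(i)}\bigr) = \tfrac{1}{2}\,\bigl\lVert h_{W,b}\bigl(x^{(i)}\bigr) - y^{(i)} \bigr\rVert^{2}

% Total error over all m training samples:
J(W,b) = \frac{1}{m}\sum_{i=1}^{m} J\bigl(W,b;\,x^{(i)},y^{(i)}\bigr)
```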

Page 10: Background - Deep Learning Networks (DLNs)

- Allow deep networks with multiple layers
- Layers are pre-trained using unlabeled data
- Layers are "stacked" and fine-tuned (sketched below)
- Minimizes error degradation for deep neural networks (many layers)
- Still costly to train
- Hyper-parameters are selected manually
- Reaches a local, not global, minimum
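
A minimal sketch of that layer-wise pretraining, assuming sigmoid units, squared error, untied weights, and no biases; an illustration of the idea, not the thesis implementation.

```python
# Greedy layer-wise pretraining: train an autoencoder on the data,
# then train the next one on its hidden codes, and so on.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=50, seed=0):
    """One autoencoder trained by plain gradient descent; returns encoder."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], n_hidden))   # encoder
    W2 = rng.normal(0, 0.5, (n_hidden, X.shape[1]))   # decoder
    for _ in range(epochs):
        H = sigmoid(X @ W1)                   # hidden code
        Xr = sigmoid(H @ W2)                  # reconstruction
        d2 = (Xr - X) * Xr * (1 - Xr)         # output-layer delta
        d1 = (d2 @ W2.T) * H * (1 - H)        # hidden-layer delta
        W2 -= lr * H.T @ d2 / len(X)
        W1 -= lr * X.T @ d1 / len(X)
    return W1

X = np.random.default_rng(1).random((100, 64))        # toy unlabeled data
weights, data = [], X
for size in (32, 16):                         # two stacked layers
    W = train_autoencoder(data, size)
    weights.append(W)
    data = sigmoid(data @ W)                  # feed codes to the next layer
# `weights` now initializes a deep network for supervised fine-tuning.
```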

Page 11: Background - Autoencoders for reconstruction

- Autoencoders can be used for feature reduction and clustering
- "Classification error" is the ability to reconstruct the sample input
- Abstracted features (the output from the hidden layer) can be used to replace raw input for other techniques

Page 12: Related Work - Evolutionary and genetic ANNs

- First use of Genetic Algorithms (GAs) in 1989
  - A two-layer ANN on a small data set
  - Tested multiple types of chromosomal encodings and mutation types
- The late 1990s and early 2000s introduced other techniques:
  - Multi-level mutations and mutation priority
  - Addition of a local search in each generation
  - Inclusion of hyper-parameters as part of the mutation
  - The issue of competing conventions starts to appear: two ANNs produce the same results by sharing the same nodes in a permuted order

Page 13: Related Work - Hyper-parameter selection for DLNs

- Most of the work explored newer technologies and methods such as GPU and distributed (MapReduce) training
- Improved variants of backpropagation, such as Conjugate Gradient and Limited-Memory BFGS, were tested under different conditions
- Most conclusions pointed toward manual parameter selection via trial and error

Page 14: Agenda - Methods

Page 15: Method 1 - Evolutionary Autoencoder (EvoAE)

- IDEA: an autoencoder's power is in its feature abstraction, the hidden node output
- Training many AEs will create more potential abstracted features
- The best AEs will contain the best features
- Joining these features should create a better AE (see the sketch below)
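
One plausible way to express that joining step in code (an assumption for illustration; the thesis encoding may differ): treat each hidden feature as a column of the encoder matrix and let crossover mix columns from two parents.

```python
# Hypothetical crossover: the child inherits each hidden feature
# (an encoder column) from one of its two parents at random.
import numpy as np

def crossover(W_a, W_b, rng):
    take_b = rng.random(W_a.shape[1]) < 0.5   # one coin flip per feature
    child = W_a.copy()
    child[:, take_b] = W_b[:, take_b]
    return child

rng = np.random.default_rng(0)
parent_a = rng.normal(0, 0.5, (64, 32))       # two trained 64->32 encoders
parent_b = rng.normal(0, 0.5, (64, 32))
child = crossover(parent_a, parent_b, rng)    # mixes features of both
```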

Page 16: Method 1 - Evolutionary Autoencoder (EvoAE)

[Figure: the EvoAE cycle on a population of autoencoders (x -> h -> x'). Stages shown: initialization, local search, crossover, and mutation, with hidden nodes (A1-A4, B1-B3, C2) exchanged between individuals.]

Page 17: Method 1A - Distributed learning and mini-batches

- Training time of the generic EvoAE grows linearly with the size of the population
- ANN training time increases drastically with data size
- To combat this, mini-batches can be used: each AE is trained against one batch, then updated (sketched below)
- Batch size << total data size
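
A small sketch of mini-batching with illustrative helper names (not the thesis code):

```python
# Mini-batch updates: each step sees a small random slice of the data
# rather than the full set (batch_size << len(X)).
import numpy as np

def minibatches(X, batch_size, rng):
    idx = rng.permutation(len(X))             # shuffle once per pass
    for start in range(0, len(X), batch_size):
        yield X[idx[start:start + batch_size]]

X = np.random.default_rng(0).random((6000, 784))
for batch in minibatches(X, 100, np.random.default_rng(1)):
    pass  # one training update per autoencoder on `batch` goes here
```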

Page 18: Method 1A - Distributed learning and mini-batches

- EvoAE lends itself to distributed systems
- Data storage becomes an issue, since each node must duplicate the data

Per-generation pipeline, run over Batch 1 ... Batch N (a code sketch follows):

- Train: forward propagation, backpropagation
- Rank: calculate error, sort
- GA: crossover, mutate
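
A compact sketch of that Train -> Rank -> GA cycle using simplified tied-weight autoencoders and stand-in operators; everything here is an assumption for illustration, not the thesis implementation.

```python
# One EvoAE generation: Train (elided) -> Rank -> GA.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def recon_error(W, X):
    """Tied-weight reconstruction error: encode with W, decode with W.T."""
    H = sigmoid(X @ W)
    return np.mean((sigmoid(H @ W.T) - X) ** 2)

def crossover(W_a, W_b, rng):                 # as in the earlier sketch
    take_b = rng.random(W_a.shape[1]) < 0.5
    child = W_a.copy()
    child[:, take_b] = W_b[:, take_b]
    return child

def mutate(W, rng, rate=0.1):                 # jitter a fraction of weights
    mask = rng.random(W.shape) < rate
    return W + mask * rng.normal(0, 0.1, W.shape)

def one_generation(pop, X_val, rng):
    # Train: each AE's local-search phase on its own batch would run here.
    pop = sorted(pop, key=lambda W: recon_error(W, X_val))   # Rank
    keep = pop[:len(pop) // 2]                # GA: best half survives
    kids = [mutate(crossover(keep[rng.integers(len(keep))],
                             keep[rng.integers(len(keep))], rng), rng)
            for _ in range(len(pop) - len(keep))]
    return keep + kids

rng = np.random.default_rng(0)
X_val = rng.random((50, 64))                  # toy validation data
pop = [rng.normal(0, 0.5, (64, 32)) for _ in range(6)]
pop = one_generation(pop, X_val, rng)
```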

Page 19: Method 2 - EvoAE Evo-batches

- IDEA: when data is large, small batches can be representative
- Prevents overfitting, as the nodes being trained are almost always introduced to new data
- Scales well with large amounts of data, even when parallel training is not possible
- Works well on limited-memory systems: increasing the population size reduces the data per batch
- Trains large populations quickly, at a cost equivalent to training a single autoencoder with traditional methods

Page 20: Method 2 - EvoAE Evo-batches

[Figure: the original data is split into partitions Data A-Data D, one per population member; each generation runs local search, crossover, and mutation, and the partition assignment rotates so every autoencoder eventually trains on every slice.]
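
A sketch of that partitioning (helper names are hypothetical; the rotation scheme is inferred from the figure):

```python
# Evo-batches: one data partition per population member, with the
# assignment rotating each generation.
import numpy as np

X = np.random.default_rng(0).random((6000, 784))
pop_size = 30
parts = np.array_split(X, pop_size)           # one slice per autoencoder

for gen in range(3):
    # AE i trains on partition (i + gen) % pop_size this generation,
    # so each AE keeps meeting data it has not yet seen.
    batches = [parts[(i + gen) % pop_size] for i in range(pop_size)]
```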

Page 21: Agenda - Performance and Testing

Page 22: Performance and Testing - Hardware and testing parameters

- Lenovo Y500 laptop
- Intel i7, 3rd generation, 2.4 GHz
- 12 GB RAM
- All weights randomly initialized to N(0, 0.5)

Per-dataset parameters:

Parameter        Wine   Iris   Heart Disease   MNIST
Hidden Size      32     32     12              200
Hidden Std Dev   NULL   NULL   NULL            80
Hidden +/-       16     16     6               NULL
Mutation Rate    0.1    0.1    0.1             0.1

Parameter defaults:

Learning Rate    0.1
Momentum         2
Weight Decay     0.003
Population Size  30
Generations      50
Epochs/Gen       20
Train/Validate   80/20
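
The same settings as a merged run configuration (names are illustrative only, not the thesis code); per-dataset values override the shared defaults:

```python
# Defaults plus per-dataset overrides, merged into one configuration.
DEFAULTS = dict(learning_rate=0.1, momentum=2, weight_decay=0.003,
                population_size=30, generations=50, epochs_per_gen=20,
                train_validate=(80, 20), mutation_rate=0.1)
DATASETS = {
    "wine":          dict(hidden_size=32,  hidden_plus_minus=16),
    "iris":          dict(hidden_size=32,  hidden_plus_minus=16),
    "heart_disease": dict(hidden_size=12,  hidden_plus_minus=6),
    "mnist":         dict(hidden_size=200, hidden_std_dev=80),
}
config = {**DEFAULTS, **DATASETS["wine"]}     # one experiment's settings
```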

Page 23: Performance and Testing - Baseline

- The baseline is a single AE with 30 random initializations
- Two learning rates create two baseline measurements:
  - Base learning rate
  - Base learning rate * 0.1

Page 24: Performance and Testing - Data partitioning

- Three data partitioning methods were used:
  - Full data
  - Mini-batch
  - Evo-batch

Page 25: Performance and Testing - Post-training configurations

- Post-training was run in the following ways:
  - Full data (All)
  - Batch data (Batch)
  - None
- All result sets below use the Evo-batch configuration

Page 26: Agenda - Results

Page 27: Results - Parameters review

Parameter        Wine   MNIST
Hidden Size      32     200
Hidden Std Dev   NULL   80
Hidden +/-       16     NULL
Mutation Rate    0.1    0.1

Parameter defaults:

Learning Rate    0.1
Momentum         2
Weight Decay     0.003
Population Size  30
Generations      50
Epochs/Gen       20
Train/Validate   80/20

Page 28: Results - Datasets

- UCI Wine dataset: 178 samples, 13 features, 3 classes
- Reduced MNIST dataset: 6000/1000 and 24k/6k training/testing samples, 784 features, 10 classes (digits 0-9)

Page 29: Results - Small datasets: UCI Wine

[Results chart. Parameters: Hidden Size 32, Hidden Std Dev NULL, Hidden +/- 16, Mutation Rate 0.1]

Page 30: Results - Small datasets: UCI Wine

- Best error-to-speed: Baseline 1
- Best overall error: Full data All
- Full data is fast on small-scale data
- Evo-batch and mini-batch are not good on small-scale data

Page 31: Results - Small datasets: MNIST 6k/1k

[Results chart. Parameters: Hidden Size 200, Hidden Std Dev 80, Hidden +/- NULL, Mutation Rate 0.1]

Page 32: Results - Small datasets: MNIST 6k/1k

- Best error-to-time: Mini-batch None
- Best overall error: Mini-batch Batch
- Full data slows exponentially on large-scale data
- Evo-batch and mini-batch stay close to baseline speed

Page 33: Results - Medium datasets: MNIST 24k/6k

[Results chart. Parameters: Hidden Size 200, Hidden Std Dev 80, Hidden +/- NULL, Mutation Rate 0.1]

Page 34: Results - Medium datasets: MNIST 24k/6k

- Best error-to-time: Evo-batch None
- Best overall error: Evo-batch Batch or Mini-batch Batch
- Full data was too slow to run on this dataset
- EvoAE with a population of 30 trains as quickly as a single baseline AE when using Evo-batch

Page 35: Agenda - Conclusion and Future Work

Page 36: Conclusions - Good for large problems

- Traditional methods are still the preferred choice for small and toy problems
- EvoAE with Evo-batch produces effective and efficient feature reduction given a large volume of data
- EvoAE is robust against poorly chosen hyper-parameters, specifically the learning rate

Page 37: Future Work

- Immediate goals:
  - Transition to a distributed system, MapReduce-based or otherwise
  - Harness GPU technology for increased speeds (~50% in some cases)
- Long-term goals:
  - Open the system for use by novices and non-programmers
  - Make the system easy to use and transparent to the user, for both modification and training purposes

Page 38: Thank you

Page 39: Background - Backpropagation with weight decay

- The cost is prone to overfitting, so a weight decay term λ is added
- The new cost is then used to update the weights and biases, given some learning rate α (standard forms reconstructed below)
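
The equations here were also lost in transcription; the standard L2 weight-decay cost and gradient-descent update they most likely showed (a reconstruction, stated as an assumption) are:

```latex
% Squared-error cost with L2 weight decay (lambda):
J(W,b) = \frac{1}{m}\sum_{i=1}^{m} \tfrac{1}{2}\,\bigl\lVert h_{W,b}\bigl(x^{(i)}\bigr) - y^{(i)} \bigr\rVert^{2}
       + \frac{\lambda}{2}\sum_{l} \bigl\lVert W^{(l)} \bigr\rVert_{2}^{2}

% Gradient-descent update with learning rate alpha:
W^{(l)} \leftarrow W^{(l)} - \alpha\,\frac{\partial J(W,b)}{\partial W^{(l)}}
```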

Page 40: Background - Conjugate Gradient Descent

- Plain gradient descent can become stuck in a loop, so a momentum term β is added
- This adds memory to the update equation, as previous updates are reused (reconstructed below)
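
A momentum update consistent with this description (a reconstruction; the slide's own equation did not survive):

```latex
% Momentum: each step remembers a fraction beta of the previous step.
\Delta W_{t} = -\,\alpha\,\frac{\partial J(W,b)}{\partial W} + \beta\,\Delta W_{t-1},
\qquad
W_{t+1} = W_{t} + \Delta W_{t}
```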

Page 41: Background - Architecture and hyper-parameters

- Architecture and hyper-parameter selection is usually done through trial and error
- Manually optimized and updated by hand
- Dynamic learning rates can be implemented to correct for sub-optimal learning rate selection (one simple rule sketched below)
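
One simple dynamic learning-rate rule, shown for illustration (an assumption, not the thesis method): grow the step while the error falls, cut it when the error rises.

```python
# "Bold driver" style adjustment: reward progress, punish overshooting.
def adjust_lr(lr, prev_error, error, up=1.05, down=0.5):
    return lr * (up if error < prev_error else down)

lr, prev = 0.1, float("inf")
for error in [0.9, 0.7, 0.6, 0.65, 0.5]:      # mock per-epoch errors
    lr = adjust_lr(lr, prev, error)
    prev = error
print(round(lr, 4))                           # 0.0608
```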

Page 42: Results - Small datasets: UCI Iris

- The UCI Iris dataset has 150 samples, 4 features, and 3 classes
- Best error-to-speed: Baseline 1
- Best overall error: Full data None

Parameters: Hidden Size 32, Hidden Std Dev NULL, Hidden +/- 16, Mutation Rate 0.1

Page 43: Results - Small datasets: UCI Heart Disease

- The UCI Heart Disease dataset has 297 samples, 13 features, and 5 classes
- Best error-to-time: Baseline 1
- Best overall error: Full data None

Parameters: Hidden Size 12, Hidden Std Dev NULL, Hidden +/- 6, Mutation Rate 0.1