PARALLELIZATION OF ARTIFICIAL NEURAL NETWORKS Joe Bradish CS5802 Fall 2015

Page 1

PARALLELIZATION OF ARTIFICIAL NEURAL NETWORKS
Joe Bradish

CS5802 Fall 2015

Page 2

BASICS OF ARTIFICIAL NEURAL NETWORKS
What is an Artificial Neural Network (ANN)?

What makes up a neuron?

How is “learning” modelled in ANNs?

Page 3

STRUCTURE OF A NEURAL NETWORK

A neural network is a collection of interconnected neurons that compute and generate impulses

Specific parts include neurons, synapses, and activation functions

An artificial neural network is a mathematical model based on the biological neural networks found in animals’ brains.

Page 4

BASIC STRUCTURE OF A NEURON

• There is an input vector containing {x1, x2, … , xn} and an associated vector of weights {w1, w2, … , wn}.

• The weighted sum of the inputs (the dot product of the input and weight vectors) is computed and passed into an activation function.

• The activation function maps this sum to some value, generally in the range [-1, 1], as with the step activation function shown. This value is then considered the output of the neuron.
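A minimal sketch of this neuron computation in Python (the function names and sample values are illustrative, not taken from the slides):

```python
# Single artificial neuron: weighted sum of inputs followed by a step activation
# that maps the result into {-1, 1}.

def step_activation(s):
    """Step function: -1 for negative sums, +1 otherwise."""
    return 1.0 if s >= 0.0 else -1.0

def neuron_output(inputs, weights):
    """Weighted sum of the input vector, passed through the activation function."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return step_activation(s)

# Example: a 3-input neuron; weighted sum = 0.2 - 0.3 + 0.2 = 0.1 -> output +1
print(neuron_output([0.5, -1.0, 2.0], [0.4, 0.3, 0.1]))
```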

Page 5

TRAINING A NEURAL NETWORK

To properly train a neural network, the weights must be “tuned” to model the goal function as closely as possible.

The “goal” function is the function that maps the input data to the output data in our training set.

Training a neural network is by far the most costly step in the majority of scenarios.

Google has reported training times <2 days for certain problems and network sizes.

Once trained, however, new items can be classified very quickly.

Some popular training options:

Backpropagation (used in the majority of cases)

Genetic algorithms with simulated annealing

Hebbian learning

A combination of different methods in a “Committee of Machines”

Page 6

BACKPROPAGATION

Most popular training method

Works by reducing error on the training set

Requires many training examples to get the error low

Uses gradient descent on the mean squared error

Partial derivatives are used to determine which neuron/weight to blame for parts of the error

Page 7

Backward pass is done through backpropagation
• Uses the chain rule to calculate the partial derivatives
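A compact sketch of the forward pass and the chain-rule backward pass for a one-hidden-layer network, assuming NumPy is available; the network size, learning rate, and variable names are illustrative rather than taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, lr=0.1):
    # Forward pass
    h = sigmoid(W1 @ x)        # hidden activations
    y_hat = sigmoid(W2 @ h)    # network output
    err = y_hat - y            # derivative of 0.5 * squared error w.r.t. y_hat

    # Backward pass: the chain rule assigns each weight its share of the blame
    delta_out = err * y_hat * (1.0 - y_hat)          # error at the output pre-activations
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)   # error propagated back to the hidden layer

    # Gradient descent update on the squared error
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2

# Example: 3 inputs, 4 hidden neurons, 2 outputs
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, y = np.array([0.2, 0.7, -0.1]), np.array([1.0, 0.0])
W1, W2 = backprop_step(x, y, W1, W2)
```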

Page 8

Underlying operations are embarrassingly parallel, but many problems still remain

Backpropagation, communication, and computation issues must all be considered when scaling neural networks

Page 9

PROBLEMS WITH SCALING BACKPROPAGATION

Requires neurons of one layer to be fully connected to the neurons of the next layer

Lots of communication required

Gradient descent is prone to getting stuck in local optima

Requires many iterations to reduce the error to an acceptable rate

Training data set sizes are very large

Rule of thumb for the error:

Training set size should be roughly the number of weights divided by the permitted classification error rate

10% error rate = 10x the number of weights, 1% = 100x, etc.
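A rough reading of this rule of thumb as a worked example (the 31,000-weight figure is borrowed from the serial network described later in the slides):

```python
# Rule of thumb: training set size ≈ number of weights / permitted error rate
num_weights = 31_000
error_rate = 0.10                              # 10% permitted classification error
training_examples = num_weights / error_rate   # 310,000 examples, i.e. 10x the weights
print(training_examples)
```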

Page 10

COMPUTATIONAL ISSUES IN SCALING ANNS

Main operation is matrix multiplication: an N-node layer requires N² scalar multiplications and N sums of N numbers

Requires a good multiply or multiply-and-add function

Activation function: often the sigmoid is used, f(x) = 1 / (1 + e^(-x))

Has to be approximated efficiently
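A small sketch of this per-layer computation, assuming NumPy; the hard-sigmoid variant shown is just one common example of a cheap piecewise-linear approximation, not something prescribed by the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # exact f(x) = 1 / (1 + e^-x)

def sigmoid_approx(x):
    # A common piecewise-linear ("hard sigmoid") approximation, cheaper than exp
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def layer_forward(W, inputs):
    # For an N-node layer fully connected to N inputs, W @ inputs performs
    # N^2 scalar multiplications and N sums of N numbers each.
    return sigmoid(W @ inputs)

N = 4
W = np.full((N, N), 0.25)
inputs = np.arange(N, dtype=float)        # [0, 1, 2, 3]
print(layer_forward(W, inputs))           # every pre-activation is 0.25 * 6 = 1.5
```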

Page 11

COMMUNICATION ISSUES IN SCALING ANNS

High degree of connectivity

Large data flows

Structure and bandwidth are very important

Broadcasts and ring topologies are often used because of the heavy communication requirements

More processors do not mean faster computation in many cases

Page 12

TWO KEY METHODOLOGIES

Model dimension
• One model, but multiple workers train individual parts
• High amount of communication: need to synchronize at the edges
• Efficient when the computation is heavy per neuron
• Datasets where each data point contains many attributes

Data dimension
• Different workers train on completely different sets of data
• Also a high amount of communication: need to synchronize parameters/weights to ensure a consistent model
• Efficient when each weight needs a high amount of computation
• Large datasets where each data point only contains a few attributes

Page 13

Example of splitting on the data dimension
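A hedged sketch of what splitting on the data dimension can look like: each worker computes a gradient on its own shard of the data, and the gradients are then averaged so all workers keep a consistent set of weights. The linear model, shard count, and learning rate are illustrative only, and the workers are simulated sequentially here rather than run on separate processors.

```python
import numpy as np

def local_gradient(weights, shard_x, shard_y):
    # Gradient of the mean squared error for a simple linear model (illustrative)
    preds = shard_x @ weights
    return shard_x.T @ (preds - shard_y) / len(shard_y)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
weights = np.zeros(5)
num_workers = 4

for step in range(10):
    # Split the data across workers (the "data dimension")
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = [local_gradient(weights, sx, sy) for sx, sy in shards]
    # Synchronization step: average the gradients and update the shared weights
    weights -= 0.1 * np.mean(grads, axis=0)
```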

Page 14

SPANN (SCALABLE PARALLEL ARTIFICIAL NEURAL NETWORK)

Inspired by human brain’s ability to communicate between groups of neurons without fully connected paths

Focused on parallelizing the model dimension

Uses MPI library

Reduces the need for communication between every neuron in consecutive layers of a neural network

Only boundary values are communicated between “ghost” neurons
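A hedged illustration of this boundary-only communication pattern using mpi4py (assumed here, since the slides only say the MPI library is used); this is not SPANN's actual code, just a sketch of exchanging ghost values between neighbouring ranks that each own a slice of a layer:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.random.rand(16)       # this rank's slice of a layer's activations
left = (rank - 1) % size         # neighbouring ranks arranged in a ring
right = (rank + 1) % size

# Exchange only the boundary values: send our edges to the neighbours and
# receive their edges as "ghost" values, instead of communicating every neuron.
ghost_from_left = comm.sendrecv(local[-1], dest=right, source=left)
ghost_from_right = comm.sendrecv(local[0], dest=left, source=right)

# ghost_from_left / ghost_from_right can now feed the boundary neurons' sums
# without a fully connected communication pattern between layers.
```

Run under an MPI launcher such as mpiexec with several ranks; only two values cross each process boundary per exchange, which is the kind of communication reduction described above.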

Page 15

BIOLOGICAL INSPIRATION

Neocortex is the part of the brain most commonly associated with intelligence

Columnar structure with an estimated 6 layers

Page 16

SPANN CONT.

Recall serial backpropagation, compared here with parallel backpropagation:

• L is the number of layers, including input/output layers

• Nproc is the number of processors being used

• As shown by the first box, every input is sent to every processor

• Each processor only holds Nhidden / Nproc hidden neurons and Nout / Nproc output neurons

• Divide the total number of weights by the number of processors to get the weights per processor

Example comparison of a 3-layer network:
• Serial ANN
  • 200 input, 48 output, 125 hidden
  • (200 + 48) × 125 = 31,000 weights need to be trained
• Using SPANN in a parallel ANN
  • 200 input, 48 output, 120 hidden
  • 6 layers, 8 processors
  • 30,280 weights need to be trained, but only 3,785 per processor
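A quick check of the arithmetic in this comparison, using only the numbers given on the slide:

```python
serial_weights = (200 + 48) * 125      # 31,000 weights in the serial ANN
spann_total = 30_280                   # total weights in the SPANN configuration
per_processor = spann_total // 8       # 8 processors -> 3,785 weights each
print(serial_weights, per_processor)
```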

Page 17
Page 18

PERFORMANCE COMPARISON

• 37,890 weights on a serial ANN took 1,313 seconds to complete training, compared to 30,240 weights taking 842 seconds
• There is significant slowdown in the serial version
• A resolution of 8 computes ~36 weights/sec, but a resolution of 9 falls to only ~28.5 weights/sec
• The time taken per weight grows more slowly in SPANN, so once the training data reaches a significant size, SPANN becomes much quicker per weight
• The speedup factor is related to the training data size: the larger the data, the larger the speedup
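As a sanity check of those throughput figures, assuming they are simply total weights divided by total training time (both values come from this slide; matching them to the resolution-8 and resolution-9 labels is an inference from the numbers):

```python
print(37_890 / 1_313)   # ≈ 28.9 weights/sec for the larger serial run (~28.5 quoted)
print(30_240 / 842)     # ≈ 35.9 weights/sec for the smaller run (~36 quoted)
```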

Page 19

RESULTS CONT.

Page 20

SPANN CONCLUSIONS

Developed an architecture that can scale into billions of weights or synapses

Succeeds by reducing the communication requirements between layers to a few “gatekeeper” nodes

Uses a human biological model as inspiration

Page 21

SCALING ANNS CONCLUSIONS

• Neural networks are a tool that has driven significant developments in the artificial intelligence and machine learning fields

• Scaling issues are significant, even though the calculations are embarrassingly parallel

• Communication

• Computational

• SPANN showed promising results

• Research continues today

• Heavy focus on communication, as training set sizes are growing faster than the computational requirements in many cases

Page 22

QUESTIONS?