
CS5010 Assignment 2


Contents:
Neural Networks and Deep Learning
Zero Hidden Layers
Multiclass Support Vector Machine (MSVM)
Softmax (Binary Logistic Regression)
Neural Networks (NNs)
Heuristic Interpretation
Neurons
Activation Function
Forward Pass
Backpropagation (Back Pass)
Stochastic Gradient Descent
Deep Learning
Local Minima
Recurrent Neural Networks (RNN)
RNN Neurons
Forward Pass and Backpropagation with RNN
RNN Deep Learning
Games
Tic-Tac-Toe
Self-Play
Chess
Adding Stochastic Moves: Backgammon
Appendix
Interpreting Weights
RNN & Image Tags
References

Neural Networks and Deep Learning:

We follow the methodology of (Karpathy, 2016), using image classification to introduce neural networks. We start by examining linear classification in order to introduce concepts such as score and loss functions, as well as the principles of backpropagation, which are used in neural networks. Next we examine Feedforward Neural Networks with one hidden layer, before transitioning to a discussion of Deep Learning. Finally, we examine Recurrent Neural Networks.

Zero Hidden Layers: (Karpathy, 2016)

Images are visual representations of multidimensional matrices of pixel intensities. On their own these numbers are meaningless to a machine (with respect to understanding the image). Using score functions we can begin to teach a machine to group images into classes.

A score function can translate entire matrices into individual numbers; numbers assigned to classes. Linear functions are the most basic type of score function, multiplying input numbers by a series of weights (f(x_i, W) = W·x_i, where x_i represents pixel intensities and W the weights). As an example, if we were to pass an image of a cat through two score functions, one identifying animals, the other objects, we would get two numbers. Using these scores a machine should be able to identify whether the cat is an animal or object (select the max score).
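As a minimal sketch of such a linear score function (the class names, sizes, and random weights are illustrative assumptions, not part of the original text):

```python
import numpy as np

# Hypothetical linear score function f(x, W) = W·x mapping an image's pixel
# intensities to one score per class ("animal", "object").
num_pixels, num_classes = 1000, 2
x = np.random.rand(num_pixels)                        # flattened pixel intensities
W = np.random.randn(num_classes, num_pixels) * 0.01   # one row of weights per class

scores = W @ x                             # shape (2,): [animal score, object score]
predicted_class = int(np.argmax(scores))   # pick the class with the maximum score
```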

The above example, and the machine’s ability to correctly classify the cat, depends on the weights used in the score function. These weights need to be set in such a way that a higher score results for the animal class than the object class when a cat image is passed into the score function. We can achieve this through an iterative process requiring loss functions and gradient descent.

A loss function tries to quantify our ability to successfully classify inputs using the score function. Focusing on supervised learning[footnoteRef:1], in our example we know that the image of the cat belongs in the animal class. This knowledge allows us to translate a series of scores into a loss. We shall now describe two approaches for calculating a 'loss'. [1: A form of learning where both input and output are known for a data set]

Multiclass Support Vector Machine (MSVM):

Figure 1: Formula from (Karpathy, 2016); L is the loss, s is the score from the score function, sy is the score of the correct class, sj is the score of the incorrect class, ∆ is a safety margin (the score of the incorrect class must be smaller by more than ∆ for L to be zero).
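The figure itself is not reproduced in this transcript; the standard multiclass SVM loss from the CS231n notes, consistent with the caption above, is:

```latex
L_i = \sum_{j \neq y_i} \max\left(0,\; s_j - s_{y_i} + \Delta\right)
```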

The above formula describes how we calculate the MSVM loss. Using our example, if the animal class score function for our image generated a score of 10 and the object class score function generated a score of 20, then, assuming a ∆ of 0, MSVM would give a loss of 10.

Softmax (Binary Logistic Regression):

Figure 2: Formula from (Karpathy, 2016); f is the score from the score function, fy is the score of the correct class, fj is the score of the incorrect class. Note, the term in the brackets can be thought of as a probability – eg: the probability that the score function correctly groups our image. Due to the –log, small probabilities translate to large loss values.
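Again the figure is not reproduced here; the standard Softmax (cross-entropy) loss, consistent with the caption, is:

```latex
L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)
```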

Using the same score values as in the MSVM example, we would get a Softmax loss of approximately 4.34 (a figure that corresponds to a base-10 logarithm; with the natural logarithm the loss would be roughly 10).
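A short sketch checking both worked examples (as noted above, the 4.34 figure matches a base-10 logarithm):

```python
import numpy as np

scores = np.array([10.0, 20.0])   # [animal (correct class), object (incorrect class)]
correct = 0
delta = 0.0

# MSVM (hinge) loss: sum over incorrect classes of max(0, s_j - s_y + delta)
margins = np.maximum(0.0, scores - scores[correct] + delta)
margins[correct] = 0.0
msvm_loss = margins.sum()                        # -> 10.0

# Softmax loss: -log of the probability assigned to the correct class
shifted = scores - scores.max()                  # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()
softmax_loss_natural = -np.log(probs[correct])   # ~10.00 with the natural logarithm
softmax_loss_base10 = -np.log10(probs[correct])  # ~4.34 with a base-10 logarithm
```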

In both the MSVM and Softmax examples above something is wrong with our score function; the machine would incorrectly group a cat into the object class. Both loss functions generate positive loss values to indicate this[footnoteRef:2] - zero loss implies perfect classification. We can begin teaching machines to better classify images[footnoteRef:3] by adjusting weights in the score function. Specifically, to improve classification, we focus on minimizing the loss over a data set of images by adjusting the weights in the score function. [2: Whilst loss and score numbers are meaningless on their own, this example demonstrates that positive loss values result when an object is classified into the wrong group (even if correctly classified, the ∆ can generate a positive loss).] [3: In our example into animals and objects]

Neural Networks (NNs):

NNs are an extension of the linear classification (LC) approach detailed above. Whilst a linear classifier must move directly from input to output[footnoteRef:4], NNs intrinsically do not. As an example, a two layer NN could perform a mapping from 1000 to 100 to 2. [4: Eg1: if images consisted of 1000 pixels and our ambition was to classify them into two groups, then we would need a system moving from 1000 inputs to 2 outputs]

Heuristic Interpretation:

An NN looks to identify 100 features within the image (hidden layer). If a feature is present, the corresponding neuron fires into the second layer, which sums all neuron signals using specific weights before outputting a score value[footnoteRef:5]. [5: Eg2: a feature for eyes in an animal could correspond to one of the 100 neurons in the hidden layer, which if firing would increase the probability of a high animal score]

NNs can in general increase the number of successful classifications. Below we move to discussing the fundamental components of NNs (neurons), before detailing the process for setting up a NN.

Neurons:

Whilst LC used a basic function to transition from inputs to outputs, NNs use numerous neurons. Focusing on feedforward NNs, a neuron can be thought of as:

Figure 3: Images from (Nielsen, 2016) and (Karpathy, 2016) respectively showing the fundamental components of NNs.

Similarly to LC, above the three inputs x0, x1, and x2 are initially adjusted using weights w0, w1, and w2. Diverging from LC, the sum then becomes an input to an activation function, whose output progresses through the NN. This allows NNs to model non-linear behaviour, mimicking biological neurons.
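A minimal sketch of a single neuron of this kind, assuming a sigmoid activation and illustrative inputs and weights:

```python
import numpy as np

# Hypothetical single neuron: weighted sum of inputs passed through an
# activation function (here a sigmoid), mirroring Figure 3.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs x0, x1, x2
w = np.array([0.8, 0.1, -0.4])   # weights w0, w1, w2
b = 0.2                          # bias / threshold term

output = sigmoid(np.dot(w, x) + b)   # value passed on to the next layer
```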

Activation Function:

Activation functions are responsible for distinguishing NNs from LC, and they give rise to numerous neuron types. Nielsen notes that perceptrons were the first neurons developed. Perceptrons can act as NAND gates, firing if the weighted input exceeds a threshold:

Figure 4: from (Nielsen, 2016): b is the threshold, 1 is the output when adjusted input exceeds the threshold.
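The figure is not reproduced here; the perceptron rule the caption describes, with b as the threshold, can be written as:

```latex
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j \le b \\
1 & \text{if } \sum_j w_j x_j > b
\end{cases}
```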

Since perceptrons were inadequate for gradual weight ('w') adjustment (a small change in 'w' can flip the output entirely), new neurons[footnoteRef:6] were developed. [6: Neurons able to translate small changes in 'w' into small changes in output.]

The next neurons to be developed were sigmoid neurons. By replacing the step threshold with a smooth sigmoid activation function (the binary form of the Softmax discussed above), sigmoid neurons allowed adequate weight adjustment during NN setup.[footnoteRef:7] [7: As an example, Nielsen uses sigmoid neurons in developing a NN to classify handwritten numbers, achieving roughly 98% classification accuracy.]

Nevertheless, because sigmoid neurons suffer from two problems, namely not being zero-centred and "saturating and killing gradients" (Karpathy, 2016), Karpathy recommends using Rectified Linear Unit (ReLU)[footnoteRef:8] or Leaky ReLU activation functions. These accelerate the convergence of stochastic gradient descent and are relatively simple to implement. [8: f(x)=max(0,x)]
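Illustrative sketches of these activation functions (the Leaky ReLU slope alpha is an assumed hyperparameter):

```python
import numpy as np

# Activation functions discussed above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)            # f(z) = max(0, z)

def leaky_relu(z, alpha=0.01):           # alpha is a small, assumed slope for z < 0
    return np.where(z > 0, z, alpha * z)
```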

Forward Pass:

Borrowing from Nielsen, we shall now describe the forward pass with one image for a NN with one hidden layer that recognises handwritten digits. The visual representation of the NN is:

Figure 5: Image from (Nielsen, 2016): a 2 layer NN with 784 input variables, 15 neurons in the hidden layer, and 10 output neurons.

An image with 784 pixels is passed into the NN. The model passes all inputs into each neuron in the hidden layer, where they are adjusted by a vector of weights (Hadamard product w[footnoteRef:9] ⊙ x). The Hadamard product is then summed for each neuron before being passed through a ReLU activation function, which propagates only non-negative values through the remainder of the system. Next, all positive outputs (zeros have no continued impact) from the hidden layer are passed into each output neuron. In each output neuron another Hadamard product is taken (w ⊙ x, where x is now the hidden layer output), and its sum generates a score value. Note, the score is not passed through an activation function at the output. Instead, we pass the score value through a loss function (Softmax, as described in LC) to allow for weight adjustment in the next process. [9: We assume that any "bias offset" (Karpathy, 2016), including the threshold 'b', is incorporated into w.]
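A minimal sketch of this forward pass, assuming random illustrative weights and, for clarity, biases kept separate rather than absorbed into w:

```python
import numpy as np

# Forward pass for a 784 -> 15 -> 10 network: ReLU in the hidden layer,
# raw scores at the output (layer sizes and random weights are illustrative).
def relu(z):
    return np.maximum(0.0, z)

x = np.random.rand(784)                  # flattened 28x28 image
W1 = np.random.randn(15, 784) * 0.01     # hidden-layer weights
b1 = np.zeros(15)                        # hidden-layer biases (shown separately here)
W2 = np.random.randn(10, 15) * 0.01      # output-layer weights
b2 = np.zeros(10)

hidden = relu(W1 @ x + b1)               # element-wise weighting, sum, then ReLU
scores = W2 @ hidden + b2                # raw class scores; a loss (e.g. Softmax) is applied next
```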

Backpropagation (Back Pass):

Whereas simple gradient descent sufficed to optimize the LC loss function, for the network above this is not directly practicable due to the presence of a hidden layer. Nevertheless, weight adjustment is still possible in an NN through a process called backpropagation. The principle of backpropagation originates from applying the chain rule to gradient descent, resulting in the following four equations:

Figure 6: Equations from (Nielsen, 2016). The first equation gives the change in the loss function for a given change in the weighted input z = a·w, where 'a' stands for the input arriving from the previous layer. It states that this change equals the derivative of the loss function with respect to the output activation, multiplied by the derivative of the activation function with respect to z (for a sigmoid activation this derivative is sigmoid(z)(1 − sigmoid(z))). In our forward-pass example this reduces to the derivative of the Softmax loss. The second equation extends the first to the preceding layer using the chain rule – in our example, the derivative of the loss with respect to the weighted input of the hidden layer. Specifically, it says that this derivative equals the derivative from the following layer (the output layer in our example), multiplied by the weights of that layer, and multiplied by the derivative of the hidden layer's activation function with respect to z. The final two equations translate these derivatives, which are with respect to z (= a·w), into derivatives with respect to w and b (the only variables that we can change). Note: when defining z we considered the threshold b as part of the weight vector w; furthermore, what we called x at the input layer appears as a in the final equation.
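Since the figure itself is not reproduced in this transcript, the four equations as given in Nielsen (his BP1–BP4, with z^l = w^l a^(l-1) + b^l and σ the activation function) can be written as:

```latex
\begin{aligned}
\delta^{L} &= \nabla_{a} C \odot \sigma'(z^{L}) && \text{(error at the output layer)}\\
\delta^{l} &= \left((w^{l+1})^{T}\delta^{l+1}\right) \odot \sigma'(z^{l}) && \text{(error propagated back to layer } l\text{)}\\
\frac{\partial C}{\partial b^{l}_{j}} &= \delta^{l}_{j} && \text{(gradient with respect to the biases)}\\
\frac{\partial C}{\partial w^{l}_{jk}} &= a^{l-1}_{k}\,\delta^{l}_{j} && \text{(gradient with respect to the weights)}
\end{aligned}
```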

When combined, these equations yield the following backpropagation formula:

Figure 7: Formula from (Nielsen, 2016): The above formula expands all the δ terms from the previous four equations until we reach a relationship between the loss/cost function and the original input weights. Note that this formula can be adjusted to relate to any weights (in any layer) of the NN.

The formula above specifies how a small change in the weights in the input layer will impact the loss/cost function. The formula can be adjusted to relate to any weights at any layer within the NN. This allows us to manipulate the weights within the entire NN, optimizing the loss function, and in the process improving the NN's ability to classify handwritten digits.
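A compact, illustrative backward pass for the same sketch network, assuming a Softmax loss at the output and ReLU in the hidden layer (weights and the target class are random stand-ins):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.random.rand(784)
y = 3                                             # index of the correct digit class
W1 = np.random.randn(15, 784) * 0.01; b1 = np.zeros(15)
W2 = np.random.randn(10, 15) * 0.01;  b2 = np.zeros(10)

# Forward pass (as before)
z1 = W1 @ x + b1; h = relu(z1)
scores = W2 @ h + b2
probs = np.exp(scores - scores.max()); probs /= probs.sum()
loss = -np.log(probs[y])

# Backward pass: chain rule from the loss back to every weight
dscores = probs.copy(); dscores[y] -= 1.0         # dL/dscores for the Softmax loss
dW2 = np.outer(dscores, h); db2 = dscores         # gradients at the output layer
dh = W2.T @ dscores                               # error propagated to the hidden layer
dz1 = dh * (z1 > 0)                               # ReLU derivative
dW1 = np.outer(dz1, x); db1 = dz1                 # gradients at the hidden layer

learning_rate = 0.1                               # gradient-descent weight update
W2 -= learning_rate * dW2; b2 -= learning_rate * db2
W1 -= learning_rate * dW1; b1 -= learning_rate * db1
```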

Stochastic Gradient Descent:

The loss function for both LC and NN was computed by taking the average of losses over all input images. Whilst this leads to more accurate gradient estimations, when data sets contain thousands of images it may be favourable to sacrifice accuracy for savings in processing time. Batch estimation offers a means of doing this, whereby we divide the data set into batches, which provide a sample estimate of the mean loss. This mean loss may then be used in backpropagation to adjust the weights in the NN more quickly.
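A minimal mini-batch SGD loop sketch; compute_gradients is a hypothetical function returning the average gradient over a batch, and data/labels are assumed to be NumPy arrays:

```python
import numpy as np

# Shuffle the dataset, split it into batches, and update the weights from
# each batch's average gradient.
def sgd(data, labels, W, compute_gradients, batch_size=30, learning_rate=0.1, epochs=10):
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)               # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = compute_gradients(data[idx], labels[idx], W)
            W = W - learning_rate * grad               # step along the batch gradient
    return W
```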

Deep Learning:

Figure 8: Image from (Nielsen, 2016): visual representation of Deep Learning.

Deep Learning refers to NN structures with more than one hidden layer. We first examine reasons for implementing deep learning, before detailing some of the complications of using multiple hidden layers when implementing backpropagation.

Cybenko proved that a neural network with a single hidden layer can be set up to approximate any continuous function (Cybenko, 1989). Whilst true, both Karpathy and Nielsen argue in favour of implementing multiple layers and using more complex NNs. In the same way that modelling a function by identifying unique relationships for every input variable x is ill-advised (using dummy variables for every x), trying to model a function using an ever-increasing number of neurons in a 2 layer NN[footnoteRef:10] is inefficient. Elaborating through an example, Hastad found that the number of neurons a shallow circuit requires to compute the parity of its input bits increases exponentially with the problem size (Hastad, 2014). Using multiple hidden layers avoids this computational problem. Whilst merely an example, it justifies why researchers have turned towards deep learning. [10: Resembling a list of AND and NAND operations for every scenario.]

To understand the reason for using multiple hidden layers, we revisit our heuristic interpretation of the weights. Previously we mentioned that by including a hidden layer, images could be dissected into features before being recombined during output classification. We mentioned that a NN with one hidden layer could dissect inputs along the lines of: 1000 to 100 to 2. Deep learning allows us to incorporate more complex features into the classification problem; features that result from combining previously identified characteristics.[footnoteRef:11] We can imagine a partition following: 1000 to 100 to 34 to 2, where the first hidden layer has 100 neurons and the second 34. Karpathy notes that: [11: An example would be moving from corners to entire squares to a final output class. The identification of entire squares given the presence of corners would increase classification efficiency.]

“[That deep learning] can work better than a single-hidden-layer network is an empirical observation, despite the fact that their representational power is equal.” (Karpathy, 2016)

Nevertheless, Karpathy equally notes that exceeding two hidden layers adds little benefit to classification efficiency. This may stem from the difficulties in setting up multiple hidden layers[footnoteRef:12]. Specifically, traditional backpropagation in NNs with multiple hidden layers effectively adjusts weights in the last layer, but not in all layers.[footnoteRef:13] This reduces the benefit of adding further layers to a NN. Dropout and L2 regularization can be used to control overfitting, rather than limiting the size of the NN. [12: As the number of hidden layers increases, the process of adjusting weights through backpropagation becomes more difficult.] [13: Nielsen demonstrates that the vanishing gradient problem results in decreasing learning speeds for earlier layers as the NN becomes more complex. Specifically, classification efficiency remains roughly constant because significant learning continues in only the final layer, with minimal weight adjustments in the others.]
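Illustrative sketches of the two regularizers just mentioned (the regularization strength and dropout probability are assumed hyperparameters):

```python
import numpy as np

def l2_penalty(weights, lambda_reg=1e-3):
    # Added to the data loss so that large weights are penalised
    return 0.5 * lambda_reg * sum(np.sum(W * W) for W in weights)

def dropout(hidden, p=0.5, training=True):
    # Inverted dropout: randomly zero a fraction of hidden activations during
    # training and rescale, so no change is needed at test time
    if not training:
        return hidden
    mask = (np.random.rand(*hidden.shape) < p) / p
    return hidden * mask
```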

Local Minima:

As has been mentioned, adding additional layers increases the complexity of the estimation function. Consequently, as the size of the NN increases, the number of local minima increases concurrently. Whilst this is an argument in favour of using smaller NNs, Nielsen argues that though higher order NNs have more local minima, the local minima encountered outperform those of smaller NNs. This plays into Karpathy's recommendation to use 2 hidden layers.

Recurrent Neural Networks (RNN):

RNNs provide a means of dynamically modelling events, allowing us to tackle logical problems and forecast future variables (see Appendix). Pertaining to forecasting, it is the author's belief that RNNs could be used to forecast opponents' future moves[footnoteRef:14] in games. [14: Where the opponent does not strictly follow Min Max rules for move selection]

RNN neurons:

Figure 9: Images from (Li, Karpathy, & Johnson, 2016) and (Karpathy, 2016) respectively: showing a basic RNN neuron following the internal score function Ht = fW(Ht-1, Xt), where Ht is the RNN output score at time t and Xt is the input. Note the score output from the RNN is usually further manipulated using a series of adjustment weights to reach y; the image on the right represents the RNN box on the left.

Focusing on the right image above, we see that RNN neurons differ from previously discussed neurons by their use of Ht-1. The neuron may once again implement a ReLU or Softmax activation function; however, previous outputs now impact present outputs.
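A minimal sketch of one RNN step; the weight names and the tanh activation are common conventions rather than taken from the figure:

```python
import numpy as np

# One recurrent step h_t = f_W(h_{t-1}, x_t), here using
# h_t = tanh(W_hh·h_{t-1} + W_xh·x_t) with an output y_t = W_hy·h_t.
hidden_size, input_size, output_size = 8, 4, 2
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_xh = np.random.randn(hidden_size, input_size) * 0.01
W_hy = np.random.randn(output_size, hidden_size) * 0.01

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # the previous output feeds the present one
    y_t = W_hy @ h_t                            # score output at time t
    return h_t, y_t

# Unrolled over a short input sequence
h = np.zeros(hidden_size)
for x_t in np.random.rand(5, input_size):
    h, y = rnn_step(h, x_t)
```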

Forward Pass and Backpropagation with RNN:

Figure 10: RNN from (Nervana, 2016): showing a RNN layer unrolled into a feedforward representation.

The above follows the same feedforward process as a normal NN, with the sole exception that WRhj is an input to every neuron. We should note that throughout the entire feedforward process the weights marked WR and WI remain the same. This is important when considering backpropagation through the system to find how the weights affect loss/cost values. Since the outputs hj have to be stored for every neuron, and large systems (numerous inputs) can reduce gradient estimation accuracy[footnoteRef:15], a technique utilising batches of data is implemented when backpropagating. This allows for more gradual, accurate, and consistent weight adjustment (e.g. after every 30 inputs). Below we show the formula for RNN backpropagation: [15: Due to the presence of W in the differential of C with respect to W, we can get exploding or vanishing gradients as the data sample size increases. The following function from (Nervana, 2016) captures this relationship:]

Figure 11: Formula from (Nervana, 2016): formula showing the chain rule expanded backpropagation derivative of C with respect to W – how W impacts the loss function after 100 cycles (after 100 input variables are passed through the system).
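The Nervana slide is not reproduced here; the general form of this chain-rule expansion for a recurrent weight matrix (written in our own notation, not copied from the slide) is:

```latex
\frac{\partial C}{\partial W_R}
  = \sum_{t} \sum_{k=1}^{t}
    \frac{\partial C_t}{\partial h_t}
    \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
    \frac{\partial h_k}{\partial W_R}
```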

RNN Deep Learning:

Below we show an image of a multilayer RNN:

Figure 12: Image from (Li, Karpathy, & Johnson, 2016): a multilayer RNN, as could be used for autonomous image caption generation.

As both the problems with deep learning explored under NNs and the principles of backpropagation apply here, we do not elaborate further on the functioning of this system.

Games:

We now move to exploring solutions to two games (chess and backgammon) using NNs. While thus far we have used supervised learning when exploring NNs, we now transition to implementing reinforcement learning[footnoteRef:16]. Specifically, we concentrate on using 'self-play' (SP) to set up our NNs[footnoteRef:17]. Below we begin by examining the simple game of tic-tac-toe, focusing on describing Min Max policy determination and its translatability into NNs, before transferring the lessons learnt to the more complex game of chess, which incorporates look-ahead search, heuristic approximation, and machine learning techniques such as Monte Carlo simulation and Temporal Difference evaluation. Finally, we extend our analysis by incorporating stochastic events into the game; we examine backgammon. [16: Where a system is not provided with model Xs or Ys.] [17: Determine weights in the NN.]

Tic-Tac-Toe (TTT):

Due to the small size of the state space in tic-tac-toe, a program can be written which identifies maximum value states using tree search and performs actions accordingly. As an example, imagine a machine is tasked with finding the best next move for the following state (X to move):

Intuitively, we know that the next best move, which results in the computer winning the game, is in the top right quadrant. To arrive at this same conclusion the computer could sample possible moves, propagating the search until all win, lose, and draw states are identified. Below we show a diagram of this:

Figure 13: Image from (Shewchuk, 2008): shows the entire tree of possible state spaces for a game of TTT from the previously shown start state.

Assume we assign a value of one to winning states, zero to draw states, and minus one to losing states. It then becomes a problem of identifying a path to the best/max state – this is a greedy policy. Note, however, that as TTT is an adversarial game, not all states can be reached. When considering paths, the computer must account for the 'Human's' strategy. In doing this the computer assumes that the 'Human' will equally follow a greedy policy, which translates to selecting the minimum value in the state space diagram. Consequently we can identify possible paths from the current state and back-propagate terminal values up to it. Concluding, in the diagram the computer is faced with three choices, with values [-1, 1, -1], and consequently decides to place the X in the top right quadrant. Whilst the above solution did not use NNs or require SP, we detailed it to introduce the basic machine learning concepts in adversarial games. These are concepts that we shall later implement when considering chess.
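A minimal minimax sketch for TTT under the value assignment above; the board encoding and helper names are our own, not taken from the cited lecture:

```python
# Board: tuple of 9 cells holding +1 (computer), -1 (human) or 0 (empty).
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(state):
    for a, b, c in LINES:
        if state[a] != 0 and state[a] == state[b] == state[c]:
            return state[a]               # +1 if the computer has won, -1 if the human has
    return None

def minimax(state, player):
    w = winner(state)
    if w is not None:
        return w                          # terminal state: back up its value (+1 / -1)
    moves = [i for i, cell in enumerate(state) if cell == 0]
    if not moves:
        return 0                          # board full: draw
    values = []
    for m in moves:
        nxt = list(state); nxt[m] = player
        values.append(minimax(tuple(nxt), -player))
    # The computer (+1) maximises its value; the human (-1) minimises it
    return max(values) if player == 1 else min(values)
```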

Instead of following the above-mentioned approach, imagine we tried to implement an NN solution with SP. To implement an NN we would have to find a way of translating state spaces into scores, and subsequently loss values. This is possible by considering the state space as a vector of features – similarly to how your location in a room can be described by your distance to every wall. From these features we can then calculate a score. Whilst previously we had a variable y to compare scores against, in this approach we have to estimate y. Similarly to A* approaches to search problems, this can be done using heuristics combined with estimation methods such as Monte Carlo simulation or TD(λ). Focusing on TD(0), the following weight-update formula results:

Figure 14: Formula from (Silver, 2015) using a quadratic loss function: the variable highlighted in red represents the estimated state value (the current reward plus the next state's estimated value, estimated using the NN). The other variables correspond with previously highlighted formulas (w – weights, S – current state, S' – future state, v – score value for a particular state). Note: being an adversarial game, we would reach the next state only after both we and our adversary move, and hence weight adjustment would automatically take into account the move strategy followed by the adversary. Note: the above resembles the Q-learning formula.
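The figure itself is not reproduced here; the standard TD(0) update with function approximation, consistent with the caption (α is the learning rate and γ the discount factor), is:

```latex
\Delta w = \alpha \left( R + \gamma\, \hat{v}(S', w) - \hat{v}(S, w) \right) \nabla_{w}\, \hat{v}(S, w)
```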

Self-Play:

To continue solving the TTT problem we need to play the game, and play against an adversary. By forcing the program to operate as both player and opponent we can satisfy both these conditions. SP offers numerous advantages over playing against a human opponent. Principally, the program is able to learn more, even whilst maintaining an equal learning rate, because it can simulate significantly more games. Whilst playing against a human can result in more varied strategies (and also highlights particularly important strategies[footnoteRef:18]), resulting in more efficient state space exploration, the machine can mimic this by making a random/exploratory move with probability ρ. This forces state exploration and rapid optimization of the weights. Additionally, by acting as both the player and the opponent, the program receives twice the number of inputs, improving optimization even further. [18: This is an important point, as it may be worthless to explore, for instance, 75% of opening moves in e.g. chess. By performing self-play the program may waste its time exploring these states.]
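A sketch of the exploratory move choice just described; value_estimate is a hypothetical function standing in for the NN's state-value estimate:

```python
import random

# With probability rho the program plays a random legal move (exploration),
# otherwise it plays the greedy, highest-value move.
def choose_move(state, legal_moves, value_estimate, rho=0.1):
    if random.random() < rho:
        return random.choice(legal_moves)                             # exploratory move
    return max(legal_moves, key=lambda m: value_estimate(state, m))   # greedy move
```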

Chess:

The theory covered for TTT can be directly transferred to chess. Whilst in TTT the entire state space could be explored, with chess it is easy to imagine ending up in an unexplored state, for which backpropagation would not yet have provided an estimated state value. In this case, using heuristics, look-ahead (TD(λ)), and our NN approach would provide the best method for determining a move. The heuristics that we could use to determine features in our NN might include: the position of the opponent's queen, the position of the opponent's king, the difference in points (chess pieces have points) between player and opponent, etc. We could set up a deep learning NN to further improve state classification.

Adding Stochastic Moves: Backgammon:

Backgammon is a strategic game (inheriting solution principles from chess); however, it uses dice, a stochastic means of determining a player's possible moves. This creates uncertainty in the system and explains why NN approaches to solving backgammon have concentrated on maximizing the probability of winning when sampling and selecting moves. Whilst it may seem that the uncertainty can be ignored when determining which move to select, one must be aware that the probability of winning from a state S comprises the sum, over the entire dice roll space, of the probabilities of winning given each roll. In this way the problem resembles an NN incorporating a Markov Decision Process (MDP).
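One way to write the expectation described above, in our own notation (A(S, roll) is the set of moves legal for that roll, and S'(S, a) the state reached by playing move a):

```latex
P(\text{win} \mid S) \;=\; \sum_{\text{roll}} P(\text{roll}) \; \max_{a \,\in\, A(S,\, \text{roll})} P\!\left(\text{win} \mid S'(S, a)\right)
```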

Approaches by Tsinteris and Tesauro have used TD(λ) along with NNs and shallow-depth search to solve the problem (Tsinteris & Wilson, 2001) (Tesauro, 2002). Most famously, TD-Gammon, which incorporated temporal-difference learning and used SP, was able to reach intermediate level proficiency, without using hand-designed features in the NN.

Appendix: Interpreting Weights:

Continuing with images, the weights can either be thought of as a whitewashed generalised image of a specific class (Karpathy describes them as a “template” – see Appendix A), or as weights extracting features from an input.

Considering these interpretations in step, to classify images into groups we could compare them against a database of images compiled for each class. The problem with this approach is, amongst other things, that it is very time intensive. Alternatively, we could combine the images in a database somehow to create a template image against which inputs would be considered. This is similar to our approach and provides one interpretation for the score function weights.

The second interpretation involves imagining the weights as highlighters of features. If, for instance, objects in the dataset used during the learning process were more defined (square) whilst animals were less defined (rounder, coming in a range of complex shapes), then we would expect corner pixels to have higher weights for the object class. In other words, an image with corners would produce a high object score, and thus probably be classed in the object class.

This interpretation is important in understanding the motivation for setting up neural networks.

RNN & Image Tags:

Perhaps most recently, researchers have begun connecting RNNs to Convolutional NNs focusing on image classification. This allows them to add descriptions/tags to categorised images – see Ma et al. 2015.

References

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.
Hastad, J. (2014). On the correlation of parity and small-depth circuits. Electronic Colloquium on Computational Complexity, Revision 1 of Report No. 137.
Karpathy, A. (2016, November 23). CS231n: Convolutional Neural Networks for Visual Recognition. Retrieved from cs231n.github.io: https://cs231n.github.io/
Li, F.-F., Karpathy, A., & Johnson, J. (2016, February 8). CS231n Lecture 10 - Recurrent Neural Networks, Image Captioning, LSTM. Retrieved from YouTube: https://www.youtube.com/watch?v=iX5V1WpxxkY
Ma, L., Lu, Z., & Li, H. (2015, November 13). Learning to Answer Questions From Image Using Convolutional Neural Network. Retrieved from arXiv: https://arxiv.org/pdf/1506.00333.pdf
Nervana. (2016, July 14). Recurrent Neural Networks. Retrieved from YouTube: https://www.youtube.com/watch?v=Ukgii7Yd_cU
Nielsen, M. (2016, January). Deep Learning: draft book. Retrieved from neuralnetworksanddeeplearning.com: http://neuralnetworksanddeeplearning.com/chap1.html
Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Boston: Pearson.
Russell, S., & Norvig, P. (2016, October 5). Introduction to Artificial Intelligence. Retrieved from Udacity: https://www.youtube.com/watch?v=DZzffdHNqtQ&list=PLAwxTw4SYaPlqMkzr4xyuD6cXTIgPuzgn&index=374
Shewchuk, J. (2008, July 24). CS 61B Lecture 16: Game Trees. Retrieved from UC Berkeley: https://www.youtube.com/watch?v=Unh51VnD-hA&index=16&list=PLD4CAE0D1D2EEF760
Silver, D. (2015). Lecture 6: Value Function Approximation. Retrieved from UCL Course on RL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
Sutton, R., & Barto, A. (2012). Reinforcement Learning: An Introduction. Cambridge, Massachusetts: The MIT Press.
Tesauro, G. (2002). Programming backgammon using self-teaching neural nets. Artificial Intelligence, 134, 181-199.
Tsinteris, K., & Wilson, D. (2001, September). TD-learning, neural networks, and backgammon. Retrieved from Cornell: https://www.cs.cornell.edu/boom/2001sp/Tsinteris/gammon.htm