
Not another MNIST tutorial with TensorFlow
An informative, visual, and interactive MNIST tutorial.
By Justin Francis, December 6, 2016

Numbers 0-9. (source: Denise Krebs on Flickr).

For an end-to-end guide to TensorFlow, check out Learning TensorFlow, by Itay Lieder, Yehezkel S. Resheff, and Tom Hope.

Bear with me: MNIST is where everyone in machine learning starts, but I hope this tutorial is different from the others out there.

Back when TensorFlow was released to the public in November 2015, I remember following TensorFlow's beginner MNIST tutorial. I blindly copied and pasted all this code into my terminal and some numbers popped out as they should have. I thought, OK, I know there is something amazing happening here, why can I not see it? My goal was to make a MNIST tutorial that was both interactive and visual, and hopefully will teach you a thing or two that others just assume you know.


In this tutorial, I will be using the machine learning library TensorFlow with Python 3 on Ubuntu 14.04. If you need help installing TensorFlow on your own system, check out my tutorial here.

If you don't have numpy and matplotlib installed, you'll need them. Open a terminal and type in:

$ sudo apt-get install python-numpy python3-numpy python-matplotlib python3-matplotlib

To begin, we will open up Python in our terminal and import the MNIST data set:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
import matplotlib.pyplot as plt
import numpy as np
import random as ran

First, let's define a couple of functions that will assign the amount of training and test data we will load from the data set. It's not vital to look very deeply at these unless you want to figure out what's going on behind the scenes.

You will need to copy and paste each function and hit enter twice in your terminal:

def TRAIN_SIZE(num):
    print('Total Training Images in Dataset = ' + str(mnist.train.images.shape))
    print('--------------------------------------------------')
    x_train = mnist.train.images[:num,:]
    print('x_train Examples Loaded = ' + str(x_train.shape))
    y_train = mnist.train.labels[:num,:]
    print('y_train Examples Loaded = ' + str(y_train.shape))
    print('')
    return x_train, y_train

def TEST_SIZE(num):
    print('Total Test Examples in Dataset = ' + str(mnist.test.images.shape))
    print('--------------------------------------------------')
    x_test = mnist.test.images[:num,:]
    print('x_test Examples Loaded = ' + str(x_test.shape))
    y_test = mnist.test.labels[:num,:]
    print('y_test Examples Loaded = ' + str(y_test.shape))
    return x_test, y_test

And we'll define some simple functions for resizing and displaying the data:

def display_digit(num):
    print(y_train[num])
    label = y_train[num].argmax(axis=0)
    image = x_train[num].reshape([28,28])
    plt.title('Example: %d  Label: %d' % (num, label))
    plt.imshow(image, cmap=plt.get_cmap('gray_r'))
    plt.show()

def display_mult_flat(start, stop):
    images = x_train[start].reshape([1,784])
    for i in range(start+1,stop):
        images = np.concatenate((images, x_train[i].reshape([1,784])))
    plt.imshow(images, cmap=plt.get_cmap('gray_r'))
    plt.show()

Now, we'll get down to the business of building and training our model. First, we define variables with how many training and test examples we would like to load. For now, we will load all the data, but we will change this value later on to save resources:

x_train, y_train = TRAIN_SIZE(55000)


Total Training Images in Dataset = (55000, 784)
--------------------------------------------------
x_train Examples Loaded = (55000, 784)
y_train Examples Loaded = (55000, 10)

So, what does this mean? In our data set, there are 55,000 examples of handwritten digits from zero to nine. Each example is a 28x28 pixel image flattened in an array with 784 values representing each pixel's intensity. The examples need to be flattened for TensorFlow to make sense of the digits linearly. This shows that in x_train we have loaded 55,000 examples, each with 784 pixels. Our x_train variable is a 55,000 row and 784 column matrix.

The y_train data is the associated labels for all the x_train examples. Rather than storing the label as an integer, it is stored as a 1x10 binary array with the one representing the digit. This is also known as one-hot encoding. In the example below, the array represents a 7:

Figure 1. Array representing a 7. Source: Justin Francis.
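To make the encoding concrete, here is a small sketch of my own (assuming numpy is imported as np, as above) that converts a one-hot label back to its integer digit:

one_hot_seven = np.array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])  # hypothetical label for a 7
print(one_hot_seven.argmax())  # the index of the 1 is the digit: 7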

So, let's pull up a random image using one of our custom functions that takes the flattened data, reshapes it, displays the example, and prints the associated label (note: you have to close the window matplotlib opens to continue using Python):

display_digit(ran.randint(0, x_train.shape[0]))


Figure 2. [ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] Source: Justin Francis.

Here is what multiple training examples look like to the classifier in their flattened form. Of course, instead of pixels, our classifier sees values from zero to one representing pixel intensity:

display_mult_flat(0,400)


Figure 3. First 400 training examples. Source: Justin Francis.

Until this point, we actually have not been using TensorFlow at all. The next step is importing TensorFlow and defining our session. TensorFlow, in a sense, creates a directed acyclic graph (flow chart) which you later feed with data and run in a session:

import tensorflow as tf

sess = tf.Session()

Next, we can define a placeholder. A placeholder, as the name suggests, is a variable used to feed data into. The only requirement is that in order to feed data into this variable, we need to match its shape and type exactly. The TensorFlow website explains that "A placeholder exists solely to serve as the target of feeds. It is not initialized and contains no data." Here, we define our x placeholder as the variable to feed our x_train data into:


x = tf.placeholder(tf.float32, shape=[None, 784])

When we assign None to our placeholder, it means the placeholder can be fed as many examples as you want to give it. In this case, our placeholder can be fed any multitude of 784-sized values.
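As a quick check of my own (not from the original tutorial), you can push batches of different sizes through a trivial op to see that the None dimension accepts any number of examples:

doubled = x * 2  # any op on the placeholder works; this one just scales the pixel values
print(sess.run(doubled, feed_dict={x: np.zeros((5, 784))}).shape)    # (5, 784)
print(sess.run(doubled, feed_dict={x: np.zeros((200, 784))}).shape)  # (200, 784)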

We then define y_, which will be used to feed y_train into. This will be used later so we can compare the ground truths to our predictions. We can also think of our labels as classes:

y_ = tf.placeholder(tf.float32, shape=[None, 10])

Next, we will define the weights W and bias b. These two values are the grunt workers of the classifier; they will be the only values we will need to calculate our prediction after the classifier is trained.

We will first set our weight and bias values to zeros because TensorFlow will optimize these values later. Notice how our W is a collection of 784 values for each of the 10 classes:

W = tf.Variable(tf.zeros([784,10]))

b = tf.Variable(tf.zeros([10]))

I like to think of these weights as 10 cheat sheets for each number. This is similar to how a teacher uses a cheat sheet transparency to grade a multiple choice exam. The bias, unfortunately, is a little beyond the scope of this tutorial, but I like to think of it as a special relationship with the weights that influences our final answer.


We will now define y, which is our classifier function. This particular classifier is also known as multinomial logistic regression. We make our prediction by multiplying each flattened digit by our weight and then adding our bias:

y = tf.nn.softmax(tf.matmul(x,W) + b)

First, let's ignore the softmax and look at what's inside the softmax function. Matmul is the function for multiplying matrices. If you know your matrix multiplication, you would understand that this computes properly and that x * W + b results in a Number of Training Examples Fed (m) x Number of Classes (n) matrix.

Figure 4. Simple matrix multiplication. Source: Quartl on Wikimedia Commons.
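If it helps, here is a rough numpy sketch of that shape arithmetic (the arrays are made up purely for illustration): an m x 784 batch times the 784 x 10 weight matrix, plus the length-10 bias, gives an m x 10 result:

m = 3                                              # a hypothetical batch of 3 examples
fake_batch = np.zeros((m, 784), dtype=np.float32)
fake_W = np.zeros((784, 10), dtype=np.float32)
fake_b = np.zeros(10, dtype=np.float32)
print((fake_batch.dot(fake_W) + fake_b).shape)     # (3, 10)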

If you don't believe me, you can confirm it by evaluating y:

print(y)


Tensor("Softmax:0", shape=(?, 10), dtype=float32)

That tells us what y is in our session, but what if we want the values of y? You cannot just print a TensorFlow graph object to get its values; you must run an appropriate session in which you feed it data. So, let's feed our classifier three examples and see what it predicts. In order to run a function in our session, we first must initialize the variables in our session. Notice if you just run sess.run(y), TensorFlow will complain that you need to feed it data:

x_train, y_train = TRAIN_SIZE(3)

sess.run(tf.global_variables_initializer())

#If using TensorFlow prior to 0.12 use:

#sess.run(tf.initialize_all_variables())

print(sess.run(y, feed_dict={x: x_train}))

[[ 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1]
 [ 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1]
 [ 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1]]

So, here we can see our prediction for our first three training examples. Of course, our classifier knows nothing at this point, so it outputs an equal 10% probability for each possible class for our training examples.

But how did TensorFlow know the probabilities, you might ask? It learned the probabilities by calculating the softmax of our results. The softmax function takes a set of values and forces their sum to equal one, which will give probabilities for each value. Any softmax value will always be greater than zero and less than one. Still confused? Try running this or read up on what softmax is doing mathematically:

sess.run(tf.nn.softmax(tf.zeros([4])))

sess.run(tf.nn.softmax(tf.constant([0.1, 0.005, 2])))
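If it is still unclear, here is my own small numpy version of what softmax does (not part of the original article): exponentiate each value, then divide by the sum, so every output is positive and the outputs sum to one:

def softmax_np(values):
    # exponentiate, then normalize so the outputs sum to 1
    exps = np.exp(values)
    return exps / exps.sum()

print(softmax_np(np.zeros(4)))                  # four equal values of 0.25
print(softmax_np(np.array([0.1, 0.005, 2.0])))  # the largest input gets most of the probability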

Next, we will create our cross_entropy function, also known as a loss or cost function. It measures how good (or bad) a job we are doing at classifying. The higher the cost, the higher the level of inaccuracy. It calculates the loss by comparing the true values from y_train to the results of our prediction y for each example. The goal is to minimize your loss:

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

This function is taking the log of all our predictions y (whose values range from 0 to 1) and element-wise multiplying by the example's true value y_. If a prediction is close to zero, its log is a large negative number (e.g., np.log(0.01) = -4.6), and if it is close to 1, its log is a small negative number (e.g., np.log(0.99) = -0.01).


Figure 5. y = log(x). Source: Lfahlberg on Wikimedia Commons.

We are essentially penalizing the classifier with a very large number if the prediction is confidently incorrect and a very small number if the prediction is confidently correct.

Here is a simple made-up Python example of a softmax prediction that is very confident that the digit is a 3:

j = [0.03, 0.03, 0.01, 0.9, 0.01, 0.01, 0.0025, 0.0025, 0.0025, 0.0025]

Let's create an array label of "3" as a ground truth to compare to our softmax function:


k = [0,0,0,1,0,0,0,0,0,0]

Can you guess what value our loss function gives us? Can you see how the log of "j" would penalize a wrong answer with a large negative number? Try this to understand:

-np.log(j)
-np.multiply(np.log(j),k)

This will return nine zeros and a value of 0.1053, which, when all summed up, we would consider a good prediction. Notice what happens when we make the same prediction for what is actually a 2:

k = [0,0,1,0,0,0,0,0,0,0]

np.sum(-np.multiply(np.log(j),k))

Now, our cross_entropy function gives us 4.6051, which shows a heavily penalized, poorly made prediction. It was heavily penalized due to the fact the classifier was very confident that it was a 3 when it actually was a 2.
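To tie this back to the TensorFlow line above, here is a hedged numpy sketch of mine of the same calculation, the mean over examples of -sum(label * log(prediction)); reusing j and k from the example gives the same 4.6051:

def cross_entropy_np(y_true, y_pred):
    # mean over examples of -sum(label * log(prediction)), mirroring the TF formula above
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

print(cross_entropy_np(np.array([k]), np.array([j])))  # 4.6051 for the mislabeled 2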

Next, we begin to train our classifier. In order to train, we have to develop appropriate values for W and b that will give us the lowest possible loss.

Below is where we can now assign custom variables for training if we wish. Any value that is in all caps below is designed to be changed and messed with. In fact, I encourage it! First, use these values, then later notice what happens when you use too few training examples or too high or low of a learning rate.


If you set TRAIN_SIZE to a large number, be prepared to wait for a while. At any point, you can re-run all the code starting from here and try different values:

x_train, y_train = TRAIN_SIZE(5500)

x_test, y_test = TEST_SIZE(10000)

LEARNING_RATE = 0.1

TRAIN_STEPS = 2500

We can now initialize all variables so that they can be used by our TensorFlow graph:

init = tf.global_variables_initializer()

#If using TensorFlow prior to 0.12 use:

#init = tf.initialize_all_variables()

sess.run(init)

Now, we need to train our classifier using gradient descent. We first define our training method and some variables for measuring our accuracy. The variable training will perform the gradient descent optimizer with a chosen LEARNING_RATE in order to try to minimize our loss function cross_entropy:

training = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
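As a side note, the accuracy op just compares the argmax of each prediction row to the argmax of each label row and averages the matches; here is a toy numpy illustration of my own:

toy_pred = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])  # predicts class 1, then class 0
toy_true = np.array([[0, 1, 0], [0, 0, 1]])              # true classes are 1, then 2
matches = toy_pred.argmax(axis=1) == toy_true.argmax(axis=1)
print(matches.mean())                                    # 0.5, i.e., one of two correct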

Now, we'll define a loop that repeats TRAIN_STEPS times; for each loop, it runs training, feeding in values from x_train and y_train using feed_dict. In order to calculate accuracy, it will run accuracy to classify the unseen data in x_test by comparing its y and y_test. It is vitally important that our test data was unseen and not used for training data. If a teacher were to give students a practice exam and use that same exam for the final exam, you would have a very biased measure of students' knowledge:

for i in range(TRAIN_STEPS+1):
    sess.run(training, feed_dict={x: x_train, y_: y_train})
    if i%100 == 0:
        print('Training Step:' + str(i) + '  Accuracy =  ' +
              str(sess.run(accuracy, feed_dict={x: x_test, y_: y_test})) + '  Loss = ' +
              str(sess.run(cross_entropy, {x: x_train, y_: y_train})))

In order to visualize what gradient descent is doing, you have to imagine the loss as being a 784-dimensional graph based on y_ and y, which contains different values of x, W, and b. If you can't visualise 784 dimensions, that's to be expected. I highly recommend Chris Olah's blog to learn more about the dimensions involved with MNIST. To explain things more simply in two dimensions, we will use y = x^2:


Figure 6. Parabola y = x^2. Source: Adrignola on Wikimedia Commons.

For each step in the loop, depending on how large the cross_entropy is, the classifier will move a LEARNING_RATE step toward where it thinks cross_entropy's value will be smaller. This lower point is calculated by TensorFlow using the derivative of the cross_entropy, which gives the slope of the tangent line at a given point. As it moves toward this new point, the values W and b change and the slope decreases. As in the case of y = x^2, you can think of this as moving toward x = 0, which is also called the minimum. If the learning rate is too small, the classifier will take very small steps when learning; if it's too high, the steps it takes will be too large, and it may figuratively "overshoot" the true minimum.

Figure 7. Solid black line shows tangent line at given points. Source: Tosha on Wikimedia Commons.
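To make the y = x^2 picture concrete, here is a tiny descent loop of my own (not from the article); the derivative of x^2 is 2x, and the step size plays the role of LEARNING_RATE:

x_val = 5.0
step = 0.1                               # try 1.1 to see the "overshoot" described above
for _ in range(25):
    x_val = x_val - step * (2 * x_val)   # move against the slope of y = x^2
print(x_val)                             # close to 0, the minimum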

Notice how near the end, the loss was still decreasing but our accuracy slightly went down? This shows that we could still minimize our loss on our training data, but this may not help us predict the unseen testing data used for measuring accuracy. This is also known as overfitting (not generalizing). With the default settings, I got an accuracy of about 91%. If I wanted to cheat to get 94% accuracy, I could set the test examples to 100. This shows how not having enough test examples can give you a biased sense of accuracy.

Keep in mind this is a very ineffective way to train our classifier. But I did this on purpose for the sake of learning and experimentation. Ideally, when training with large data sets, you train using small batches of training data at a time, not all at once. If you would like to learn how to do this, follow the tutorial on TensorFlow's website.
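For reference, a mini-batch version of the loop might look roughly like this sketch of mine (using the next_batch helper that the MNIST input_data module provides); it is not the exact code from TensorFlow's tutorial:

BATCH_SIZE = 100                                               # hypothetical batch size; tune as you like
for i in range(TRAIN_STEPS+1):
    batch_xs, batch_ys = mnist.train.next_batch(BATCH_SIZE)   # a fresh batch each step
    sess.run(training, feed_dict={x: batch_xs, y_: batch_ys})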

This is my favorite part. Now that we have calculated our weight cheat sheet, we can create a graph with the following code:

for i in range(10):
    plt.subplot(2, 5, i+1)
    weight = sess.run(W)[:,i]
    plt.title(i)
    plt.imshow(weight.reshape([28,28]), cmap=plt.get_cmap('seismic'))
    frame1 = plt.gca()
    frame1.axes.get_xaxis().set_visible(False)
    frame1.axes.get_yaxis().set_visible(False)

Now let’s visualize it:

plt.show()

Figure 8. Source: Justin Francis.

This is a visualization of our weights from 0-9. This is the most important aspect of our classifier. The bulk of the work of machine learning is figuring out what the optimal weights are; once they are calculated, you have the "cheat sheet" and can easily find answers. (This is part of why neural networks can be readily ported to mobile devices; the model, once trained, doesn't take up that much room to store or computing power to calculate.) Our classifier makes its prediction by comparing how similar or different the digit is to the red and blue. I like to think the darker the red, the better of a hit; white as neutral; and blue as misses.


So, now that we have our cheat sheet, let's load one example and apply our classifier to that one example:

x_train, y_train = TRAIN_SIZE(1)

display_digit(0)

Let's look at our predictor y:

answer = sess.run(y, feed_dict={x: x_train})

print(answer)

This gives us a (1x10) matrix with each column containing one probability:

[[  2.12480136e-05   1.16469264e-05   8.96317810e-02   1.92015115e-02
    8.20863759e-04   1.25168199e-05   3.85381973e-05   8.53746116e-01
    6.91888575e-03   2.95970142e-02]]

But this is not very useful for us. So, we use the argmax function to return the position of the highest value, and that gives us our prediction.

answer.argmax()

So, let us now take our knowledge to create a function to make predictions on a random digit in this data set:

def display_compare(num):
    # THIS WILL LOAD ONE TRAINING EXAMPLE
    x_train = mnist.train.images[num,:].reshape(1,784)
    y_train = mnist.train.labels[num,:]
    # THIS GETS OUR LABEL AS AN INTEGER
    label = y_train.argmax()
    # THIS GETS OUR PREDICTION AS AN INTEGER
    prediction = sess.run(y, feed_dict={x: x_train}).argmax()
    plt.title('Prediction: %d  Label: %d' % (prediction, label))
    plt.imshow(x_train.reshape([28,28]), cmap=plt.get_cmap('gray_r'))
    plt.show()

And now try the function out:

display_compare(ran.randint(0, 55000))

Can you find any that guessed incorrectly? If you enter display_compare(2), you will find one digit the classifier got wrong. Why do you think it got it wrong?

This is where this tutorial gets fun: notice what happens to the visualizations of the weights when you use 1-10 training examples. It becomes clear that using too little data makes it very hard to generalize. Here is an animation showing how the weights change as you increase your training size. Can you see what is happening?


Figure 9. Source: Justin Francis.

You can also see the limitations of a linear classifier; at a certain point, feeding it more data doesn't help increase your accuracy drastically. What do you think would happen if we tried to classify a "1" that was drawn on the very left side of the square? It would have a very hard time classifying it because in all of its training examples, the 1 was very close to the center.
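You can probe this yourself with a quick experiment of my own (assuming the trained session and the one-example x_train loaded above are still around): slide a digit toward the left edge with np.roll and see whether the prediction survives:

img = x_train[0].reshape(28, 28)
shifted = np.roll(img, -8, axis=1)   # shift 8 pixels toward the left edge (pixels wrap around)
print(sess.run(y, feed_dict={x: img.reshape(1, 784)}).argmax())      # prediction for the original
print(sess.run(y, feed_dict={x: shifted.reshape(1, 784)}).argmax())  # often wrong after the shift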

I hope this helps make you more appreciative of just how much is going on behind the scenes in MNIST. Keep in mind that this is a neural network with two layers; it's not deep learning. In order to get close to near-perfect accuracy, we have to start thinking convolutionally deep.

If you would prefer to run a more interactive session, here is my GitHub repo with the Jupyter Notebook version of this. I had a lot of fun writing this and learning along the way. Thanks for reading, and most of all, I really hope something new clicked in your brain today.

Article image: Numbers 0-9. (source: Denise Krebs on Flickr).


Justin Francis
Justin lives on the west coast of Canada and works on a small farm focused on permaculture ethics and design. In the past, he was the founder and educator at a non-profit Community Cooperative bicycle shop. For the last two years, he lived on a sailboat exploring and experiencing the Georgia Strait full-time but is now primarily focused on studying machine learning.