Artificial Intelligence - Project - Data Classification
PROJECT : IMAGE CLASSIFICATION
December 20, 2015
Name: Mayank
RUID: 165004149
Name: Ramakanth Vemula
RUID: 167004695
Name: Sanjivi Muttena
RUID: 164005979
PROBLEM 1 - RESULTS OF EXECUTION OF ALGORITHMS
1) Algorithm 1: Naive Bayes
All results were obtained by testing on 1000 test data points for digits.
Table 1: Naive Bayes - Digit - Time
Time Taken for Digit Data (in seconds)
Training Data   Time 1   Time 2   Time 3   Time 4   Time 5   Time 6   Time 7   Mean     SD
500             24.82    24.634   24.629   25.135   24.386   24.937   25.018   24.794   0.241
1000            25.822   25.651   25.739   26.052   25.73    25.882   26.273   25.878   0.200
1500            26.828   26.472   26.671   26.936   27.284   27.168   26.917   26.896   0.256
2000            28.026   27.417   27.821   27.696   27.882   27.948   27.461   27.750   0.218
2500            28.353   27.972   28.113   28.271   27.835   27.589   28.032   28.023   0.240
3000            29.161   28.528   29.037   28.624   28.883   28.698   29.035   28.852   0.222
3500            31.204   30.978   30.472   30.598   31.102   31.092   30.893   30.905   0.253
4000            31.864   31.792   31.316   31.927   30.998   31.734   30.727   31.479   0.436
4500            32.013   32.232   32.516   31.958   32.163   32.427   32.143   32.207   0.189
5000            32.812   32.741   32.583   32.831   32.482   32.912   32.812   32.739   0.141
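The Mean and SD columns in these tables can be reproduced from the raw runs. A minimal sketch (assuming the population standard deviation, which matches the reported values):

```python
import statistics

# Seven timing runs from the 500-sample row of Table 1
times = [24.82, 24.634, 24.629, 25.135, 24.386, 24.937, 25.018]

mean = round(statistics.mean(times), 3)
sd = round(statistics.pstdev(times), 3)  # population SD reproduces the table's 0.241

print(mean, sd)  # → 24.794 0.241
```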
Figure 1: Naive Bayes Digit Time
Table 2: Naive Bayes - Digit - Accuracy
Accuracy for Digit Data (%)
Data Point   Run 1   Run 2   Run 3   Run 4   Run 5   Run 6   Run 7   Mean     SD
500          71.7    71.3    72.1    70.8    71.2    70.9    72.1    71.442   0.495
1000         73.5    73.9    74.1    73.1    74.2    72.4    72.7    73.41    0.65
1500         77.7    75.6    76.4    77.3    75.9    76.2    77.2    76.61    0.73
2000         75.6    77.3    76.8    77.1    75.9    76.3    76.5    76.5     0.573
2500         76.3    77.6    78.2    76.1    76.4    77.7    77.3    77.08    0.75
3000         77.4    78.4    76.9    76.5    77.2    78.1    76.7    77.31    0.65
3500         75.9    77.1    76.3    76.8    75.8    77.2    76.2    76.47    0.52
4000         77.6    77.9    77.4    76      76.9    77.6    77.1    77.21    0.58
4500         77.1    78.3    77.6    78.4    77.3    78.6    76.9    77.74    0.63
5000         77      77      77      77      77      77      77      77       0
Figure 2: Naive Bayes Digit Accuracy
Table 3: Naive Bayes - Face - Time
Time Taken for Face Data (in seconds)
Data Point   Time 1   Time 2   Time 3   Time 4   Time 5   Time 6   Time 7   Mean    SD
500          5.184    5.211    5.176    5.125    5.094    5.182    5.146    5.159   0.0372
1000         5.734    5.761    5.627    5.695    5.814    5.703    5.83     5.737   0.065
1500         5.974    6.041    6.264    6.103    6.029    6.146    6.042    6.085   0.088
2000         6.292    6.462    6.381    6.315    6.402    6.311    6.298    6.351   0.059
2500         6.831    6.743    6.796    6.753    6.693    6.731    6.801    6.764   0.044
3000         7.104    7.098    7.147    7.183    7.112    7.065    7.15     7.12    0.036
3500         7.351    7.401    7.372    7.301    7.391    7.41     7.396    7.374   0.035
4000         7.741    7.769    7.809    7.791    7.81     7.739    7.796    7.77    0.027
4500         8.09     8.171    8.014    7.995    8.151    8.073    8.102    8.085   0.0601
5000         8.351    8.406    8.397    8.358    8.382    8.413    8.319    8.375   0.0314
Figure 3: Naive Bayes Face Time
Table 4: Naive Bayes - Face - Accuracy
Accuracy for Face Data (%)
Data Point   Run 1   Run 2   Run 3   Run 4   Run 5   Run 6   Run 7   Mean    SD
500          59.9    64.8    58.4    59.3    70.1    65.6    61.6    62.81   3.90
1000         79.5    78.4    82.1    76.5    80.8    73.4    81.1    78.82   2.81
1500         84.6    88.1    80.4    82.9    84.1    83      80.9    83.42   2.38
2000         86.5    88.9    87.7    87.2    89.7    88.1    87.5    87.94   0.99
2500         88.6    87.9    90.1    88.9    87.4    89.3    89.8    88.85   0.90
3000         88.4    88.2    89.8    90.3    87.9    88.4    89.5    88.92   0.85
3500         90.1    89.7    89.3    90.3    89.8    90.2    88.9    89.75   0.47
4000         89.4    88.9    89.9    90.7    89.6    90.5    89.6    89.8    0.58
4500         90.2    90.8    89.7    91.1    90.9    89.8    90.3    90.4    0.50
5000         91.8    91.8    91.8    91.8    91.8    91.8    91.8    91.8    0
Figure 4: Naive Bayes Face Accuracy
Table 5: Perceptron - Digit - Time
Time Taken for Digit Data (in seconds)
Digit Training Set Size   Time 1   Time 2   Time 3   Mean      SD
500                       27.3     21.9     26.4     25.2      2.36
1000                      47.3     46.8     46.34    46.81     0.392
1500                      70.36    68.2     67.6     68.72     1.18
2000                      97.8     102.7    105.4    101.966   3.146
2500                      130.1    124.1    125.3    126.5     2.592
3000                      146.2    135.5    141.1    140.93    4.369
3500                      158.7    161.4    166.5    162.2     3.234
4000                      170.8    183.8    189.15   181.25    7.705
4500                      184.8    199.5    220.3    201.53    14.563
5000                      201.4    221.1    215.9    212.8     8.33
Figure 5: Perceptron Digit Time
Table 6: Perceptron - Digit - Accuracy
Accuracy for Digit Data (%)
Digit Training Set Size   Run 1   Run 2   Run 3   Mean     SD
500                       67.1    63.1    71.2    67.133   2.863
1000                      74.5    78.3    72.3    75.033   2.478
1500                      77.2    79.6    73.8    76.866   2.379
2000                      80.1    78.1    76.4    78.2     1.512
2500                      77.3    81.5    79.3    79.36    1.715
3000                      79.1    76.2    80.4    78.566   1.755
3500                      79.4    82.1    77.1    79.533   1.35
4000                      81.1    77.2    80.7    79.66    1.95
4500                      83.2    80.2    79.1    80.833   1.732
5000                      80.1    78.2    81.1    79.8     1.202
Figure 6: Perceptron Digit Accuracy
Table 7: Perceptron - Face - Time
Time Taken for Face Data (in seconds)
Face Training Set Size   Time 1   Time 2   Time 3   Mean     SD
500                      2.4      2.1      2.3      2.266    0.124
1000                     4.5      4.9      4.8      4.733    0.169
1500                     7.7      7.6      7.4      7.56     0.124
2000                     9.7      9.8      10.1     9.866    0.169
2500                     12.4     12.2     12.7     12.433   0.205
3000                     15.1     14.9     15.2     15.066   0.124
3500                     17.3     17.6     17.5     17.466   0.124
4000                     20.1     20.2     19.2     19.83    0.449
4500                     22.7     22.6     22.4     22.566   0.125
5000                     25.6     25.4     25.8     25.6     0.163
Figure 7: Perceptron Face Time
Table 8: Perceptron - Face - Accuracy
Accuracy for Face Data (%)
Face Training Set Size   Run 1   Run 2   Run 3   Mean     SD
500                      67.1    70.4    73      70.16    2.091
1000                     74.4    75.1    72.1    73.866   1.281
1500                     77.4    79.6    76      77.666   1.48
2000                     80.1    78.4    81.2    79.9     1.152
2500                     82.3    81.2    80      81.16    0.939
3000                     84.2    81.9    84.3    83.1     1.108
3500                     84.4    85.7    83.5    84.6     0.90
4000                     86.4    89.1    87.4    88.25    1.35
4500                     87.2    88.1    88.5    87.93    0.54
5000                     90.1    89.1    88.9    89.366   0.525
Figure 8: Perceptron Face Accuracy
Table 9: Mira - Digit - Time
Time Taken for Digit Data (in seconds)
Digit Training Set Size   Time 1   Time 2   Time 3   Mean      SD
500                       24.3     23.6     24.9     24.26     0.531
1000                      46.1     47.4     48.2     47.233    0.865
1500                      67.2     69.8     68.9     68.633    1.078
2000                      91.7     93.2     94.3     93.06     1.065
2500                      109.4    114.8    115.8    113.33    2.81
3000                      138.5    144.7    141.9    141.7     2.53
3500                      152.6    158.1    162.2    157.63    3.93
4000                      183.2    186.3    178.1    182.533   3.381
4500                      203.8    193.5    199      198.767   4.208
5000                      227.3    220.5    233.1    226.966   5.149
Figure 9: Mira Digit Time
Table 10: Mira - Digit - Accuracy
Accuracy for Digit Data (%)
Digit Training Set Size   Run 1   Run 2   Run 3   Mean     SD
500                       72.3    76.4    73.1    73.93    1.536
1000                      74.1    76.2    77.5    75.93    1.40
1500                      75.1    76.4    73.8    75.1     1.061
2000                      82.3    80      78.1    80.13    1.717
2500                      79.4    80.3    81.2    80.3     0.734
3000                      80.1    78.9    79.9    79.633   0.524
3500                      82      80.4    81.3    81.23    0.8
4000                      80.1    79.5    78.8    79.46    0.3
4500                      81      80.2    82.1    81.1     0.778
5000                      79.2    80.4    80.8    80.13    0.679
Figure 10: Mira Digit Accuracy
Table 11: Mira - Face - Time
Time Taken for Face Data (in seconds)
Face Training Set Size   Time 1   Time 2   Time 3   Mean     SD
500                      5.732    5.845    5.912    5.829    0.074
1000                     8.325    8.826    8.967    8.706    0.275
1500                     11.458   12.241   11.251   11.65    0.426
2000                     13.696   13.176   14.218   13.696   0.425
2500                     16.885   16.903   17.48    17.089   0.276
3000                     18.74    18.539   18.436   18.571   0.126
3500                     20.561   20.704   20.019   20.428   0.295
4000                     22.498   22.014   22.328   22.28    0.200
4500                     24.893   24.481   23.947   24.440   0.387
5000                     28.115   28.947   28.571   28.544   0.340
Figure 11: Mira Face Time
Table 12: Mira - Face - Accuracy
Accuracy for Face Data (%)
Face Training Set Size   Run 1   Run 2   Run 3   Mean     SD
500                      67.8    66.2    69.3    67.766   1.26
1000                     72.5    74.8    75.9    74.4     1.416
1500                     76.1    77.8    77.1    77       0.697
2000                     79.1    78.3    77.9    78.433   0.498
2500                     81.5    81.8    82.6    81.96    0.464
3000                     84.1    83.7    83.9    83.9     0.163
3500                     86.4    86.5    85.9    86.266   0.262
4000                     86.9    87.1    87      87       0.0816
4500                     89.1    88.7    90.1    89.3     0.588
5000                     88.9    91.6    88.5    89.66    1.376
Figure 12: Mira Face Accuracy
PROBLEM 3 - DISCUSSION OF ALGORITHMS AND RESULTS
(A) Naive Bayes:
A naive Bayes classifier models a joint distribution over a label Y and a set of observed
random variables, or features, {F1,F2, . . .Fn}, using the assumption that the full joint
distribution can be factored as follows (features are conditionally independent given
the label):
P(F_1, \ldots, F_n, Y) = P(Y) \prod_i P(F_i \mid Y)
To classify a datum, we can find the most probable label given the feature values for each pixel, using Bayes' theorem:

P(y \mid f_1, \ldots, f_n) = \frac{P(f_1, \ldots, f_n \mid y)\, P(y)}{P(f_1, \ldots, f_n)} = \frac{P(y) \prod_{i=1}^{m} P(f_i \mid y)}{P(f_1, \ldots, f_n)}
\arg\max_y P(y \mid f_1, \ldots, f_n) = \arg\max_y \frac{P(y) \prod_{i=1}^{m} P(f_i \mid y)}{P(f_1, \ldots, f_n)} = \arg\max_y P(y) \prod_{i=1}^{m} P(f_i \mid y)
Because multiplying many probabilities together often results in underflow, we will
instead compute log probabilities which have the same argmax.
\arg\max_y \log\left(P(y) \prod_{i=1}^{m} P(f_i \mid y)\right) = \arg\max_y \log P(y, f_1, \ldots, f_n) = \arg\max_y \left(\log P(y) + \sum_{i=1}^{m} \log P(f_i \mid y)\right)
We use math.log(), a built-in Python function, to compute the logarithms.
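The log-space decision rule above can be sketched as follows. The dictionary layouts here (priors[y] for P(y), cond_probs[(pixel, y)] for P(F_i = 1 | y)) are illustrative assumptions, not the project's actual data structures:

```python
import math

def classify(datum_features, priors, cond_probs, labels):
    """Pick the label maximizing log P(y) + sum_i log P(f_i | y).

    Assumed layout: priors[y] = P(y); cond_probs[(pixel, y)] = P(F_i = 1 | y);
    datum_features maps pixel locations to 0/1 indicators.
    """
    best_label, best_log_prob = None, float("-inf")
    for y in labels:
        log_prob = math.log(priors[y])
        for pixel, value in datum_features.items():
            p_on = cond_probs[(pixel, y)]
            # Use P(F_i = 1 | y) if the pixel is on, else P(F_i = 0 | y) = 1 - P(F_i = 1 | y)
            log_prob += math.log(p_on if value == 1 else 1.0 - p_on)
        if log_prob > best_log_prob:
            best_label, best_log_prob = y, log_prob
    return best_label
```

Summing logarithms instead of multiplying probabilities avoids the underflow mentioned above while preserving the argmax.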
Parameter Estimation:
Our naive Bayes model has several parameters to estimate. One parameter is the prior distribution over labels (digits, or face/not-face), P(Y). We can estimate P(Y) directly from the training data:
\hat{P}(y) = \frac{c(y)}{n}
where c(y) is the number of training instances with label y and n is the total number of training instances. The other parameters to estimate are the conditional probabilities of our features given each label y, P(F_i \mid Y = y). We do this for each possible feature value (f_i \in \{0, 1\}):

\hat{P}(F_i = f_i \mid Y = y) = \frac{c(f_i, y)}{\sum_{f_i} c(f_i, y)}
where c(f_i, y) is the number of times pixel F_i took value f_i in the training examples of label y.
Smoothing: The empirical estimates for the parameters P(f_i \mid y) are unsmoothed, and such estimates are rarely adequate in real systems. Minimally, we need to make sure that no parameter ever receives an estimate of zero, but good smoothing can boost accuracy quite a bit by reducing overfitting. In this project, we use Laplace smoothing, which adds k counts to every possible observation value:

P(F_i = f_i \mid Y = y) = \frac{c(F_i = f_i, Y = y) + k}{\sum_{f_i} \left(c(F_i = f_i, Y = y) + k\right)}
If k = 0, the probabilities are unsmoothed. As k grows larger, the probabilities are smoothed more and more. We use the validation set to determine a good value for k.
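The smoothed estimates can be computed in one pass over the training data. A minimal sketch (the (features, label) pair format and the function name are assumptions for illustration):

```python
from collections import Counter

def estimate_parameters(training_data, k=1):
    """Estimate P(y) and Laplace-smoothed P(F_i = 1 | y).

    Assumed format: each datum is (features, label), where features maps
    pixel locations to 0/1 indicators.
    """
    label_counts = Counter(label for _, label in training_data)
    n = len(training_data)
    priors = {y: c / n for y, c in label_counts.items()}  # P(y) = c(y) / n

    on_counts = Counter()  # c(F_i = 1, y)
    pixels = set()
    for features, y in training_data:
        for pixel, value in features.items():
            pixels.add(pixel)
            if value == 1:
                on_counts[(pixel, y)] += 1

    # Laplace smoothing: add k to the count of each of the two feature values,
    # so the denominator gains 2k for binary features.
    cond_probs = {
        (pixel, y): (on_counts[(pixel, y)] + k) / (label_counts[y] + 2 * k)
        for y in label_counts
        for pixel in pixels
    }
    return priors, cond_probs
```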
Conclusion: As we increase the training data from 10% to 100%, the accuracy jumps from a mere 62.8% to a reasonable 91% in face recognition, and from 71% to 77% in digit recognition. The trade-off is time: more data means more processing. From the data collected, we see a difference of 3 seconds in training on face data and 8 seconds on digit data.
We also see the gain in accuracy taper off after some point. In our project data there is not much difference in accuracy between 50% training data and 100% training data, so we could improve training time by using 50% of the data instead of training on 100%.
(B) Perceptron:
The perceptron algorithm uses a weight vector to make decisions, unlike Naive Bayes, which uses probabilities. The weight vector is represented by w_y for each class y. Given a feature vector f (in our case, a map from pixel locations to indicators of whether they are on), the perceptron computes the class y whose weight vector is most similar to the input vector f, scoring each class with:
\mathrm{score}(f, y) = \sum_i f_i \, w_i^y
The class with the highest score is chosen as the predicted label for that data instance.
Before classifying, the weights have to be learnt from the training set. For this, the training set is scanned one instance at a time. When we come to an instance (f, y), we find the label with the highest score:

y' = \arg\max_{y''} \mathrm{score}(f, y'')
We compare y', the result obtained from the previous equation, to the true label y. If the two labels are equal (y' = y), then we have classified the instance correctly and can proceed to the next item in the training set. Otherwise, we wrongly guessed y' in place of y. That means w_y should have scored f higher and w_{y'} should have scored f lower. To prevent this error in the future, we update these two weight vectors accordingly:

w_y += f
w_{y'} -= f
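The training loop described above can be sketched as follows (the (features, label) data format and the function name are illustrative assumptions):

```python
def train_perceptron(training_data, labels, iterations=3):
    """One weight vector per label; wrong guesses shift weight toward the true label.

    Assumed format: training_data is a list of (features, label) pairs, where
    features maps pixel locations to 0/1 indicators.
    """
    weights = {y: {} for y in labels}  # sparse weight vectors, missing entries are 0

    def score(features, y):
        return sum(v * weights[y].get(pixel, 0) for pixel, v in features.items())

    for _ in range(iterations):
        for features, y in training_data:
            guess = max(labels, key=lambda yy: score(features, yy))
            if guess != y:
                for pixel, v in features.items():
                    weights[y][pixel] = weights[y].get(pixel, 0) + v          # w_y += f
                    weights[guess][pixel] = weights[guess].get(pixel, 0) - v  # w_y' -= f
    return weights
```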
Conclusion: We ran our algorithm for 3 iterations. We see a huge difference in accuracy, from 67% to 90% (digits) and 70% to 90% (faces). This happens because the perceptron iterates repeatedly through the training data, which also leads to higher training time. The perceptron makes weight corrections and starts to improve after each iteration, which is also the reason for the significant gain in accuracy.
(C) Mira:
Similar to a perceptron classifier, the MIRA classifier also keeps a weight vector w_y for each label y. Here too we scan over the data, one instance at a time. When we come to an instance (f, y), we find the label with the highest score:

y' = \arg\max_{y''} \mathrm{score}(f, y'')
We compare y' to the true label y. If the labels are equal (y' = y), we have classified the instance correctly and do nothing. Otherwise, we guessed y' but should have guessed y. The difference between MIRA and the perceptron is that MIRA updates the weight vectors of these labels with a variable step size:

w_y = w_y + \tau f
w_{y'} = w_{y'} - \tau f
Here \tau > 0 and is chosen such that it minimizes

\min_{w'} \frac{1}{2} \sum_c \lVert (w')_c - w_c \rVert_2^2

subject to the condition that

(w')_y \cdot f \ge (w')_{y'} \cdot f + 1

This is equivalent to

\min_\tau \lVert \tau f \rVert_2^2 \quad \text{subject to} \quad \tau \ge \frac{(w_{y'} - w_y) \cdot f + 1}{2 \lVert f \rVert_2^2} \ \text{and} \ \tau \ge 0

We can notice that w_{y'} \cdot f \ge w_y \cdot f, so the condition \tau \ge 0 is always true given \tau \ge \frac{(w_{y'} - w_y) \cdot f + 1}{2 \lVert f \rVert_2^2}.
Solving the problem, we get

\tau = \frac{(w_{y'} - w_y) \cdot f + 1}{2 \lVert f \rVert_2^2}

We cap the maximum possible value of \tau by a positive constant C:

\tau = \min\left\{ C, \ \frac{(w_{y'} - w_y) \cdot f + 1}{2 \lVert f \rVert_2^2} \right\}

Conclusion: The accuracy increased from 67% to approximately 90% as we increased the
training data. Similar to the perceptron, the accuracy improves with the iterations, because the weights get updated. Training time is high compared to Naive Bayes but similar to the perceptron.
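The capped variable-step update derived above can be sketched as follows (the function signature and sparse-dict weight layout are illustrative assumptions; C is a tuning constant chosen on validation data):

```python
def mira_update(weights, features, true_y, guessed_y, C):
    """MIRA step: tau = min(C, ((w_y' - w_y) . f + 1) / (2 ||f||^2)).

    Assumed layout: weights[y] maps pixel -> weight, features maps pixel -> 0/1.
    """
    if guessed_y == true_y:
        return  # correct guess: no update

    # (w_y' - w_y) . f
    dot = sum(v * (weights[guessed_y].get(p, 0) - weights[true_y].get(p, 0))
              for p, v in features.items())
    norm_sq = sum(v * v for v in features.values())  # ||f||^2

    tau = min(C, (dot + 1.0) / (2.0 * norm_sq))
    for p, v in features.items():
        weights[true_y][p] = weights[true_y].get(p, 0) + tau * v        # w_y += tau f
        weights[guessed_y][p] = weights[guessed_y].get(p, 0) - tau * v  # w_y' -= tau f
```

Unlike the perceptron's fixed ±f update, the step size shrinks for large feature vectors and is capped at C, which is what makes MIRA's corrections less aggressive on noisy instances.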