Plato’s Objects - A TELLING EXPERIMENT IN HANDWRITTEN CHARACTER RECOGNITION

WALID SABA
June 22, 2014. Copyright © 2014, Walid Saba

"All our knowledge begins with the senses, proceeds then to the understanding, and ends with reason. There is nothing higher than reason." (Immanuel Kant)

In this short article we report on an experiment in handwritten character recognition. Unlike statistical and other purely quantitative approaches, such as neural networks (deep or otherwise!), our experiment, which achieved over 97% accuracy on handwritten digits and over 96% accuracy on handwritten English characters, uses basic similarity measures and was inspired by the idea of Plato's Forms. Prompted by the high accuracy of our simplistic approach (a few hundred lines of code!), we conclude with a few comments on the predominant machine learning approaches, which are based solely on quantitative (data) analysis and on finding patterns in enormous numbers of training instances.

The Basic Idea

Plato's theory of Forms, which undoubtedly has its critics, is quite involved and is certainly not the subject of this article. However, if we ignore the big metaphysical questions, the basic idea behind the theory of Forms is quite simple. Plato suggests that all objects (physical or abstract) are actually imperfect instances of some ideal and perfect concept. For example, there can be no perfect box in actual existence that satisfies all the properties of 'boxhood' (or all the properties of a perfect box): each instance, even if only slightly, must violate some property. These ideals (or perfect archetypes) are what Plato calls 'the Forms'.

This basic idea of an ideal concept that has an infinite number of imperfect instances can perhaps be successfully employed in handwritten character recognition. For example, if an 'ideal' character for a handwritten 'p' can be attained, then one would expect this ideal to match well with most (if not all) imperfect instances of 'p' (one should also expect the match between all imperfect instances of 'p' and the ideal 'p' to be much better than their match with any other ideal). But what does that ideal handwritten 'p' look like? Well, reverse-engineering might be one way to create that ideal (or set of ideals).


We could, for example, create that sought-after ideal by some (composition) function of a number of actual instances, and there are a number of composition functions one can think of to achieve this goal. Viewing images as sets (sets of rows, where each row is a set of numbers), one can compose images using a combination of set-theoretic functions (such as union, intersection, and difference) as well as a range of number-theoretic functions (such as the arithmetic or geometric mean, or the fuzzy min and max operations). Having created an ideal for every object, test examples are then matched against these ideals using basic similarity functions. Surprisingly, this very simple approach resulted in very high recognition rates.

Below we describe the experiment we conducted in some detail, concentrating on handwritten (decimal) digits.

The Experiment

A STRAIGHTFORWARD COMPOSITION FUNCTION TO CREATE IDEALS

In the first experiment, namely that of handwritten digit recognition, we used the MNIST database of handwritten digits provided by LeCun et al. (available here). The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples, although we only needed a small portion of the training set: 1,000 examples of each digit, for a total of 10,000 examples.

The initial experiment was quite simple: an ideal for each digit was created using a composition function over 1,000 examples of that digit. The initial composition function was the straightforward average (mean). Taking each instance in the example set as a vector of integers (in the range 0 to 255, corresponding to the intensities of the pixels in the image), the function composes all 1,000 instances of a digit into an ideal by computing the value of the (ideal) pixel at (i, j) as the average of the values at (i, j) across all 1,000 examples. Some pruning can also be applied by setting a threshold to remove noise: for example, if the average value at a pixel is below a certain threshold, one can white it out, as it is probably a value produced by accumulated noise. This did improve things slightly, but was not in the end a critical factor. Shown in figure 1 below are the ideals of the digits 0 through 9, as well as the ideals of 4 and 6, along with some samples of the instances used to create them.

Fig. 1. Plato's digits, and a sample of instances of 4 and 6 with their corresponding (composed) ideals.
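
To make the procedure concrete, here is a minimal sketch of the mean composition with noise pruning. This is not the authors' code; the 28x28 image shape is the MNIST standard, but the threshold value and the NumPy formulation are assumptions:

    import numpy as np

    def mean_ideal(instances, noise_threshold=32):
        # instances: array of shape (n, 28, 28) holding n examples of one
        # digit, with pixel intensities 0..255 (28x28 is the MNIST size)
        ideal = instances.astype(float).mean(axis=0)  # pixel-wise mean at each (i, j)
        ideal[ideal < noise_threshold] = 0            # "white-out" accumulated noise;
                                                      # the threshold value is a guess
        return ideal

    # e.g., ideal_6 = mean_ideal(examples_of_6)  # examples_of_6: (1000, 28, 28)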

After creating 10 ideals, corresponding to the ten digits 0 through 9, 30,000 examples were later matched against these ideals using a simple cosine similarity measure (which is derived from the Euclidean dot product and the Euclidean distance). This simple procedure of creating an average ideal and matching unseen examples using cosine similarity resulted in 82% recognition accuracy.
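
Assuming each image is flattened to a vector of pixel intensities, the matching step amounts to picking the ideal with the highest cosine similarity; a sketch (the function and variable names are mine, not the article's):

    import numpy as np

    def cosine_similarity(a, b):
        # cos(a, b) = (a . b) / (|a||b|), computed on flattened pixel vectors
        a, b = a.ravel().astype(float), b.ravel().astype(float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def classify(image, ideals):
        # ideals: dict mapping each digit 0..9 to its composed ideal image;
        # the prediction is the digit whose ideal matches best
        return max(ideals, key=lambda d: cosine_similarity(image, ideals[d]))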

Although 82% is not a very high accuracy rate, this result was actually encouraging considering the very simple procedure. It in fact prompted us to extend the concept of ideals a bit further.

MORE THAN ONE IDEAL?

Creating an ideal digit using the mean is straightforward and quite intuitive. However, there are other possible composition functions that can be employed. For example, the final pixel value at (i, j) can be computed using the geometric mean of all corresponding (i, j) values in the 1,000 examples. Other possible composition functions are the minimum, the maximum, the product, or a combination of these (for example, a minimum of all maximums, or vice versa). It turns out that different composition functions captured different aspects of an ideal: using the product (and the minimum) in computing an ideal for the digit 6, for example, produced an ideal that has just the bare minimum of what might barely look like a 6. This is useful, since it will force other instances to match well with that absolute minimum. In the final analysis, however, it turned out that using a number of ideals was optimal.
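
By way of illustration, here is one hedged realization of several of the composition functions named above, with pixel values assumed scaled to [0, 1]; the exact formulations used in the experiment are not specified, so these are guesses:

    import numpy as np

    def geo_mean(x):
        # geometric mean per pixel; the small shift avoids log(0)
        return np.exp(np.log(x + 1e-6).mean(axis=0))

    def min_of_maxes(x, groups=4):
        # split the instances into groups, take the max within each group,
        # then the min across groups ("a minimum of all maximums")
        return np.min([g.max(axis=0) for g in np.array_split(x, groups)], axis=0)

    # each function maps a stack of instances (n, H, W) to one ideal (H, W)
    compositions = {
        "mean":         lambda x: x.mean(axis=0),
        "geo_mean":     geo_mean,
        "minimum":      lambda x: x.min(axis=0),
        "maximum":      lambda x: x.max(axis=0),
        "product":      lambda x: np.prod(x, axis=0),  # only meaningful over a few instances
        "min_of_maxes": min_of_maxes,
    }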

Figure 2. Results (in red) of applying six composition functions on 52 instances of the letter 'p'. Example instances are shown in black.

In figure 2 we show the result of applying six different composition functions on the letter 'p', along with some sample instances (the set of English characters was obtained from the UC Irvine Machine Learning Repository, available here). It is worth mentioning that an interesting pattern was found during the composition of instances into an ideal character: at some point, including more instances in the composition function did not change the resulting ideal by much. This might of course be an accident of the instances involved, which, in our experiment, were chosen randomly. Nevertheless, we think the apparent quick convergence to an ideal after a few instances is rather interesting, and we will briefly return to this issue in our concluding remarks.
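
One way to quantify that convergence (my formulation, not the article's) is to build ideals from progressively larger prefixes of the instance set and track how much each additional batch changes the result:

    import numpy as np

    def convergence_curve(instances, composition, step=10):
        # mean absolute change between successive ideals as more instances
        # are included; a curve that flattens near zero after a few steps
        # is the "quick convergence" pattern described above
        ideals = [composition(instances[:k])
                  for k in range(step, len(instances) + 1, step)]
        return [float(np.abs(b - a).mean()) for a, b in zip(ideals, ideals[1:])]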

With a number of ideals for each digit (or English character), we can now score example instances more accurately, by scoring an instance against all ideals of all the digits. The intuition here is as follows: a digit that matches (most of) the ideals of 6, for example, and does so better than it matches the ideals of any other digit, is quite likely to be a 6. Using only 4 different ideals for each digit (and character), and matching unseen examples using two common similarity functions (cosine similarity and the Pearson correlation coefficient), recognition accuracies of 98% (and 97%, respectively) were obtained.
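
A sketch of that multi-ideal scoring, with the Pearson correlation coefficient as the second similarity function (aggregating by summing the similarities is my assumption; the article only says instances are scored against all ideals):

    import numpy as np

    def pearson(a, b):
        # Pearson correlation coefficient on flattened pixel vectors
        return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

    def classify_multi(image, ideals_by_digit, similarity):
        # ideals_by_digit: dict mapping each digit to its list of ideals
        # (e.g., 4 per digit); the prediction is the digit whose ideals
        # jointly match the image best
        return max(ideals_by_digit,
                   key=lambda d: sum(similarity(image, ideal)
                                     for ideal in ideals_by_digit[d]))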

Can this Paradigm Scale to Other Images?

Encouraged by the results obtained in handwritten digit and character recognition, we thought of applying this very simple approach to images of physical objects. Ideals in this case must of course take into consideration some important attributes that real-life physical objects have, such as orientation, size, color, etc. Some of these attributes can be handled by the algorithm itself. For example, color can be abstracted away by creating ideals from, and then matching on, the grayscale of the actual images. Size can also be handled by the algorithm itself, by scaling images to a pre-defined size. Ideals for different orientations must, however, be created. Fortunately, there are only a few orientations for each physical object. In figure 3 below we show some ideals created using the arithmetic mean, the geometric mean, and a fuzzy intersection composition function. In the latter, equality was tested using a threshold: pixel values within 16 units of each other were assumed to be equal (16 is roughly the square root of the highest possible pixel value, which is 255).
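
One possible reading of that fuzzy intersection, sketched below; the 16-unit tolerance is from the text, but the blend rule (keep the mean where all instances agree, blank the pixel otherwise) is my assumption:

    import numpy as np

    def fuzzy_intersection(instances, tol=16):
        # keep a pixel only where all instances "agree" to within tol units
        spread = instances.max(axis=0) - instances.min(axis=0)
        ideal = instances.astype(float).mean(axis=0)
        ideal[spread > tol] = 0  # the instances disagree here: blank the pixel
        return ideal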

Figure 3. Examples of six ideals of bus images created using 20 instances.

Shown in figure 3 are six ideals of bus images created from 20 instances (the data set, which contains 1,000 images, is provided by Professor Wang at Penn State University, and is available here). When matching unseen images, an idealized image of each unseen example is also created; ideals for unseen examples were created by performing small transformations and rotations. Creating an ideal of an unseen example before matching effectively amounts to trying to view a new and unseen image from slightly different angles and vantage points. In the end, each instance's ideal was matched with the six ideals of every object and an overall score was computed. Again, this simple approach resulted in over 90% precision, although recall was at only 70%.
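
A sketch of how such an instance ideal might be composed from small perturbations (the specific angles and offsets are placeholders of mine):

    import numpy as np
    from scipy.ndimage import rotate, shift

    def instance_ideal(image, angles=(-10, -5, 0, 5, 10),
                       offsets=((0, 0), (2, 0), (0, 2), (-2, 0), (0, -2))):
        # compose an ideal for an unseen image from small rotations and
        # translations of it, approximating "slightly different angles
        # and vantage points"
        variants = [shift(rotate(image, a, reshape=False), o)
                    for a in angles for o in offsets]
        return np.mean(variants, axis=0)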

End of the Experiment (and Beginning of the Debate)

Our experiment ended here, mainly because we (at klangoo) are not really in the business of image recognition. This experiment was actually triggered by the barrage of articles in the mainstream media proclaiming that intelligent machines are about to creep up in every corner and that some new and amazing technology is about to create true artificial intelligence. Apparently, this hype was due to the successful application of a new technique (deep learning, which is effectively stacked and iterative learning by neural networks), where, using thousands of processors and after training on millions of images, good results were apparently obtained in image (cat!) recognition.

Paradoxically, however, while these techniques are purportedly inspired by biological models, a look at the details suggests that they ignore many biological facts. For one thing, it seems biologically implausible that one would need millions of training examples to recognize cats (or any other object). Surely, children know what a cat looks like after seeing a handful of examples, and none of us waited to see hundreds of thousands of handwritten 'p's to start correctly recognizing and writing a 'p'.

The need to process millions of instances in training a neural network is thus in complete disagreement with biological data, which seems to suggest that children acquire concepts and learn the meaning of new words after a few examples, sometimes even just a couple, or in some instances just one! For example, recent experiments have suggested that, after hearing a phrase such as 'John ate the klangoo', children pointed to a food item when asked to point to a 'klangoo' in a set of images that included one object that can clearly be identified as food. Apparently, from a single training example, children seem to acquire the meaning of a word they never encountered before using some form of semantic constraints (or selectional restrictions), namely that the object of a verb such as 'eat' must be some kind of food. Semantic constraints are not only used in understanding the meaning of new words, but also in image (object) recognition, as reported by a recent study.

Another biological implausibility of these models is due to the fact that many species that have a very small fraction of the number of neurons that humans have (mice, ants, etc.) are born with almost immediate sound and image recognition abilities. The enormous amount of computational power these models seem to need to learn a simple concept therefore also makes them suspect, to say the least.

The reason these models require the processing of an enormous amount of data to learn a single concept is that these (purely quantitative) models operate only on data, and therefore they can never be the right paradigm for high-level reasoning, where computing over information structures and knowledge, and not just data, is required. Failing to recognize the difference between data, information, and knowledge is fatal in artificial (and natural) intelligence.

Data vs. Information (vs. Knowledge)

So why will operating on data only, and failing to recognize the difference between data, information, and knowledge, simply not scale into anything called intelligence? This is a potentially quite involved subject and can get quite technical, but we will attempt to answer the question here using an informal and simple argument.

In grade school we were told (amongst other things) that √256 = 16. But is this correct? Let's see what taking this equality for granted would result in:

A: √256 = 16
B: My 5-year-old daughter knows that 7 + 9 = 16

A is a fact, by virtue of what my grade school teacher told me. B is also a fact (and it is true because it is!): my 5-year-old daughter actually does know that adding 7 to 9 amounts to 16.

But if A is true, we should be able to replace 16 in B by a value that is equal to it, including √256. Doing this results in the following, which is not true:

C: My 5-year-old daughter knows that 7 + 9 = √256

What went wrong? How could we conclude from two facts (two true statements) something that is not true? Were we taught the wrong things in grade school? Well, not exactly, but we were not told the whole story. While √256 is equal to 16, these two objects are equal in one aspect (one attribute) only, namely their value; that is, they are both instances of the concept 16. But as objects they are in general distinct, as they differ in many other respects, and equating them in all contexts is fatal (as we saw, it may lead to a contradiction, or it may lead to invalid inferences/conclusions).

So, as it turns out, there is another and more important equality that we were not told about in grade school (in any case, I doubt my algebra teacher knew about the other equality anyway). But now that we are grown-ups doing artificial intelligence (and talking deep stuff!) we should be well aware of that other equality. In mathematical logic that equality is known as intensional equality (as opposed to extensional equality, the one we all know), and it is at the heart of the difference between data and information (or, equivalently, between objects and concepts), and it is related to abstraction, the fundamental mechanism behind learning.¹
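
The substitution failure above can be stated compactly. A minimal formalization in my own notation, writing K_d(φ) for "my daughter knows that φ" and using the √256 example:

    \[
    \sqrt{256} = 16
    \;\wedge\;
    K_d(7 + 9 = 16)
    \;\not\Rightarrow\;
    K_d\big(7 + 9 = \sqrt{256}\big)
    \]

Substituting equals for equals is valid in extensional contexts, but "knows that" creates an intensional context, in which only intensionally equal expressions may be interchanged.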

Figure 4. Objects vs. Concepts (equal as concepts, distinct as objects)

As objects, the instances in 4a and 4b are all distinct, in that they differ in the values of a number of their attributes (age, color, size, etc.). However, the instances in 4a are the same (equal!) in another respect, namely that they are dogs (they are all instances of the concept of a dog). This (other) equality was reached by abstracting from data to information (we ignored the specific values of a number of attributes). Abstraction of course occurs at various levels. For example, abstracting now over information makes all the objects in 4a and 4b also equal in one respect, namely that they are all animals (as opposed to trees, or rocks, etc.).

Abstracting over data (into information) is still not enough for high-level reasoning; one is also required to represent and reason with knowledge. Arguing for this third level cannot be done here. However, to appreciate the implication of this point I would simply pose this question: how can an image recognition system that correctly identified a certain image to be the image of a dog infer whether the dog in the image is alive or dead? Surely data (and information) alone are not enough, and one must resort to some rules (knowledge), e.g., a dog that seems to be running is not a dead dog!

¹ INtensional equality is concerned with looking INternally at the details of the objects and their attribute values, while EXtensional equality is concerned with the EXternal properties, ignoring the details. Thus, while your cat and mine are distinct objects, looking at them externally (from the outside), they are both cats. Similarly, looking at your cat and mine as well as some cheetah in Nairobi, they are all felines, etc.

Over two hundred years ago, Immanuel Kant, largely considered to be one of the most influential thinkers of modern philosophy (logic, metaphysics, etc.), suggested the following: "All our knowledge begins with the senses, proceeds then to the understanding, and ends with reason. There is nothing higher than reason." Roughly speaking, data corresponds to the senses (that is what our senses do, they collect data!), understanding corresponds to information (which is obtained by imposing a meaningful structure on data), and reason corresponds to knowledge. Thus, claiming that we are about to have artificial intelligence by operating on the lowest (the first) level only, and succeeding (as well as, if not worse than, ants and mice!) in recognizing images and sounds, is surely absurd, and stubbornly sticking to this paradigm will only result in wasting a lot of time and resources.

I conclude with a plea: before we herald the coming of artificial intelligence because of successfully processing large amounts of data, it would be wise to revisit the foundational science that was developed by some of the most brilliant minds in logic, linguistics, metaphysics, philosophy, psychology and mathematics. The over-zealous and apparently misguided journalists can wait a bit before they write headlines that are several years ahead of their time.

Walid Saba

Currently the CIO of klangoo, Walid Saba has over 20 years of experience in information technology, holding various positions at such places as the American Institutes for Research, AT&T Bell Labs, MetLife, Nortel Networks, IBM and Cognos. He has also spent 7 years in academia, where he taught computer science at the University of Ottawa, the New Jersey Institute of Technology, the University of Windsor, the American University of Beirut, and the American University of Technology. He has published over 35 technical articles, including an award-winning paper that was presented in Germany at KI-2008. Walid holds an MSc in Computer Science from the University of Windsor and a PhD in Computer Science from Carleton University.