Plato’s Objects - A TELLING EXPERIMENT IN HANDWRITTEN CHARACTER RECOGNITION

WALID SABA
June 22, 2014. Copyright © 2014, Walid Saba

"All our knowledge begins with the senses, proceeds then to the understanding, and ends with reason. There is nothing higher than reason." (Immanuel Kant)

In this short article we report on an experiment in handwritten character recognition. Unlike statistical and other purely quantitative approaches, such as neural networks (deep or otherwise!), our experiment, which achieved over 97% accuracy on handwritten digits and over 96% accuracy on handwritten English characters, uses basic similarity measures and was inspired by the idea of Plato's Forms. Prompted by the high accuracy of our simplistic approach (a few hundred lines of code!), we conclude with a few comments on the predominant machine learning approaches, which are based solely on quantitative (data) analysis and on finding patterns in enormous numbers of training instances.

The Basic Idea

Plato's theory of Forms, which undoubtedly has its critics, is quite involved and is certainly not the subject of this article. However, if we ignore the big metaphysical questions, the basic idea behind the theory of Forms is quite simple. Plato suggests that all objects (physical or abstract) are actually imperfect instances of some ideal and perfect concept. For example, there can be no perfect box in actual existence that satisfies all the properties of 'boxhood' (or all the properties of a perfect box): each instance, even if only slightly, must violate some property. These ideals (or perfect archetypes) are what Plato calls 'the Forms'.

This basic idea of an ideal concept that has an infinite number of imperfect instances can perhaps be successfully employed in handwritten character recognition. For example, if an 'ideal' character for a handwritten 'p' can be attained, then one would expect this ideal to match well with most (if not all) imperfect instances of 'p' (one should also expect the match between all imperfect instances of 'p' and the ideal 'p' to be much better than their match with any other ideal). But what does that ideal handwritten 'p' look like? Well, reverse-engineering might be one way to create that ideal (or set of ideals).


We could, for example, create that sought-after ideal by some (composition) function of a number of actual instances, and there are a number of composition functions one can think of to achieve this goal. Viewing images as sets (sets of rows, where each row is a set of numbers), one can compose images using a combination of set-theoretic functions (such as union, intersection, and difference) as well as a range of number-theoretic functions (such as the arithmetic or geometric mean, or the fuzzy min and max operations). Having created an ideal for every object, test examples are then matched against these ideals using basic similarity functions. Surprisingly, this very simple approach resulted in very high recognition rates.

Below we describe the experiment we conducted in some detail, concentrating on handwritten (decimal) digits.

The Experiment

A STRAIGHTFORWARD COMPOSITION FUNCTION TO CREATE IDEALS

In the first experiment, namely that of handwritten digit recognition, we used the MNIST database of handwritten digits provided by LeCun et al. (available here). The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples, although we only needed a small portion of the training set: 1,000 examples of each digit, for a total of 10,000 examples.

The initial experiment was quite simple: an ideal for each digit was created using a composition function over 1,000 examples of that digit. The initial composition function was the straightforward average (mean). Taking each instance in the example set as a vector of integers (in the range 0 to 255, corresponding to the intensities of the pixels in the image), the function composes all 1,000 instances of a digit into an ideal by computing the value of the (ideal) pixel at (i, j) as the average of the values at (i, j) across all 1,000 examples. Some pruning can also be applied by setting a threshold to remove noise: for example, if the average value at a pixel is below a certain threshold, one can white it out, as it is probably a value produced by accumulated noise. This did improve things slightly, but was not in the end a critical factor. Shown in figure 1 below are the ideals of the digits 0 through 9, as well as the ideals of 4 and 6, along with some samples of the instances used to create them.

Fig. 1. Plato's digits, and a sample of instances of 4 and 6 with their corresponding (composed) ideals.
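
To make the procedure concrete, here is a minimal sketch of the mean composition with noise pruning. This is not the authors' code; the 28x28 image shape is the MNIST standard, but the threshold value and the NumPy formulation are assumptions:

    import numpy as np

    def mean_ideal(instances, noise_threshold=32):
        # instances: array of shape (n, 28, 28) holding n examples of one
        # digit, with pixel intensities 0..255 (28x28 is the MNIST size)
        ideal = instances.astype(float).mean(axis=0)  # pixel-wise mean at each (i, j)
        ideal[ideal < noise_threshold] = 0            # "white-out" accumulated noise;
                                                      # the threshold value is a guess
        return ideal

    # e.g., ideal_6 = mean_ideal(examples_of_6)  # examples_of_6: (1000, 28, 28)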

After creating 10 ideals, corresponding to the ten digits 0 through 9, 30,000 examples were later matched against these ideals using a simple cosine similarity measure (which is derived from the Euclidean dot product and the Euclidean distance). This simple procedure of creating an average ideal and matching unseen examples using cosine similarity resulted in 82% recognition accuracy.
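
Assuming each image is flattened to a vector of pixel intensities, the matching step amounts to picking the ideal with the highest cosine similarity; a sketch (the function and variable names are mine, not the article's):

    import numpy as np

    def cosine_similarity(a, b):
        # cos(a, b) = (a . b) / (|a||b|), computed on flattened pixel vectors
        a, b = a.ravel().astype(float), b.ravel().astype(float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def classify(image, ideals):
        # ideals: dict mapping each digit 0..9 to its composed ideal image;
        # the prediction is the digit whose ideal matches best
        return max(ideals, key=lambda d: cosine_similarity(image, ideals[d]))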

Although 82% is not a very high accuracy rate, this result was actually encouraging considering the very simple procedure. It in fact prompted us to extend the concept of ideals a bit further.

MORE THAN ONE IDEAL?

Creating an ideal digit using the mean is straightforward and quite intuitive. However, there are other possible composition functions that can be employed. For example, the final pixel value at (i, j) can be computed using the geometric mean of all corresponding (i, j) values in the 1,000 examples. Other possible composition functions are the minimum, the maximum, the product, or a combination of these (for example, a minimum of all maximums, or vice versa). It turns out that different composition functions captured different aspects of an ideal: using the product (and the minimum) in computing an ideal for the digit 6, for example, produced an ideal that has just the bare minimum of what might barely look like a 6. This is useful, since it will force other instances to match well with that absolute minimum. In the final analysis, however, it turned out that using a number of ideals was optimal.
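
By way of illustration, here is one hedged realization of several of the composition functions named above, with pixel values assumed scaled to [0, 1]; the exact formulations used in the experiment are not specified, so these are guesses:

    import numpy as np

    def geo_mean(x):
        # geometric mean per pixel; the small shift avoids log(0)
        return np.exp(np.log(x + 1e-6).mean(axis=0))

    def min_of_maxes(x, groups=4):
        # split the instances into groups, take the max within each group,
        # then the min across groups ("a minimum of all maximums")
        return np.min([g.max(axis=0) for g in np.array_split(x, groups)], axis=0)

    # each function maps a stack of instances (n, H, W) to one ideal (H, W)
    compositions = {
        "mean":         lambda x: x.mean(axis=0),
        "geo_mean":     geo_mean,
        "minimum":      lambda x: x.min(axis=0),
        "maximum":      lambda x: x.max(axis=0),
        "product":      lambda x: np.prod(x, axis=0),  # only meaningful over a few instances
        "min_of_maxes": min_of_maxes,
    }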

Figure 2. Results (in red) of applying six composition functions on 52 instances of the letter 'p'. Example instances are shown in black.

In figure 2 we show the result of applying six different composition functions on the letter 'p', along with some sample instances (the set of English characters was obtained from the UC Irvine Machine Learning Repository, available here). It is worth mentioning that an interesting pattern was found during the composition of instances into an ideal character: at some point, including more instances in the composition function did not change the resulting ideal by much. This might of course be an accident of the instances involved, which, in our experiment, were chosen randomly. Nevertheless, we think the apparent quick convergence to an ideal after a few instances is rather interesting, and we will briefly return to this issue in our concluding remarks.
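
One way to quantify that convergence (my formulation, not the article's) is to build ideals from progressively larger prefixes of the instance set and track how much each additional batch changes the result:

    import numpy as np

    def convergence_curve(instances, composition, step=10):
        # mean absolute change between successive ideals as more instances
        # are included; a curve that flattens near zero after a few steps
        # is the "quick convergence" pattern described above
        ideals = [composition(instances[:k])
                  for k in range(step, len(instances) + 1, step)]
        return [float(np.abs(b - a).mean()) for a, b in zip(ideals, ideals[1:])]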

With a number of ideals for each digit (or English character), we can now score example instances more accurately, by scoring an instance against all ideals of all the digits. The intuition here is as follows: a digit that matches (most of) the ideals of 6, for example, and does so better than it matches the ideals of any other digit, is quite likely to be a 6. Using only 4 different ideals for each digit (and character), and matching unseen examples using two common similarity functions (cosine similarity and the Pearson correlation coefficient), recognition accuracies of 98% (and 97%, respectively) were obtained.
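
A sketch of that multi-ideal scoring, with the Pearson correlation coefficient as the second similarity function (aggregating by summing the similarities is my assumption; the article only says instances are scored against all ideals):

    import numpy as np

    def pearson(a, b):
        # Pearson correlation coefficient on flattened pixel vectors
        return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

    def classify_multi(image, ideals_by_digit, similarity):
        # ideals_by_digit: dict mapping each digit to its list of ideals
        # (e.g., 4 per digit); the prediction is the digit whose ideals
        # jointly match the image best
        return max(ideals_by_digit,
                   key=lambda d: sum(similarity(image, ideal)
                                     for ideal in ideals_by_digit[d]))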

Can this Paradigm Scale to Other Images?

Encouraged by the results obtained in handwritten digit and character recognition, we thought of applying this very simple approach to images of physical objects. Ideals in this case must of course take into consideration some important attributes that real-life physical objects have, such as orientation, size, color, etc. Some of these attributes can be handled by the algorithm itself. For example, color can be abstracted away by creating ideals from, and then matching on, the grayscale of the actual images. Size can also be handled by the algorithm itself, by scaling images to a pre-defined size. Ideals for different orientations must, however, be created. Fortunately, there are only a few orientations for each physical object. In figure 3 below we show some ideals created using the arithmetic mean, the geometric mean, and a fuzzy intersection composition function. In the latter, equality was tested using a threshold: pixel values within 16 units of each other were assumed to be equal (16 is roughly the square root of the highest possible pixel value, which is 255).
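
One possible reading of that fuzzy intersection, sketched below; the 16-unit tolerance is from the text, but the blend rule (keep the mean where all instances agree, blank the pixel otherwise) is my assumption:

    import numpy as np

    def fuzzy_intersection(instances, tol=16):
        # keep a pixel only where all instances "agree" to within tol units
        spread = instances.max(axis=0) - instances.min(axis=0)
        ideal = instances.astype(float).mean(axis=0)
        ideal[spread > tol] = 0  # the instances disagree here: blank the pixel
        return ideal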

Figure 3. Examples of six ideals of bus images created using 20 instances.

Shown in figure 3 are six ideals of bus images created from 20 instances (the data set, which contains 1,000 images, is provided by Professor Wang at Penn State University, and is available here). When matching unseen images, an idealized image of each unseen example is also created; ideals for unseen examples were created by performing small transformations and rotations. Creating an ideal of an unseen example before matching effectively amounts to trying to view a new and unseen image from slightly different angles and vantage points. In the end, each instance's ideal was matched with the six ideals of every object and an overall score was computed. Again, this simple approach resulted in over 90% precision, although recall was at only 70%.
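
A sketch of how such an instance ideal might be composed from small perturbations (the specific angles and offsets are placeholders of mine):

    import numpy as np
    from scipy.ndimage import rotate, shift

    def instance_ideal(image, angles=(-10, -5, 0, 5, 10),
                       offsets=((0, 0), (2, 0), (0, 2), (-2, 0), (0, -2))):
        # compose an ideal for an unseen image from small rotations and
        # translations of it, approximating "slightly different angles
        # and vantage points"
        variants = [shift(rotate(image, a, reshape=False), o)
                    for a in angles for o in offsets]
        return np.mean(variants, axis=0)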

End of the Experiment (and Beginning of the Debate)

Our experiment ended here, mainly because we (at klangoo) are not really in the business of image recognition. This experiment was actually triggered by the barrage of articles in the mainstream media proclaiming that intelligent machines are about to creep up in every corner and that some new and amazing technology is about to create true artificial intelligence. Apparently, this hype was due to the successful application of a new technique (deep learning, which is effectively stacked and iterative learning by neural networks), where, using thousands of processors and after training on millions of images, good results were apparently obtained in image (cat!) recognition.

Paradoxically, however, while these techniques are purportedly inspired by biological models, a look at the details suggests that they ignore many biological facts. For one thing, it seems biologically implausible that one would need millions of training examples to recognize cats (or any other object). Surely, children know what a cat looks like after seeing a handful of examples, and none of us waited to see hundreds of thousands of handwritten 'p's to start correctly recognizing and writing a 'p'.

The need to process millions of instances in training a neural network is thus in complete disagreement with biological data, which seems to suggest that children acquire concepts and learn the meaning of new words after a few examples, sometimes even just a couple, or in some instances just one! For example, recent experiments have suggested that, after hearing a phrase such as 'John ate the klangoo', children pointed to a food item when asked to point to a 'klangoo' in a set of images that included one object that can clearly be identified as food. Apparently, from a single training example, children seem to acquire the meaning of a word they never encountered before using some form of semantic constraints (or selectional restrictions), namely that the object of a verb such as 'eat' must be some kind of food. Semantic constraints are not only used in understanding the meaning of new words, but also in image (object) recognition, as reported by a recent study.

Another biological implausibility of these models is due to the fact that many species that have a very small fraction of the number of neurons that humans have (mice, ants, etc.) are born with almost immediate sound and image recognition abilities. The enormous amount of computational power these models seem to need to learn a simple concept therefore also makes them suspect, to say the least.

The reason these models require the processing of an enormous amount of data to learn a single concept is that these (purely quantitative) models operate only on data, and therefore they can never be the right paradigm for high-level reasoning, where computing over information structures and knowledge, and not just data, is required. Failing to recognize the difference between data, information, and knowledge is fatal in artificial (and natural) intelligence.

Data vs. Information (vs. Knowledge)

So why will operating on data only, and failing to recognize the difference between data, information, and knowledge, simply not scale into anything called intelligence? This is a potentially quite involved subject and can get quite technical, but we will attempt to answer the question here using an informal and simple argument.

In grade school we were told (amongst other things) that √256 = 16. But is this correct? Let's see what taking this equality for granted would result in:

A: √256 = 16
B: My 5-year-old daughter knows that 7 + 9 = 16

A is a fact, by virtue of what my grade school teacher told me. B is also a fact (and it is true because it is!): my 5-year-old daughter actually does know that adding 7 to 9 amounts to 16.

But if A is true, we should be able to replace 16 in B by a value that is equal to it, including √256. Doing this results in the following, which is not true:

C: My 5-year-old daughter knows that 7 + 9 = √256

What went wrong? How could we conclude from two facts (two true statements) something that is not true? Were we taught the wrong things in grade school? Well, not exactly, but we were not told the whole story. While √256 is equal to 16, these two objects are equal in one aspect (one attribute) only, namely their value; that is, they are both instances of the concept 16. But as objects they are in general distinct, as they differ in many other respects, and equating them in all contexts is fatal (as we saw, it may lead to a contradiction, or it may lead to invalid inferences/conclusions).

So, as it turns out, there is another and more important equality that we were not told about in grade school (in any case, I doubt my algebra teacher knew about the other equality anyway). But now that we are grown-ups doing artificial intelligence (and talking deep stuff!) we should be well aware of that other equality. In mathematical logic that equality is known as intensional equality (as opposed to extensional equality, the one we all know), and it is at the heart of the difference between data and information (or, equivalently, between objects and concepts), and it is related to abstraction, the fundamental mechanism behind learning.¹
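
The substitution failure above can be stated compactly. A minimal formalization in my own notation, writing K_d(φ) for "my daughter knows that φ" and using the √256 example:

    \[
    \sqrt{256} = 16
    \;\wedge\;
    K_d(7 + 9 = 16)
    \;\not\Rightarrow\;
    K_d\big(7 + 9 = \sqrt{256}\big)
    \]

Substituting equals for equals is valid in extensional contexts, but "knows that" creates an intensional context, in which only intensionally equal expressions may be interchanged.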

Figure 4. Objects vs. Concepts (equal as concepts, distinct as objects)

As objects, the instances in 4a and 4b are all distinct, in that they differ in the values of a number of their attributes (age, color, size, etc.). However, the instances in 4a are the same (equal!) in another respect, namely that they are dogs (they are all instances of the concept of a dog). This (other) equality was reached by abstracting from data to information (we ignored the specific values of a number of attributes). Abstraction of course occurs at various levels. For example, abstracting now over information makes all the objects in 4a and 4b also equal in one respect, namely that they are all animals (as opposed to trees, or rocks, etc.).

Abstracting over data (into information) is still not enough for high-level reasoning; one is also required to represent and reason with knowledge. Arguing for this third level cannot be done here. However, to appreciate the implication of this point I would simply pose this question: how can an image recognition system that correctly identified a certain image to be the image of a dog infer whether the dog in the image is alive or dead? Surely data (and information) alone are not enough, and one must resort to some rules (knowledge), e.g., a dog that seems to be running is not a dead dog!

¹ INtensional equality is concerned with looking INternally at the details of the objects and their attribute values, while EXtensional equality is concerned with the EXternal properties, ignoring the details. Thus, while your cat and mine are distinct objects, looking at them externally (from the outside), they are both cats. Similarly, looking at your cat and mine as well as some cheetah in Nairobi, they are all felines, etc.

Over two hundred years ago, Immanuel Kant, largely considered to be one of the most influential thinkers of modern philosophy (logic, metaphysics, etc.), suggested the following: "All our knowledge begins with the senses, proceeds then to the understanding, and ends with reason. There is nothing higher than reason." Roughly speaking, data corresponds to the senses (that is what our senses do, they collect data!), understanding corresponds to information (which is obtained by imposing a meaningful structure on data), and reason corresponds to knowledge. Thus, claiming that we are about to have artificial intelligence by operating on the lowest (the first) level only, and succeeding (as well as, if not worse than, ants and mice!) in recognizing images and sounds, is surely absurd, and stubbornly sticking to this paradigm will only result in wasting a lot of time and resources.

I conclude with a plea: before we herald the coming of artificial intelligence because of successfully processing large amounts of data, it would be wise to revisit the foundational science that was developed by some of the most brilliant minds in logic, linguistics, metaphysics, philosophy, psychology and mathematics. The over-zealous and apparently misguided journalists can wait a bit before they write headlines that are several years ahead of their time.

Walid Saba

Currently the CIO of klangoo, Walid Saba has over 20 years of experience in information technology, holding various positions at such places as the American Institutes for Research, AT&T Bell Labs, MetLife, Nortel Networks, IBM and Cognos. He has also spent 7 years in academia, where he taught computer science at the University of Ottawa, the New Jersey Institute of Technology, the University of Windsor, the American University of Beirut, and the American University of Technology. He has published over 35 technical articles, including an award-winning paper that was presented in Germany at KI-2008. Walid holds an MSc in Computer Science from the University of Windsor and a PhD in Computer Science from Carleton University.