2003 Writer Identification -- Directional Features

download 2003 Writer Identification -- Directional Features

of 5

Transcript of 2003 Writer Identification -- Directional Features

  • 8/14/2019 2003 Writer Identification -- Directional Features

    1/5

    Writer Identication Using Edge-Based Directional Features

    Marius Bulacu Lambert SchomakerAI Institute, Groningen University

    The Netherlands(bulacu, schomaker)@ai.rug.nl

    Louis VuurpijlNICI, Nijmegen University

    The [email protected]

    Abstract

    This paper evaluates the performance of edge-based di-rectional probability distributions as features in writer iden-tication in comparison to a number of non-angular fea-tures. It is noted that the joint probability distribution of theangle combination of two hinged edge fragments outper- forms all other individualfeatures. Combining features mayimprove the performance. Limitations of the method per-tain to the amount of handwritten material needed in order

    to obtain reliable distribution estimates. The global fea-tures treated in this study are sensitive to major style varia-tion (upper- vs lower case), slant, and forged styles, whichnecessitates the use of other features in realistic forensicwriter identication procedures.

    1. Introduction

    In the process of automatic handwriting recognition, in-variant representations are sought which are capable of eliminating variations between different handwritings in or-

    der to classify the shapes of characters and words robustly.The problem of writer identication, on the contrary, re-quires a specic enhancement of the variations which arecharacteristic to a writers hand. At the same time, such rep-resentations or features should, ideally, be independent of the amount and the semantic content of the written material.In the extreme case, only a single word or a signature shouldsufce to identify the writer. Three groups of features canbe identied in forensic writer identication: 1) globalmea-sures, computed automatically on a region of interest (ROI);2) local measures, of layout and spacing features entered byhuman experts, and 3) measures related to individual char-acter shapes. We analyze in this paper only features that are

    automatically extractable from the handwriting image with-out any human intervention. Furthermore, it is assumed thata crisp foreground/backgroundseparation has been realizedin a pre-processing phase, yielding a white background withblack ink. As a rule of thumb, in forensic writer identica-

    tion one strives for 100% recall of the correct writer in a hitlist of 100 writers, computed from a database of more than

    samples. This amount is based on the pragmatic consid-eration that a number of one hundred suspects is just aboutmanageable in criminal investigation. Current systems arenot powerful enough to attain this goal.

    As regards the theoretical foundation of our approach,the process of handwriting consists of a concatenation of ballistic strokes, which arebounded by pointsof high curva-ture in the pen-point trajectory. Curved shapes are realized

    by differential timing of the movementsof the wrist and thenger subsystem [8]. In the spatial domain, a natural cod-ing, therefore, is expressed by angular information along thehandwritten curve [5]. It has long been known [4, 3, 2] thatthe distribution of directions in handwritten traces, as a po-lar plot, yields useful information for writer identication orcoarse writing-style classication. It is the goal of this pa-per to explore the performance of angular-distributiondirec-tional features, relative to a number of other features whichare in actual use in forensic writer-identication systems.

    2. Data

    We evaluated the effectiveness of different features interms of writer identication using the Firemaker dataset [7]. A number of 250 Dutch subjects, predominantlystudents, were required to write four different A4 pages.On page 1 they were asked to copy a text presented in theform of machine print characters. On page 2 they wereasked to describe the content of a given cartoon in theirown words. Pages 3 and 4 of this database contain uppercase and forged-style samples and are not used here. Lin-eation guidelines were used on the response sheets using adropout color, i.e., one that fully reects the light spectrumemitted by the scanner lamp such that is has the same sensed

    luminance as the white background. The added drawback isthat the vertical line distance can no longer be used as a dis-criminatory writer characteristic. The recording conditionswere standardized: the same kind of paper, pen and supportwere used for all the subjects. As a consequence, this also

    oceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)7695-1960-1/03 $17.00 2003 IEEE

  • 8/14/2019 2003 Writer Identification -- Directional Features

    2/5

  • 8/14/2019 2003 Writer Identification -- Directional Features

    3/5

    0

    2

    1

    3456789

    10

    11

    12

    13

    14

    15 16 17 18 19 20 21

    22

    23

    2

    1

    O X

    Figure 3. Extraction of edge-hinge distribution.

    ments emerging from the central pixel and, subsequently,compute the joint probability distribution of the orientationsof the two fragments.

    To have a more intuitive picture of the feature that we areproposing, imagine having a hinge laid on the surface of theimage. Place its junction on top of every edge pixel, thenopen the hinge and align its legs along the edges. Considerthen the angles

    and

    that the legs make with the hor-izontal and count the found instances in a two dimensionalarray of bins indexed by

    and

    . The nal normalizedhistogram gives the joint probability distribution

    quantifying the chance of nding in the image two hingededge fragments oriented at the angles

    and

    .As already mentioned, in our case edges are usually

    wider than 1-pixel and therefore we have to impose an extraconstraint: we require that the ends of the hinge legs shouldbe separated by at least one non-edge pixel. This makescertain that the hinge is not positioned completely inside thesame piece of the edge strip. This is an important detail, aswe want to make sure that our feature properly describes theshapes of edges (and implicitly the shapes of handwriting)and avoids the senseless cases.

    If we consider an oriented edge fragment AB, the ar-rangement of the hinge is different whether a second ori-ented edge fragment attaches in A or in B. So we have tospan all the four quadrants (360 ) around the central junc-tion pixel when assessing the angles of the two fragments.This contrasts with the previous feature for which spanningthe upper two quadrants (180 ) was sufcient because ABand BA were identical situations.

    Analogously to the previous feature, we considered 3, 4and 5-pixel long edge fragments. This time, however, theirorientation is quantized in = 16, 24 and 32 directionsrespectively (g. 3 is an example for = 24). From thetotal number of combinations of two angles we will con-

    sider only the non-redundant ones (

    ) and we willalso eliminate the cases when the ending pixels have a com-mon side. Therefore the nal number of combinations is and, accordingly, our hingefeature vectors will have 104, 252 and 464 components.

    360180

    0

    phi1

    360

    1800

    phi2

    0

    0.005

    0.015

    0.025

    p(phi1, phi2)

    Figure 4. Graphical representation of the edge-hingeprobability distribution. One half of the 3D plot (situatedon one side of the main diagonal) is at because we onlyconsider the angle combinations with

    .

    For the purpose of comparison, we evaluated also threeother features widely used for writer identication:Run-length distributions

    Run lengths, rst proposed for writer identication byArazi [1], are determined on the binarized image taking into con-sideration either the black pixels corresponding to the ink traceor, more benecially, the white pixels corresponding to the back-ground. Whereas the stati stical properties of the black runs mainlypertain to the ink width and some limited trace shape characteris-tics, the properties of the white runs are indicative of characterplacement statistics. There are two basic scanning methods: hori-zontal along the rows of the image and vertical along the columnsof the image. Similarly to the edge-based directional features pre-sented above, the histogram of run lengths is normalized and inter-preted as a probability distribution. Our particular implementationconsiders only run lengths of up to 100 pixels (the height of a writ-ten line in our data set is of about 120 pixels).

    AutocorrelationEvery row of the image is shifted onto itself by a given off-

    set and then the normalized dot product between the original rowand the shifted copy is computed. The maximum offset (delay)corresponds to 100 pixels. All autocorrelation functions are thenaccumulated for all rows and the sum is normalized to obtain azero-lag correlation of 1. The autocorrelation function detects thepresence of regularity in writing: regular vertical strokes will over-lap in the original row and its horizontally shifted copy for offsetsequal to integer multiples of the local wavelength. This results ina large dot product contribution to the nal histogram.Entropy

    The entropy measure used here focuses on the amount of in-formation, normalized by the amount of ink (black pixels) in theregions of interest. This was realized by using the normalized lesize of ROI les after Lempel-Zif compression. The size of theresulting le (in bytes) is divided by the total number of black pix-els which closely estimates the amount of ink present on the page.

    3

    oceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)7695-1960-1/03 $17.00 2003 IEEE

  • 8/14/2019 2003 Writer Identification -- Directional Features

    4/5

    0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    1 10 20 30 40 50 60 70 80 90 100

    P r o

    b a b

    i l i t y o

    f c o r r e c

    t h i t %

    List size

    p(phi1, phi2) - 464p(phi1, phi2) - 252p(phi1, phi2) - 104

    dp(phi) - 15p(phi) - 16p(phi) - 12p(phi) - 8

    Figure 5. Performance curves of the edge-based fea-tures and

    for different direction quantiza-tions (features are ordered with most effective at the top).

    The obtained feature gives an estimate of the entropy of the ink distribution on the page.

    4. Results

    Evaluation method

    The efcacy of the considered features has been eval-uated using nearest-neighbor classication in a leave-one-out strategy. Explicitly, one page is chosen and extractedfrom the total of 500 pages (notice that the data set con-tains 2 pages written by each of 250 subjects). Then theEuclidean distances are computed between the feature vec-tor of the chosen page and the feature vectors of all of theremaining 499 pages. These distances are ranked startingwith the shortest one. Ideally the rst ranked page shouldbe the pair page produced by the same writer: an ideal fea-

    ture extraction making classication effortless and a remap-ping of the feature space unnecessary. If one considers, notonly the nearest neighbor (rank 1), but rather a longer listof neighbors starting with the rst and up to a chosen rank (e.g. rank 10), the chanceof nding the correct hit increaseswith the list size. The curve depicting the dependency of theprobability of a correct hit vs. the considered list size givesan illustrative measure of performance. Better performancemeans higher probability of correct hit for shorter list sizeswhich is equivalent to a curve drawn as much as possibletoward the upper-left corner.

    We point out that we do not make a separation betweena training set and a test set, all the data is in one suite. This

    is actually a more difcult and realistic testing condition,with more distractors: not 1, but 2 per false writer and onlyone correct hit. Error rates are approximately halved whenusing the traditional train/test set distinction. Note also theadded fact that we only have 2 samples per writer (more

    List

    comb.size 8 12 16 15 104 252 464 564

    1 26 30 35 45 45 57 63 752 34 39 45 55 55 67 71 833 40 47 52 62 64 73 75 864 45 52 57 66 69 77 79 875 49 57 62 70 72 78 81 896 53 60 65 72 73 80 83 917 58 63 68 74 75 82 85 928 60 64 69 75 78 83 86 939 62 65 71 76 79 83 87 93

    10 64 68 72 78 80 84 88 9411 66 69 74 79 81 85 88 9412 68 72 76 81 82 86 88 9513 70 73 77 82 82 87 89 95

    14 71 74 78 83 83 87 89 9515 72 76 79 84 83 88 89 9516 74 77 80 84 84 88 90 9617 76 79 82 84 85 89 90 9618 77 80 82 85 85 89 90 9619 78 81 82 86 86 90 91 9720 79 81 83 87 86 90 91 97

    Table 1. Writer identication accuracy (in percentages)on the Firemaker data set (250 writers, 2 pages / writer).

    labeled samples increasing the chance of a correct identi-cation of the author - see reference [6] for results on 10writers, 15 documents / writer). As a consequence of thesecircumstances our results are more conservative.

    Analysis of performances

    We present the performance curves of the edge-based di-rectional features in g. 5 and the numerical values in Ta-ble 1. The dimensionality of every feature is mentioned inthe gure and in the heading of the table.

    Conrming our initial expectations, the improvement inperformance yielded by the new feature is very signicantdespite the excessive dimensionality of the feature vectors(veried by PCA analysis). As a second-order feature, thehinge angular probability distribution captures larger rangecorrelations from the pixel space and therefore it charac-

    terizes more intimately the handwriting style providing formore accurate writer identication.Examination of the family of curves in g. 5 attests that

    ner quantized directions result in improved performance atthe expense of an increase in feature vector dimensionality(much more sizeable for the edge-hinge feature

    ).Figure 6 givesa general overview of the comparativeper-

    formance for all the features considered in this paper.The edge-based directional features perform signi-

    cantly better than the other features because they give amore detailed and intimate information about the peculiar-ities of the shapes that the writer produces (slant and regu-larity of writing, roundness or pointedness of letters).

    An interesting observation is that the vertical run lengthson ink are more informative than the horizontal ones. Thiscorrelates with an established fact from on-line handwritingrecognition research stating that the vertical component of strokes carries more information than the horizontal one [4].

    4

    oceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)7695-1960-1/03 $17.00 2003 IEEE

  • 8/14/2019 2003 Writer Identification -- Directional Features

    5/5

    0

    10

    20

    30

    40

    50

    60

    70

    80

    90

    100

    1 10 20 30 40 50 60 70 80 90 100

    P r o

    b a b

    i l i t y o

    f c o r r e c

    t h i t %

    List size

    combined - 564p(phi1, phi2) - 464

    dp(phi) - 15p(phi) - 16

    run-length horiz. white - 100run-length vert. white - 100

    autocorrelation - 100run-length vert. black - 100

    run-length horiz. black - 100entropy - 1

    Figure 6. Performance curves for the evaluated features(features are ordered with the most effective at the top).

    The presented features are not totally orthogonal, butnevertheless they do offer different points of view on ourdata set. It is therefore natural to try to combine them forimproving the accuracy of writer identication. This topicwill be a major interest in our future research. We willpresent here though, for exemplication, some initial resultsobtained by concatenating the edge-hinge distribution withthe horizontal run lengths on white into a single feature vec-tor that was afterward used for nearest-neighbor classica-tion (last column in Table 1).Stability test

    An important question arises: what is the degradation inperformance with decreasing amounts of handwritten ma-terial? We provide three reference points: whole page ( ),half page (top ( ) and bottom ( )), and the rst line ( ). Theanswer to this question has major bearing for forensic ap-plications (where, in many cases, the available amount of handwritten material is sparse, e.g. the lled in text on abank invoice or the address on a perilous letter).

    We consider writer identication accuracies for hit listsup to rank 10 (deemed as a more reliable anchor point). Ourresults from Table 2 show signicant degradation of per-formance when very little handwritten material is available.However, it is interesting to observe that the performancestandings of the different features with respect to each otherremain the same, independent of the amount of text.

    5. Conclusion

    We proposed a new edge-based feature for writer iden-

    tication that characterizes the changes in direction under-taken during writing. It performs markedly better than allthe other evaluated features.

    Our stability test show that the best performing featureswhen a large amount of text is available still perform best

    Feature w t b l

    88 81 84 53 72 66 69 36

    run- length horiz. whi te 57 42 42 18run-length vert . whi te 51 39 42 16run-leng th vert . b lack 36 33 33 13

    entropy 8 4 6 5

    Table 2. Feature performance degradation with decreas-ing amounts of written text (writer identication accuracyin percentages for list size = 10).

    compared to the others when little text is available, despitehaving considerably higher dimension.The results reported here, based on a very clean data set,

    have an academic relevance as to the usefulness of differentfeatures and they do not fully address problems like sizeinvariance, non-uniform background, degraded documents.

    Our future research interest will focus on reducing thedimensionality of the feature vectors while still keepingtheir discriminatory power and on further improving per-formance by combining different features in order to exploittheir intrinsic degree of orthogonality.

    References

    [1] B. Arazi. Handwriting identication by means of run-lengthmeasurements. IEEE Trans. Syst., Man and Cybernetics ,SMC-7(12):878881, 1977.

    [2] J.-P. Crettez. A set of handwriting families: style recognition.In Proc. of the Third International Conference on Document Analysis and Recognition , pages 489494, Montreal, August1995. IEEE Computer Society Press.

    [3] F. Maarse, L. Schomaker, and H.-L. Teulings. Automaticidentication of writers. In G. van der Veer and G. Mulder,editors, Human-Computer Interaction: Psychonomic Aspects ,pages 353360. Springer, New York, 1988.

    [4] F. Maarse and A. Thomassen. Produced and perceived writ-ing slant: differences between up and down strokes. ActaPsychologica , 54(1-3):131147, 1983.

    [5] R. Plamondon and F. Maarse. An evaluation of motor modelsof handwriting. IEEE Trans. Syst. Man Cybern , 19:10601072, 1989.

    [6] H. Said, G. Peake, T. Tan, and K. Baker. Writer identicationfrom non-uniformly skewed handwriting images. In Proc. of the 9th British Machine Vision Conference , pages 478487,1998.

    [7] L. Schomaker and L. Vuurpijl. Forensic writer identication:A benchmark data set and a comparison of two systems [inter-nal report for the Netherlands Forensic Institute]. Technicalreport, Nijmegen: NICI, 2000.

    [8] L. R. B. Schomaker, A. J. W. M. Thomassen, and H.-L. Teul-ings. A computational model of cursive handwriting. InR. Plamondon, C. Y. Suen, and M. L. Simner, editors, Com- puter Recognition and Human Production of Handwriting ,pages 153177. Singapore: World Scientic, 1989.

    5

    oceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)7695-1960-1/03 $17.00 2003 IEEE