oriya writer identification

  • 8/11/2019 oriya writer identification

    Abstract

    In handwritten writer identification and character recognition we perform an image-based analysis: a scanned digital image containing handwritten script is taken as input, and the system translates it into a machine-editable digital text format. The Oriya language presents great challenges because of the large number of letters in its alphabet, the sophisticated ways in which they combine, and the fact that many letters are roundish and look alike.

    In this project an attempt is made to recognize Oriya characters by using histogram of gradient (HOG) features of the character image. The features so obtained are passed to a Hidden Markov Model (HMM), which gives the identification result.

    Keywords: character recognition, writer identification, histogram of gradient, Hidden Markov Model (HMM)

    OUTLINE:-

    1. Abstract
    2. Objective
    3. Introduction
    4. Proposed approach
    5. Pre-processing
       Otsu binarization
       Line segmentation
       Word segmentation
       Zone segmentation of words
       Character segmentation
    6. Feature extraction
       Local gradient histogram feature (HOG)
    7. Identification
       Hidden Markov Model (HMM)
    8. Results and outputs
    9. Discussion
    10. Conclusion
    11. Bibliography

    OBJECTIVE:-

    1. Identification of the writer by scanning handwritten Oriya documents.
       Compare the results of writer identification with zone segmentation and without zone segmentation.

    2. Recognition of each character written in the document.
       Identify which Oriya character is written and then convert it to the corresponding English letter based on a dictionary.

    INTRODUCTION:-

    Oriya, officially spelled Odia, is an Indian language belonging to the Indo-Aryan branch of the Indo-European language family. It is the predominant language of the Indian state of Odisha, where native speakers comprise 80% of the population, and it is spoken in parts of West Bengal, Jharkhand, Chhattisgarh and Andhra Pradesh. Oriya is one of the many official languages of India; it is the official language of Odisha and the second official language of Jharkhand.

    Since it is an old language, there are various old documents whose writers are unknown. My project deals with this problem. Its main aim is to identify the writer. The other part of the project is to identify each character written.

    Due to the presence of complex features such as the headline, vowels, modifiers, etc., character segmentation in Oriya script is not easy. Also, the positions of vowels and compound characters make the task of segmenting words into characters very complex. To take care of this problem we tried a novel method that combines a zone-wise break-up of words with HMM-based recognition. In particular, the word image is segmented into 3 zones: upper, middle and lower. The components in the middle zone are modelled using HMMs. By this zone segmentation approach we

    Pre-Processing:

    (Binarization/Thresholding)

    Binarization is a process in which a grayscale, RGB, BMP, etc. image is converted into a binary image.

    Let us consider a grayscale image. A grayscale image consists of pixels, each of which has a depth of 8 bits, i.e. 256 possible levels. This value represents the intensity level of the pixel.

    The thresholding method of binarization basically determines a threshold, or index. Once this threshold is obtained, we divide the image into two classes: intensities above the threshold fall into the white class and intensities below it into the black class.

    For binarization a single threshold is selected. There are various methods; two of them are mostly preferred:

    Automatic thresholding

    Otsu binarization

    Otsu binarization is an optimal thresholding technique in the sense that it picks the threshold that maximizes the inter-class variance, and it is the method mostly preferred.

    The problem with automatic thresholding is that whenever the valley between the two classes is small, the threshold obtained is erroneous. Hence Otsu binarization is used.

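    Otsu's Thresholding Method

    Based on a very simple idea: find the threshold that minimizes the weighted within-class variance.

    This turns out to be the same as maximizing the between-class variance.

    It operates directly on the gray-level histogram [e.g. 256 numbers, P(i)], so it is fast (once the histogram is computed).

    Method

    In Otsu's method we exhaustively search for the threshold that minimizes the intra-class variance (the variance within the class), defined as a weighted sum of the variances of the two classes:

    $$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t)$$

    The weights $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by a threshold $t$, and $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are the variances of these classes.

    Otsu shows that minimizing the intra-class variance is the same as maximizing the inter-class variance:

    $$\sigma_b^2(t) = \sigma^2 - \sigma_w^2(t) = \omega_0(t)\,\omega_1(t)\,[\mu_0(t) - \mu_1(t)]^2$$

    which is expressed in terms of the class probabilities $\omega_0(t)$, $\omega_1(t)$ and the class means $\mu_0(t)$, $\mu_1(t)$.

    The class probability is computed from the histogram as:

    $$\omega_0(t) = \sum_{i=0}^{t-1} P(i), \qquad \omega_1(t) = \sum_{i=t}^{L-1} P(i)$$

    where $L$ is the number of gray levels (e.g. 256).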
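    The exhaustive search described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names otsu_threshold and binarize are hypothetical, the histogram is assumed to be a plain list of pixel counts per gray level, and pixels with a level below the returned threshold are treated as the black class.

```python
def otsu_threshold(histogram):
    """Return the threshold t that maximizes the between-class variance.

    histogram[i] is the number of pixels with gray level i.
    Levels below t form class 0 (black), the others class 1 (white).
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    total_sum = sum(i * h for i, h in enumerate(histogram))

    best_t, best_var = 0, -1.0
    count0 = 0  # number of pixels in class 0
    sum0 = 0    # sum of gray levels in class 0
    for t in range(1, len(histogram)):
        count0 += histogram[t - 1]
        sum0 += (t - 1) * histogram[t - 1]
        count1 = total - count0
        if count0 == 0 or count1 == 0:
            continue  # one class is empty, nothing to compare
        w0 = count0 / total
        w1 = count1 / total
        mu0 = sum0 / count0
        mu1 = (total_sum - sum0) / count1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray levels to 0 (black) or 1 (white)."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]
```

    Because only the histogram is scanned, the cost of the search depends on the number of gray levels, not on the image size.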

    the peak/valley points of the histogram, individual lines are generally segmented. Although this global horizontal projection method is applicable for line segmentation of printed documents, it cannot be used on unconstrained handwritten documents because the characters of two consecutive text lines may touch or overlap. For example, see the 4th and 5th text lines of the document shown in the figure below; these two lines are mostly overlapping. To take care of unconstrained handwritten documents, we use a piece-wise projection method, as follows. At first, we divide the text into vertical stripes of width W (here we assume that the document page is in portrait mode). The width of the last stripe may differ from W. If the text width is Z and the number of stripes is N, the width of the last stripe is [Z - W × (N - 1)].

    Computation of W is discussed later. Next, we compute piece-wise separating lines (PSLs) from each of these stripes. We compute the row-wise sum of all black pixels of a stripe. A row where this sum is zero is a PSL. We may get a few consecutive rows where the sum of all black pixels is zero; then the first row of such consecutive rows is the PSL. The PSLs of the different stripes of a text are shown in figure 2a by horizontal lines. Not all of these PSLs are useful for line segmentation. We choose some potential PSLs as follows. We compute the normal distances between two consecutive PSLs in a stripe. So if there are n PSLs in a stripe we get n - 1 distances. This is done for all stripes. We compute the statistical mode

    (MPSL) of such distances. If the distance between any two consecutive PSLs of a stripe is less than MPSL, we remove the upper PSL of these two PSLs. The PSLs obtained after this removal are the potential PSLs. The potential PSLs obtained from the PSLs of figure 2a are shown in figure 2b. We note the left and right co-ordinates of each potential PSL for future use. By proper joining of these potential PSLs, we get individual text lines. It may be noted that sometimes, because of overlapping or touching of one component of the upper line with a component of the lower line, we may not get PSLs in some regions. Also, because of some modified characters of Oriya (e.g. ikar, chandrabindu) we find some extra PSLs in a stripe. We take care of them during PSL joining, as explained next.

    Joining of PSLs is done in two steps. In the first step, we join PSLs from right to left and, in the second step, we first check whether line-wise PSL joining is complete or not. If for a line it is not complete, joining from left to right is done to obtain complete segmentation. We say the PSL joining of a line is complete if the length of the joined PSLs is equal to the column count (width) of the document image. This two-step approach is used to get good results even if two consecutive text lines are overlapping or connected. To join a PSL of the ith stripe, say Ki, to a PSL of the (i - 1)th stripe, we check whether any PSL whose normal distance from Ki is less than MPSL exists in the (i - 1)th stripe. If it exists, we join the left co-ordinate of Ki with the right co-ordinate of the PSL in the (i - 1)th stripe. If it does not exist, we extend Ki horizontally in the left direction until it reaches the left boundary of the (i - 1)th stripe or intersects a black pixel of any component in the (i - 1)th stripe. If the extended part intersects a black pixel of a component of the (i - 1)th stripe, we decide whether the component belongs to the upper line or the lower line. Based on the belongingness of this component, we extend the line in such a way that the component falls in its actual line.

    Belongingness of a component is decided as follows. We compute the distances from the intersecting point to the topmost and bottommost points of the component. Let d1 be the top distance and d2 the bottom distance. If d1 < d2 and d1 < (MPSL/2) then the component belongs to the lower line. If d2 ≤ d1 and d2 < (MPSL/2) then the component belongs to the upper line. If d1 > (MPSL/2) and d2 > (MPSL/2) then we assume the component touches another component of the lower line. If the component belongs to the upper line (lower line) then the line is

    extended following the contour of the lower part (upper part) of the

    component so that the component can be included in the upper

    line (lower line). The line extension is done until it reaches the left

    boundary of the (i - 1)th stripe. If the component is touching, we detect

    possible touching points based on the structural shape of the touching

    component. From the experiment, we notice that in most of the touching

    cases there exist junction/crossing shapes or there exist some obstacle

    points in the middle E-SEGM) is as follows.
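    The first stages of the piece-wise projection method (finding PSLs per stripe, computing MPSL, keeping the potential PSLs) and the belongingness rule can be sketched as follows. This is a simplified illustration under stated assumptions: the image is a 2-D list with 1 for black pixels, all function names are hypothetical, the PSL joining step is not covered, and the cases that the text leaves open in the belongingness rule are treated here as touching.

```python
from collections import Counter


def stripe_psls(image, start_col, end_col):
    """First row of every run of rows with no black pixel in one stripe."""
    psls = []
    prev_blank = False
    for r, row in enumerate(image):
        blank = sum(row[start_col:end_col]) == 0
        if blank and not prev_blank:
            psls.append(r)
        prev_blank = blank
    return psls


def piecewise_psls(image, stripe_width):
    """PSLs of every vertical stripe; the last stripe may be narrower."""
    width = len(image[0])
    return [stripe_psls(image, c, min(c + stripe_width, width))
            for c in range(0, width, stripe_width)]


def mode_distance(all_psls):
    """Statistical mode (MPSL) of distances between consecutive PSLs."""
    distances = [b - a for psls in all_psls for a, b in zip(psls, psls[1:])]
    return Counter(distances).most_common(1)[0][0] if distances else 0


def potential_psls(psls, mpsl):
    """Drop the upper PSL of any consecutive pair closer than MPSL."""
    kept = []
    for i, p in enumerate(psls):
        if i + 1 < len(psls) and psls[i + 1] - p < mpsl:
            continue  # upper PSL of a pair that is too close
        kept.append(p)
    return kept


def belongingness(d1, d2, mpsl):
    """Decide the line of a component from its top (d1) and bottom (d2) distances."""
    half = mpsl / 2
    if d1 < d2 and d1 < half:
        return "lower"
    if d2 <= d1 and d2 < half:
        return "upper"
    return "touching"  # includes the stated case d1, d2 > MPSL/2
```

    The stripe width is left as a parameter because its computation is not given in this part of the text.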

    Word segmentation:

    For word segmentation from a line, we compute the vertical histogram of the line. In general, the distance between two consecutive words of a line is greater than the distance between two consecutive characters in a word. Taking the vertical histogram of the line and using this distance criterion, we segment words from lines. For example, see figure 3a.

    A very simple algorithm can be followed. Vertical smoothing can be done, similar to the horizontal smoothing explained under zone segmentation. A clear valley of the histogram is then obtained between the words, and we divide the line into words at these valley positions.
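    The distance criterion on the vertical histogram can be illustrated with the following sketch. The names vertical_projection and segment_words and the min_gap parameter are hypothetical; the line image is assumed to be a 2-D list with 1 for black pixels, and the gap width that separates words has to be chosen for the data at hand.

```python
def vertical_projection(line_image):
    """Number of black pixels in each column of a binary text-line image."""
    return [sum(col) for col in zip(*line_image)]


def segment_words(line_image, min_gap):
    """Return (start, end) column ranges of words, end exclusive.

    Runs of blank columns at least min_gap wide separate words;
    narrower gaps are treated as gaps between characters of one word.
    """
    profile = vertical_projection(line_image)
    words = []
    start = None     # first column of the current word
    last_ink = None  # last column seen that contains ink
    for x, count in enumerate(profile):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words
```

    The same routine with a smaller min_gap applied to the middle zone of a word gives a rough character segmentation.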

    Zone Segmentation:

    A word in Oriya can be divided into three zones: the upper zone, the middle zone and the lower zone. The segmentation of a word into the corresponding three regions is shown in figure 4. Modifiers like ekar lie in the upper zone. The vowels and consonants are in the middle zone. Lastly, ukar, rukar, etc. lie in the lower zone.

    Figure 4: (a) Original word. (b) Zone-segmented word (upper, middle, lower).

    Why do zone segmentation?

    Mostly modifiers are written in the upper and lower regions. While writing a modifier, the writer in most cases produces touching characters and irregular shapes. It is found that writer identification and character recognition with zone segmentation give better results than without zone segmentation.

    Algorithm

    1. A window (w = length of a character) is traversed in the row direction.

    2. For each window, the character is smoothed.
       If the distance between two black pixels is less than the threshold, all the horizontal pixels between them are made black.
       Near the start and end of the image, smoothing is not obtained.

    3. A horizontal histogram is plotted for each window.

    4. Based on the valleys and peaks obtained in the histogram, the zone segmentation is obtained.
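    The steps above can be sketched as follows. This is a simplified illustration: smoothing is applied to the whole word rather than window by window, and the rule for reading the zones off the histogram (the longest band of rows holding at least a fraction ratio of the peak count is taken as the middle zone) is an assumption, since the text only says that valleys and peaks are used. All names and default values are hypothetical.

```python
def smooth_rows(word_image, max_gap):
    """Fill white gaps narrower than max_gap between black pixels of a row."""
    smoothed = []
    for row in word_image:
        new_row = list(row)
        last_black = None
        for x, px in enumerate(row):
            if px:
                if last_black is not None and 0 < x - last_black - 1 < max_gap:
                    for k in range(last_black + 1, x):
                        new_row[k] = 1
                last_black = x
        smoothed.append(new_row)
    return smoothed


def horizontal_histogram(image):
    """Number of black pixels in each row."""
    return [sum(row) for row in image]


def split_zones(word_image, max_gap=3, ratio=0.5):
    """Return (top, bottom) rows of the middle zone, bottom exclusive.

    Rows [0, top) form the upper zone, rows [bottom, height) the lower zone.
    """
    hist = horizontal_histogram(smooth_rows(word_image, max_gap))
    peak = max(hist)
    if peak == 0:
        return (0, len(hist))
    dense = [h >= ratio * peak for h in hist]
    best = (0, 0)
    run_start = None
    for y, is_dense in enumerate(dense + [False]):  # sentinel closes last run
        if is_dense and run_start is None:
            run_start = y
        elif not is_dense and run_start is not None:
            if y - run_start > best[1] - best[0]:
                best = (run_start, y)
            run_start = None
    return best
```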

    Character segmentation:

    The identification of the writer is done word-wise, while character recognition is done character by character, so character segmentation is required. The middle zone is taken, vertical smoothing is applied, and the characters are then segmented based on the valleys and peaks of the histogram.

    Various other methods are used for character segmentation, such as the water reservoir method, which is very effective for Hindi and Bengali but is not an effective method for Oriya text.

    Figure 5: Character segmentation from words.

    Feature extraction (HOG):

    Local gradient histogram (LGH) [19] has been used for feature extraction in our approach. Here, a sliding window traverses the image from left to right in order to produce a sequence of overlapping sub-images. Each window is sub-divided into 4 × 4 (4 rows and 4 columns)
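    A rough sketch of sliding-window gradient histogram features is given below. Only the left-to-right sliding window and the 4 × 4 sub-division come from the text; the window width, the step, the use of 8 orientation bins, the hard assignment of each pixel to a single bin and the L2 normalization are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def gradients(image):
    """Central-difference gradients; pixels outside the image repeat the edge."""
    h, w = len(image), len(image[0])

    def px(y, x):
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    gx = [[px(y, x + 1) - px(y, x - 1) for x in range(w)] for y in range(h)]
    gy = [[px(y + 1, x) - px(y - 1, x) for x in range(w)] for y in range(h)]
    return gx, gy


def window_feature(gx, gy, x0, win_w, cells=4, bins=8):
    """Orientation histogram per cell for the window starting at column x0."""
    h = len(gx)
    feat = [0.0] * (cells * cells * bins)
    for y in range(h):
        cy = min(y * cells // h, cells - 1)
        for x in range(x0, x0 + win_w):
            mag = math.hypot(gx[y][x], gy[y][x])
            if mag == 0:
                continue
            angle = math.atan2(gy[y][x], gx[y][x]) % (2 * math.pi)
            b = min(int(angle / (2 * math.pi) * bins), bins - 1)
            cx = min((x - x0) * cells // win_w, cells - 1)
            feat[(cy * cells + cx) * bins + b] += mag  # magnitude-weighted vote
    norm = math.sqrt(sum(v * v for v in feat))
    return [v / norm for v in feat] if norm > 0 else feat


def lgh_sequence(image, win_w=8, step=2, cells=4, bins=8):
    """Feature vectors of overlapping windows taken from left to right."""
    gx, gy = gradients(image)
    width = len(image[0])
    return [window_feature(gx, gy, x0, win_w, cells, bins)
            for x0 in range(0, width - win_w + 1, step)]
```

    With these assumed settings each window yields a 4 × 4 × 8 = 128-dimensional vector, and the sequence of vectors is what a sequence model would consume.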

    Identification (Hidden Markov Model):

    The feature vector sequence is processed using left-to-right continuous density HMMs [11]. One of the important features of an HMM is its capability to model sequential dependencies. An HMM can be defined by the initial state probabilities $\pi$, the state transition matrix $A = [a_{ij}]$, $i, j = 1, 2, \ldots, N$, where $a_{ij}$ denotes the transition probability from state $i$ to state $j$, and the output probability, modeled with a continuous output probability density function. The density function is written as $b_j(x)$, where $x$ represents a $k$-dimensional feature vector. A separate Gaussian mixture model (GMM) is defined for each state of the model. Formally, the output probability density of state $j$ is defined as

    $$b_j(x) = \sum_{k=1}^{M_j} c_{jk}\, \mathcal{N}(x; \mu_{jk}, \Sigma_{jk})$$

    where $M_j$ is the number of Gaussians assigned to state $j$, $\mathcal{N}(x; \mu, \Sigma)$ denotes a Gaussian with mean $\mu$ and covariance matrix $\Sigma$, and $c_{jk}$ is the weight coefficient of Gaussian component $k$ of state $j$. For a model $\lambda$, if $O$ is an observation sequence $O = (O_1, O_2, \ldots, O_T)$ which is assumed to have been generated by a state sequence $Q = (Q_1, Q_2, \ldots, Q_T)$ of length $T$, we calculate the observation probability, or likelihood, as follows:

    $$P(O \mid \lambda) = \sum_{Q} \pi_{Q_1}\, b_{Q_1}(O_1) \prod_{t=2}^{T} a_{Q_{t-1} Q_t}\, b_{Q_t}(O_t)$$

    where $\pi_{Q_1}$ is the initial probability of the first state.

    In the training phase, the transcriptions of the middle zone of the word images, together with the feature vector sequences, are used to train the character models. Recognition is performed using the Viterbi algorithm. For the HMM implementation, we used the HTK toolkit.
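    To make the Viterbi scoring step concrete, here is a small stand-alone sketch. It is not the HTK implementation used in the project: it assumes a single diagonal-covariance Gaussian per state instead of a mixture, the parameters are given rather than trained, and the names (viterbi_log_score, left_to_right, identify) are hypothetical. A word is scored against one model per writer and the best-scoring model is returned.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_log_score(obs, log_pi, log_a, means, variances):
    """Log probability of the best state path for the observation sequence."""
    n = len(log_pi)
    delta = [log_pi[j] + log_gauss(obs[0], means[j], variances[j])
             for j in range(n)]
    for x in obs[1:]:
        delta = [max(delta[i] + log_a[i][j] for i in range(n))
                 + log_gauss(x, means[j], variances[j]) for j in range(n)]
    return max(delta)


def left_to_right(n_states, stay=0.6):
    """Log initial and transition probabilities of a left-to-right chain."""
    log_pi = [0.0] + [NEG_INF] * (n_states - 1)  # always start in state 0
    log_a = [[NEG_INF] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            log_a[i][i] = math.log(stay)
            log_a[i][i + 1] = math.log(1 - stay)
        else:
            log_a[i][i] = 0.0  # the last state can only loop
    return log_pi, log_a


def identify(obs, models):
    """Name of the model with the highest Viterbi log score."""
    return max(models, key=lambda name: viterbi_log_score(obs, *models[name]))
```

    Working in the log domain avoids numerical underflow on long feature sequences, which is also why recognition scores are usually reported as log likelihoods.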

    Results and output:

    Writer identification without zone segmentation

    Sample results:

    #!MLF!#

    "?/w1wd7.rec"

    0 500000 w1 -14658.187500

    .

    "?/w1wd8.rec"

    0 600000 w1 -14241.120117

    .

    "?/w1wd9.rec"

    0 1100000 w2 -29309.158203

    .

    "?/w2wd7.rec"

    0 800000 w2 -12532.899414

    .

    "?/w2wd8.rec"

    0 900000 w2 -18292.097656

    .

    "?/w2wd9.rec"

    0 500000 w1 -16671.017578

    .

    "?/w3wd7.rec"

    0 800000 w3 -15888.551758

    .

    "?/w3wd8.rec"

    0 1500000 w3 -29638.150391

    .

    "?/w3wd9.rec"

    0 1100000 w3 -22744.312500

    .

    "?/w4wd7.rec"

    0 600000 w4 -20115.078125

    .

    "?/w4wd8.rec"

    0 600000 w1 -16832.888672

    .

    "?/w4wd9.rec"

    0 700000 w3 -17952.589844

    .

    "?/w5wd7.rec"

    0 1100000 w3 -21285.880859

    .

    "?/w5wd8.rec"

    0 1100000 w4 -23600.558594

    .

    "?/w5wd9.rec"

    0 500000 w2 -16885.208984

    .

    Writer identification with zone segmentation

    Sample results:

    #!MLF!#

    "?/w2wd10.rec"

    0 800000 w2 5093.161621

    .

    "?/w2wd11.rec"

    0 600000 w2 3443.149170

    .

    "?/w2wd12.rec"

    0 700000 w2 4561.872559

    .

    "?/w3wd10.rec"

    0 1000000 w3 6964.748535

    .

    "?/w3wd11.rec"

    0 700000 w3 3812.657227

    .

    "?/w3wd12.rec"

    0 700000 w3 4499.316406

    .

    "?/w4wd10.rec"

    0 400000 w4 2198.743408

    .

    "?/w4wd12.rec"

    0 600000 w4 3866.389648

    .

    "?/w5wd10.rec"

    0 700000 w5 4680.817383

    .

    "?/w5wd11.rec"

    0 500000 w4 3532.066162

    .

    "?/w5wd12.rec"

    0 900000 w5 5889.884766

    .

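    Discussion:

    The results obtained for writer identification are clearly better for zone-segmented images than for images without zone segmentation. Taking the writer prefix of each file name (w1, w2, ...) as the true writer, 8 of the 15 sample words listed without zone segmentation are attributed to the correct writer, against 10 of the 11 sample words listed with zone segmentation.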
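    The comparison can be tallied directly from the recognition output. The sketch below is illustrative only: it assumes that the prefix of each file name (w1 in w1wd7.rec) names the true writer and that each label line has the form "start end label score", as in the sample results. The function names parse_mlf and accuracy are hypothetical.

```python
import re


def parse_mlf(text):
    """Return (file name, recognized label, score) triples from MLF text."""
    results = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#" or line == ".":
            continue
        if line.startswith('"'):
            current = line.strip('"').split("/")[-1]  # e.g. w1wd7.rec
            continue
        fields = line.split()
        if current is not None and len(fields) >= 4:
            results.append((current, fields[2], float(fields[3])))
    return results


def accuracy(results):
    """Count words whose recognized label matches the file-name prefix."""
    correct = 0
    for name, label, _score in results:
        match = re.match(r"(w\d+)wd", name)
        if match and match.group(1) == label:
            correct += 1
    return correct, len(results)
```

    Applied to a full listing, accuracy(parse_mlf(text)) returns the number of correctly attributed words and the number of words scored.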

    Ongoing work:

    The writer identification part is complete, but the recognition of Oriya text documents is still in progress. Completing it will require about three more months of continuous work.

    Conclusion:

    Writer identification was successfully carried out and significant results were obtained. A scheme for segmentation of unconstrained Oriya handwritten text into lines, words and characters is proposed in this paper. Here, at first, the text image is segmented into lines, and then the lines are segmented into individual words. Next, for character segmentation from words, isolated and connected (touching) characters in a word are initially detected. Using structural, topological and water reservoir concept-based features, touching characters of the word are then segmented into isolated characters. To the best of our knowledge, this is the first work of its kind on Oriya text. The proposed water reservoir-based approach can also be used for other Indian scripts where touching patterns show similar behavior.

    Bibliography:

    [1] U. Pal and B. B. Chaudhuri, "OCR in Bangla: an Indo-Bangladeshi Language", Proceedings of the 12th IAPR International Conference on Pattern Recognition B: Computer Vision & Image Processing, 1994.

    [2] S. Chanda, K. Franke, U. Pal and T. Wakabayashi, "Text Independent Writer Identification for Bengali Script", Proc. 20th International Conference on Pattern Recognition, pp. 2005-2008, 2010.

    [3] U. Pal, A. Belaid and C. Choisy, "Touching numeral segmentation using water reservoir concept", Pattern Recognition Letters, pp. 261-272, 2003.

    [4] J. M. White and G. D. Rohrer, "Image thresholding for optical character recognition and other applications requiring character image extraction", IBM J. of Res. and Dev., vol. 27, pp. 400-411, 1983.

    [5] O. Tuzel, F. Porikli and P. Meer, "Pedestrian detection via classification on Riemannian manifolds", IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1713-1727, 2008.

    [6] L. R. Rabiner, "A Tutorial on HMM and Selected Applications in Speech Recognition", IEEE Proceedings, vol. 77, pp. 257-286, 1989.

    [7] M. Chen, A. Kundu and S. N. Srihari, "Variable Duration HMM and Morphological Segmentation for Handwritten Word Recognition", IEEE Trans. on Image Proc., vol. 4, no. 12, pp. 1675-1688, 1995.

    [8] A. Mohan, C. Papageorgiou and T. Poggio, "Example-based object detection in images by components", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 349-361, 2001.

    [9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

    [10] J. Yen, F. Chang and S. Chang, "A new criterion for automatic multilevel thresholding", IEEE Trans. Image Processing, vol. 4, no. 3, pp. 370-378, 1995.

    [11] B. B. Chaudhuri, U. Pal and M. Mitra, "Automatic recognition of printed Oriya script", Sadhana, vol. 27, part 1, pp. 23-34, February 2002.

    [12] U. Pal, N. Sharma and F. Kimura, "Oriya offline handwritten character recognition", Proc. International Conference on Advances in Pattern Recognition, pp. 123-128, 2007.

    [13] U. Pal and B. B. Chaudhuri, "Indian Script Character Recognition: A Survey", Pattern Recognition, vol. 37, pp. 1887-1899, 2004.