Document classification

Miroslav Halas

[email protected]

1. Introduction

Society is shifting towards the dominant use of digital information. We have the option to receive bank statements, credit card bills, shopping invoices, and proofs of delivery for goods purchased online in the form of electronic documents. Many countries have passed laws making the digital signature an equal counterpart of the traditional one. The electronic representation of documents allows us to access the information they convey more efficiently. The storage of electronic information has become more reliable, hopefully making the fear of losing archives in events similar to the destruction of the Great Library of Alexandria [33] a thing of the past. The cost of duplicating electronic information has become insignificant, on one hand enabling the beneficial transfer of knowledge and on the other raising concerns about the collapse of traditional business models for publishing, music, and visual media distribution.

In spite of all the advantages of electronic media, people still prefer paper to record, transfer, and archive information. Paper is still the prevalent medium used in offices to conduct business. This might be due to the fact that paper has been used in this manner for about 5000 years, since the invention of papyrus [32]. We have learned to trust it as well as to deal with its shortcomings. It is uncertain how long it will take to gain the same level of trust and familiarity with electronic documents. Before a complete shift to a paperless world can occur, we need to learn how to handle more efficiently the growing amount of paper we produce today.

2. Definitions

The meaning of the term “paper document” can be very broad. In the context of this paper we adopt from [34] the definition of paper documents as “objects created expressly to convey information encoded as iconic symbols”. This definition therefore excludes non-symbolic documents such as portraits or medical and satellite images. An electronic document (sometimes also called an image document [1], or just a document for short) is then, for the purpose of this paper, defined as an ordered collection of images created from the paper document using a scanner or fax. The individual images are called pages. A special kind of (paper or image) document is the form. A form differs from other documents because, as described in [16, 25], it contains almost exclusively horizontal and vertical lines creating a Manhattan-type layout (as defined in [18]). In addition to preprinted information it also contains user filled-in data; the locations holding this data are called fields.

Many reviewed papers limit themselves even further: instead of considering any kind of paper-based document, they focus on documents consisting of only a single page, ignoring relationships between multiple pages. The area of interest is also mainly fill-in forms, scientific papers, and business letters. This is probably due to the fact that the most popular benchmark data sets, the University of Washington image databases UW I-III, consist of images of English-language documents selected from scientific and technical journals [9, 14, 17, 19, 20, 21, 23, 25, 26].

Document image processing is the process of recognizing, understanding, and organizing documents based on given criteria [1]. Two distinct phases are generally recognized during this step [20, 25]. During the document analysis phase the image is segmented into regions of interest. During the document understanding phase these regions are classified into different types, such as text or image, based on the geometric and physical characteristics of the objects. Logical meanings such as title, abstract, paragraph, address, or summary are also assigned to the recognized objects, and a reading order among them is established. Both of these phases are often labeled with the common name document image understanding [36].

3. Document data mining applications

Document classification has many practical applications in the recognition, understanding, organization, and retrieval of image documents.

In [2] Appiani et al. state: “A medium size bank with some hundred branches produces from 30,000 up to 100,000 account notes and enclosures a day.” In order to process this amount of paper documents efficiently in an automated fashion, every page has to be transferred to electronic form, analyzed, and labeled for further computer processing. Once in electronic form, the classification of documents into predefined categories allows the system to route them automatically for further processing, improving the workflow of data in the organization. In [2] a system named STRETCH (Storage and RETrieval by Content of imaged documents) is described. This system automates the process of classification, archiving, and retrieval of documents. Its classification method is based on a custom decision tree algorithm creating a DDT, a document decision tree. The tree nodes represent modified XY trees obtained during document segmentation.

The transfer of existing archives of paper documents and books into a more accessible electronic form provides another opportunity for document classification. Libraries keep detailed catalogues of books and journals, but producing such a catalogue manually is very labor intensive. The manual creation of 600 bibliographic records a day to populate MEDLINE, the database of the National Library of Medicine, Bethesda, Maryland, required 246 hours of retyping of information [28]. This represents the work of 30 people manually transferring data from paper to computer 8 hours a day. In [28] the authors describe MARS-2, the Medical Article Record System, which, by classifying scanned pages into known categories of publications, can automatically associate with these documents known attributes such as the publisher or the name of the journal. A familiar class of documents also makes the process of automated recognition and extraction of data from the images easier [1].

In the first two scenarios, document classification is used to assign documents to multiple classes identified beforehand, while the next two scenarios effectively classify images into two classes: those similar to an example image and those which are not.

In [11] Doermann et al. describe a scenario in which documents can be imaged and added into a single database from multiple, possibly distributed, locations. If multiple copies of the same paper document exist, the duplication of data may have undesirable effects on storage and processing requirements, or even on database integrity. In this case the detection of duplicates in the process of building a repository of digitized paper documents represents another useful application of document classification. The authors conclude that savings realized by duplicate detection in some large applications can reach up to 25% of the cost.

Advances in the efficiency and cost of hardware and software in recent years have allowed organizations to easily digitize their paper documents and thereby create large databases of images. Traditional database query languages such as SQL do not provide an efficient way to query these images. In [13] the authors describe how the concept of query by example can be an effective approach to retrieving images from a database. In their system, named IDIR (Intelligent Document Image Retrieval), the user specifies an example image and the database system returns all images that are similar to the input.

Collins-Thompson and Nickolov in [6] study the interesting problem of automatic document separation. When a large number of individual documents are digitized together by scanning or faxing, existing imaging applications effectively concatenate the pages of multiple documents into a single file. In order to recover the original document boundaries, existing applications require the user to either insert special separator pages between documents or apply special symbols such as barcodes to the first page of each document. This process can be very labor intensive, and the authors therefore studied the applicability of clustering to identify related pages, i.e., pages belonging to the same document. They extract several physical characteristics of the page, such as word height, character width, word and line spacing, and line indentation, and use these as input to a Support Vector Machine to determine page similarity. According to the authors, their work is the first published paper in this area.
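As a rough illustration of this approach, the sketch below feeds pairwise differences of the page features named above into a Support Vector Machine. It is only a minimal sketch of the idea, not the authors' implementation: the scikit-learn classifier, the pairwise-difference encoding, and the toy feature values are all assumptions.

    import numpy as np
    from sklearn.svm import SVC

    FEATURES = ["word_height", "char_width", "word_spacing",
                "line_spacing", "line_indentation"]

    def pair_vector(page_a, page_b):
        # Encode a pair of consecutive pages as absolute per-feature differences.
        return [abs(page_a[f] - page_b[f]) for f in FEATURES]

    # Toy training data: a pair is labeled 1 if both pages belong to the
    # same document, 0 if a document boundary lies between them.
    p1 = dict(word_height=11.0, char_width=5.2, word_spacing=3.1,
              line_spacing=13.5, line_indentation=20.0)
    p2 = dict(word_height=11.1, char_width=5.1, word_spacing=3.0,
              line_spacing=13.4, line_indentation=20.5)   # same document as p1
    p3 = dict(word_height=14.0, char_width=6.8, word_spacing=4.5,
              line_spacing=18.0, line_indentation=35.0)   # start of a new document

    X = np.array([pair_vector(p1, p2), pair_vector(p2, p3)])
    y = np.array([1, 0])
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([pair_vector(p1, p3)]))  # expect 0: likely a boundary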

4. Document image understanding

From the described list of problems addressed by document classification and clustering it is clear that the key element of such an application is the ability to compute similarity between documents. In order to establish similarity, the application needs to gain an understanding of the document image.

Document image understanding is defined as understanding of a “formal representation of the abstract relationships indicated by the two dimensional arrangements of the symbols” [34]. The symbols can be individual pixels of the image, or they can be higher level objects represented by these pixels. Documents can be analyzed on the physical, semantic, and functional level [10, 14, 15, 18].

Figure 1 The Relationship of geometric, functional and logical description (from [12])

Physical (geometrical) analysis is concerned with the physical representation of the content and the structure of the document. The physical characteristics of the content include attributes such as font style and size and the spacing of lines or paragraphs. The physical representation of the structure, often called the geometric page layout, expresses the division of the document page into homogeneous regions, the placement of these regions on the page, and the relationships among them. A homogeneous region can be, for example, a single letter, a line of text, a paragraph, a ruling line, or an image.

Semantic (logical) analysis is concerned with the logical meaning of the content and the structure of the document. The semantic analysis of the content tries to assign linguistic meaning to symbols; it is the object of investigation of document understanding, which refers to the “natural language aspect of one dimensional text flow” [34]. The semantic analysis of the structure encompasses recovering the logical meaning of the individual blocks of the page (such as title, subtitle, abstract, or image), ordering the blocks in reading order, and then classifying the documents into logical categories (such as journal article, invoice, or letter).

Functional analysis is concerned with the purpose documents play when conveying information to the audience. An example is given in [12], where the main purpose of a novel is identified as reading the entire content, the purpose of a magazine is browsing to become familiar with the concepts, and the purpose of a dictionary is searching, since only a portion of the document is examined for particular information.

Similarity among documents can be defined based on any of the above descriptions. Two documents can be declared similar if they use the same font typefaces and sizes, if their layout is the same (e.g. both use a two column layout), if both documents represent the same type of document (such as a business letter), or if both documents serve the same purpose (such as both being dictionaries).

In order to compute these similarities the document mining application needs to adopt many different techniques, especially those summarized by [20]: “Document understanding is to convert existing paper documents into a machine readable form. It involves estimating the rotation skew of each document page, determining the geometric page layout, labeling blocks as text, math, figure, table, halftone, etc., determining the text of text blocks through an OCR system, determining the logical structure, and formatting the data and information of the document in a suitable way for use by a word processing system, or by an information retrieval system.”

5. Human understanding of similarity

Humans are clearly able to assess the similarity, purpose, and even logical structure of documents just by looking at their physical representation (layout, font sizes, horizontal lines) without reading or even recognizing the content. In many cases humans do not even require a high resolution image to extract the required information, which is why most computer programs use thumbnails to provide a quick preview of the content of documents. Since humans are the ultimate judges of the correctness of an application's output, applications should try to emulate the human understanding of document similarity. If we therefore ignore classification using similarity based on the semantic analysis of the content of the document, it should be possible to derive all necessary information just by analyzing the physical structure of the document [3]. This realization is important because it allows us to avoid the use of an OCR (Optical Character Recognition) engine in determining the similarity of documents, since OCR is very resource intensive [11].


Figure 2 Representative images used for similarity experiments in [14]

A human relevance judgment study is described in [14, 15]. In this study, subjects were presented with 12 representative thumbnails and asked to identify similar images and rate the level of similarity over a pool of 979 thumbnails from the UW-I database. Even though these images were taken from multi-page documents, they were treated as independent images, and users were not asked to identify images related through their original document. The results of this study were then used as ground-truth data for measuring the performance of document page matching algorithms based on geometric page layout analysis. The performance of the investigated algorithm was deemed comparable to the human relevance judgment.

6. Physical structure analysis

6.1 Algorithm evaluation criteria

Document classification is usually performed in high-volume environments such as those described in [2, 28], and therefore all algorithms used should be evaluated using criteria applicable to these environments. Algorithms should be fast, to guarantee constant time feature extraction and indexing; scalable, to allow processing of large amounts of data; and accurate, to avoid misclassification. In addition, the data the algorithms extract from the image documents should be robust, to overcome degradation due to conversion; unique, to facilitate accurate classification; and compact, to allow efficient storage, retrieval, and matching of documents [11].

6.2 Noise

Before a document image can be analyzed it needs to be converted into a digital form using a scanner or a fax. During conversion the quality of the image usually degrades and undesired artifacts, also called noise, are introduced. These are usually represented by small groups of pixels randomly distributed throughout the image. To deal with the noise, either the image is processed using filters that remove these random pixels, or the document layout analysis algorithms ignore elements with a size below a certain threshold [20, 21, 23]. Noise removal represents an additional processing step, and therefore, based on the criteria outlined above, algorithms that are able to tolerate noise in the input data are more desirable.
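A minimal sketch of the threshold-based option, assuming SciPy's connected-component labeling is used to find the pixel groups; the threshold value is illustrative:

    import numpy as np
    from scipy import ndimage

    def remove_small_components(binary, min_pixels=10):
        # binary: 2-D array with 1 for ink and 0 for background.
        # Label connected components and drop those below min_pixels,
        # treating them as noise.
        labels, n = ndimage.label(binary)
        sizes = ndimage.sum(binary, labels, range(1, n + 1))
        keep = np.zeros(n + 1, dtype=bool)
        keep[1:] = sizes >= min_pixels
        return keep[labels].astype(binary.dtype)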

6.3 Skew

The alignment of the image with the horizontal and vertical axes is often broken during copying, scanning, or faxing, and the document becomes skewed. Many document layout analysis algorithms assume that the input image is deskewed [3, 9, 14]; therefore the first step of image processing is removal of the skew. Different methods for skew correction are described in several of the available papers [22, 23, 30, 37], with reviews of existing algorithms available in [25, 36, 37].

The two most popular techniques are based on the Hough transformation and on projection profiles. Techniques based on the Hough transformation can be very computationally expensive, since the transformation from Cartesian (x, y) space into the sinusoidal (ρ, θ) space is applied to every black pixel of the image. Aligned pixels, which should be the pixels corresponding to horizontal and vertical lines, raise peaks of the curves in the Hough space and allow us to detect the angle of the skew. The goal of the different improvements to this method is predominantly to limit the number of pixels that must be processed by the transformation. A common approach is, for example, to detect connected components and apply the transformation only to their centroids or to the middle points of their baselines. Similar improvements are also applied to techniques based on projection profiles. A projection profile is the accumulated count of black pixels in a given direction. Based on the fact that in most documents text is organized along horizontal lines, the count is maximized in the direction of the skew of the document [22].
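A brute-force version of the projection profile idea can be sketched as follows: rotate the binarized image over a range of candidate angles and keep the angle whose horizontal profile has the sharpest peaks. The variance criterion, angle range, and step are illustrative choices, not those of any particular paper.

    import numpy as np
    from scipy import ndimage

    def estimate_skew(binary, max_angle=5.0, step=0.1):
        # Try each candidate angle; when text lines are aligned with the
        # axis, the row profile alternates between dense text rows and
        # empty gaps, which maximizes its variance.
        best_angle, best_score = 0.0, -1.0
        for angle in np.arange(-max_angle, max_angle + step, step):
            rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
            profile = rotated.sum(axis=1)   # black pixels per row
            score = float(np.var(profile))
            if score > best_score:
                best_angle, best_score = angle, score
        return best_angle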

These algorithms usually expect the image to be scanned at high resolution (100-300 DPI) in order to identify the required features, such as connected components. Once these features are identified, the number of pixels under consideration is decreased by subsampling or by processing only the significant ones (centroids, middle points of the bottom lines). Such high resolution images, however, require more resources (CPU, memory) to process. Okun et al. introduce in [22, 23] a method to estimate the skew of the image from low resolution images “using simple text row accumulation based on the statistics of the 1st and 2nd orders” [23]. This method is therefore more suitable in a system trying to imitate the behavior described in the human judgment test. A comparison between this method and an advanced Hough transformation based method shows similar performance but better accuracy.

A different approach is presented by Breuel in [8], where the document layout analysis algorithm itself deals with image skew and therefore does not require separate skew detection and correction.


6.4 Segmentation

Segmentation is the process of dividing the document into homogeneous areas of interest, thereby describing the geometrical structure of the document. Several reviews of existing approaches are available in [15, 18, 25, 36].

An early and still popular algorithm for page segmentation was introduced by Wong et al. in [29]; it is based on the run-length smoothing algorithm (RLSA). The RLSA effectively smears together black areas separated by fewer than a predetermined number of white pixels. The RLSA is applied to the image in the horizontal and vertical directions, producing two distinct images. Homogeneous blocks of black pixels, separated from each other by areas of white pixels, are identified by applying an AND operation to these two images.
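A sketch of the smearing step, assuming a binary image with 1 for black; the thresholds are illustrative values, and real implementations choose them relative to character size:

    import numpy as np

    def rlsa(binary, h_thresh=30, v_thresh=30):
        # Smear horizontally and vertically, then AND the two results
        # to obtain candidate homogeneous blocks, as in [29].
        def smear_rows(img, t):
            out = img.copy()
            for row_out, row in zip(out, img):
                black = np.flatnonzero(row)
                for a, b in zip(black[:-1], black[1:]):
                    if b - a - 1 <= t:      # short white gap: fill it
                        row_out[a:b] = 1
            return out

        horizontal = smear_rows(binary, h_thresh)
        vertical = smear_rows(binary.T, v_thresh).T
        return horizontal & vertical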

The most popular top-down approach to document segmentation is the recursive XY cut (RXYC) page segmentation of the document image. This algorithm is based on the assumption that the homogeneous blocks in the document can be bounded by rectangular regions which are separated by white space (the original algorithm) or by horizontal and vertical lines (the modified algorithm [2]). The algorithm then recursively cuts the document along the white spaces and lines, alternating between the horizontal and vertical directions. The position where the next cut is performed is determined by the projection profile of black pixels in the vertical and horizontal directions. The output of the algorithm is a hierarchical structure whose root corresponds to the whole document and whose nodes correspond to the regions of the document extracted using the XY cuts.
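The core of the original (white-space only) variant can be sketched as follows. The gap threshold and the simple alternation scheme are simplifying assumptions; production implementations also handle noise margins and cut along detected lines.

    import numpy as np

    def xy_cut(binary, min_gap=10, horizontal=True):
        # Returns leaf regions as (top, left, bottom, right) boxes in the
        # coordinates of `binary` (1 = black).  Cuts at white valleys of
        # the projection profile, alternating the cut direction.
        h, w = binary.shape
        profile = binary.sum(axis=1) if horizontal else binary.sum(axis=0)
        gaps, start = [], None
        for i, v in enumerate(profile):
            if v == 0 and start is None:
                start = i
            elif v != 0 and start is not None:
                if i - start >= min_gap and start > 0:
                    gaps.append((start, i))   # internal white valley
                start = None
        if not gaps:
            return [(0, 0, h, w)]             # no valley: leaf region
        boxes, prev = [], 0
        cuts = [g[0] for g in gaps] + [len(profile)]
        resume = [g[1] for g in gaps] + [None]
        for cut, nxt in zip(cuts, resume):
            sub = binary[prev:cut, :] if horizontal else binary[:, prev:cut]
            for (t, l, b, r) in xy_cut(sub, min_gap, not horizontal):
                boxes.append((t + prev, l, b + prev, r) if horizontal
                             else (t, l + prev, b, r + prev))
            if nxt is None:
                break
            prev = nxt
        return boxes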

Figure 3 Document image segmentation in bottom-up fashion (from [20])


In the bottom-up approach, described for example in [20], low level connected components are first extracted from the document image. The extracted regions are classified as image or text. Projection profiles for the text components are computed and used to perform recursive XY cuts at the most globally maximal valleys in the profiles. Textual regions are grouped together along the cuts to form lines and paragraphs. This process of identification and grouping is considered [21] to be time consuming, especially if there are many connected components. Also, the information about the top-down hierarchy of the identified regions is not recovered, even though methods have been proposed to preserve this structure. The complexity of the top-down approach is considered to be lower than that of the bottom-up approach. The top-down approach also corresponds better to the human perception of layout recognition, which proceeds from coarse to more detailed examination [21].

Most of the traditional approaches require well chosen thresholds for correct feature extraction. The traditional XY cut algorithm, for example, requires 4 parameters: for vertical and horizontal noise removal and for vertical and horizontal valley detection. Lee and Ryu in [21] proposed a top-down approach that can extract all required parameters from the image itself by identifying periodicity in the projection profiles of bounded regions. This way the algorithm is able to correctly segment images with various font heights or line spacings. The authors show that their method, combined with texture analysis for component classification, performs better than the traditional algorithms. This is an important result, because the optimization process during the training phase to determine the best input values for the algorithms can be very resource intensive and can take days, as described in [25].

A model for the performance evaluation of segmentation algorithms is proposed in [25, 26], together with a performance comparison of the XY cut (top-down), Docstrum and Voronoi-diagram based (both bottom-up) algorithms as well as two commercial page segmentation algorithms. The XY cut algorithm is significantly faster, but its segmentation quality is about 10% worse than that of the other evaluated algorithms.

An alternative approach to page segmentation is described by Breuel in [7, 8, 9]. In his work document layout is determined using globally maximal white space bounding rectangles, which are rectangles that bound the white space of the page background left around the document content. Since the algorithm requires as its input an “axis-aligned set of rectangular obstacles” [8], which are the bounding rectangles around the document content, it cannot be used as a primary segmentation algorithm, but its output can be used as an alternative representation of the document layout. No comparison of the effectiveness of using this approach for document classification is provided, except the conclusion that complicated layouts may be characterized by a smaller number of bounding rectangles.

6.5 Local and global feature extraction

The extraction of global and local features can provide another input to classification algorithms. These features are statistical characteristics of the document image. Global features are extracted from the image of the whole document, while local features are extracted from blocks identified during page segmentation or from subsections of the document. They can be divided into several categories: textural, geometric, component, structural, and content based [15]. Once identified, they can be used either for the classification of identified regions into classes such as text, image, and line, required for correct segmentation of the document, or as an input to a document classification algorithm.

In [1, 2] the local features used as input to the page segmentation algorithm include the normalized coordinates of the bounding box, the average gray level of the region, and a flag indicating whether the cut used to extract the region was performed along white space or along a line. In [6] word height, character width, horizontal word spacing, line spacing, and line indentation are used for document clustering. In [14, 15] a window of constant size is used to examine the document, and for each location of the window the following features are extracted: dominant content type, column structure, and statistics for the height and width of connected components; a projection of these statistics is then computed for rows and columns of windows and captured. Global features are captured for the whole page: dominant point size, percentage of content, presence of large text, presence of table-like regions, and statistics on connected components (count, sum, mean, median, standard deviation, variance) for height, width, area, perimeter, centroid, density, circularity, aspect ratio, cavities, and pixel runs. In [15] additional window based features are described: foreground pixel window density, foreground pixel bounding box density, foreground bounding box density, foreground to background pixel ratio, median connected component height and width, and the presence of large outliers among connected components. In [29] each segmented block is described by the total number of black pixels in the segmented block, the minimum x, y coordinates of the block and its x, y lengths, the total number of black pixels in the original data of the block, and the horizontal white-black transitions of the original data. These are used to compute the height of the block, the eccentricity of the bounding box (the ratio of width to height), the ratio of the number of block pixels to the area of the bounding box, and the mean horizontal length of the black runs of the original data from each block.

The values of these features are then used as input to rule based, decision tree based, or neural network based classifiers [15, 36].
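As an example, the block measurements of [29] listed above translate almost directly into code. A minimal sketch, assuming a binary block image with 1 for black pixels; the dictionary keys are descriptive names, not the paper's notation:

    import numpy as np

    def block_features(block):
        # block: 2-D binary array cut out along the block's bounding box.
        h, w = block.shape
        black = int(block.sum())
        # Horizontal white-to-black transitions, i.e. the number of black
        # runs per row, summed over all rows of the block.
        runs = int(np.sum((block[:, 1:] == 1) & (block[:, :-1] == 0))
                   + np.sum(block[:, 0] == 1))
        return {
            "height": h,
            "eccentricity": w / h,              # width-to-height ratio
            "density": black / float(h * w),    # block pixels vs. box area
            "mean_run_length": black / max(runs, 1),
        }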

6.6 Shape coding

A special technique called shape coding can be deployed for analyzing images of text [10, 11, 13]. Using this technique, the individual symbols of the image are mapped to a set of character shape codes. The set used for the mapping is smaller than the original character set used to write the document. The individual symbols are often identified as connected components and are expected to correspond to individual letters of the text. The shape characteristics used for encoding include, for example, whether a symbol is an ascender or a descender, whether it is confined to the x-line, or whether it is punctuation.

This technique can be used for document retrieval based on keywords: the keywords are mapped to the same set of character shape codes, and the search is then performed using string matching. The technique may not be suitable for low resolution images such as thumbnails, where the image does not contain regions corresponding to letters.
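A toy version of such a coding, to make the idea concrete: the code alphabet and the character sets below are illustrative assumptions, not the exact coding used in [10, 11, 13].

    ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
    DESCENDERS = set("gjpqy")
    PUNCTUATION = set(".,;:!?'\"-")

    def shape_code(ch):
        # Map a character to a coarse code by its vertical extent.
        if ch in ASCENDERS:
            return "A"    # reaches above the x-line
        if ch in DESCENDERS:
            return "D"    # drops below the baseline
        if ch in PUNCTUATION:
            return "P"
        if ch.isalpha():
            return "x"    # confined to the x-height band
        return " "

    def encode(word):
        return "".join(shape_code(c) for c in word)

    # Keyword retrieval then reduces to string matching on shape codes:
    print(encode("paper"))   # -> DxDxx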


7. Classification of documents

The previous section has shown that the document image contains a large amount of information that can be used for the classification of documents. A brief review of existing approaches to document image classification can be found in [15, 36].

Measuring the similarity of documents based on their spatial layout is generally difficult, since it has to account for similar shapes while still handling subtle differences caused by deformations or errors in processing. A document page matching algorithm based on the geometrical structure of the page is described in [15]. The system keeps a database of segmented template images. The input image to be classified is segmented into homogeneous regions. Each region is matched against the database to identify all regions (and the documents they belong to) of the same type that intersect with the input region. Once the initial mapping is identified, a set of rules is applied to assess the quality of each match. The evaluation of the match tries to account for the situation where the segmentation algorithm incorrectly segments a block of text into too many or too few segments along line and paragraph boundaries. Vertical boundaries, on the other hand, are considered easier to detect, since they correspond to wider column dividers. The algorithm discards those mappings where the input block overlaps with two or more database image blocks that do not overlap horizontally; only the subset that has maximal overlap with the query image but does not overlap horizontally is considered for the mapping. The same criterion is also applied inversely to each database image region. Once the final mapping is identified, the algorithm computes the sum of the percentages of overlap between the queried regions and the database image regions. The sum of the percentages of queried regions which do not overlap with any database image region is also computed. Based on these two metrics, candidate template documents are ranked according to the size of the intersection. Comparison with the results of the human judgment study showed that this algorithm performs comparably to the human perception of document similarity. No performance or complexity analysis of the algorithm is provided.
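The overlap computation at the heart of such matching can be sketched for axis-aligned boxes; this is only the geometric primitive, not the full rule set of [15]:

    def overlap_fraction(a, b):
        # Fraction of region a's area covered by region b; each region is
        # a (top, left, bottom, right) box in page coordinates.
        top, left = max(a[0], b[0]), max(a[1], b[1])
        bottom, right = min(a[2], b[2]), min(a[3], b[3])
        if top >= bottom or left >= right:
            return 0.0
        inter = (bottom - top) * (right - left)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        return inter / float(area_a)

    print(overlap_fraction((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.5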

To address the fact that not all types of documents can be described using fixed spatial relationships of their homogeneous blocks, another algorithm is presented in [15] that tries to classify documents based on extracted local and global features. The document is scanned in sections identified using a window of fixed size. For each window, connected components are identified. Each connected component is then classified based on its local features into 4 classes: text component, horizontal line, vertical line, and 2D line set. In the next step a set of 20 local features is identified or computed for each window and for groups of window values. These values are then used as input to a decision tree algorithm using the OC1 software and to a Kohonen self-organizing map. The OC1 algorithm, as described in [15], uses a linear combination of features to determine a good split for each node; the goal is to divide the multidimensional space into homogeneous regions. The author used the “twoing value” as the impurity measure for the OC1 algorithm. The created decision tree is finally pruned with a factor of 20% to avoid overfitting. For the SOM the author used the G-SOM system implemented by the University of Oulu. Using decision trees in an experiment classifying several thousand tax forms from the NIST CD-ROM, the author achieved 99.7% classification accuracy, while with the SOM the obtained accuracy was 96.85%. No performance data are provided.

An algorithm which considers both the spatial structure and extracted global and local features is described in [2]. The page segmentation is performed using a modified XY cut algorithm where cuts are performed along white space as well as along horizontal and vertical lines. Before page segmentation is applied, noise is removed by discarding connected components below a threshold size. To handle broken or slightly skewed lines, a line detection algorithm based on the RLSA is used. The algorithm expects the input document to be already deskewed. The output of the algorithm is a modified XY tree where each node contains 3 pieces of information: a flag indicating whether a line was used to split the parent region, the coordinates of the subregion under consideration, and the average gray level of the subregion. The authors introduce the Document Decision Tree (DDT) as an extension of the Geometric Tree (GTree) to allow classification of the input documents based on the matching of subtrees extracted from the modified XY trees. The DDT is constructed during the training phase: modified XY trees of the template documents are inserted into the tree, and a pattern sharing technique is applied so that the parent of each node shares the subtree which is common to all of its children. The classification is performed by walking the tree and computing the similarity between the input tree and each node. Once the classification using the DDT is performed and the class is determined by identifying a leaf node, a logical classification based on feature extraction is used. The choice of the logical classification strategy varies based on the outcome of the decision tree: in the case of a classification with high confidence, a class specific strategy is used; if the confidence is low, a class independent strategy is used; and if the DDT classification results in multiple template documents with a high confidence level, the algorithm deploys a class dependent strategy based on each potential class of the image. The success rate of this classification technique was 97.8%, and the algorithm took only about a second to process an individual image.
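The node contents described above suggest a simple data structure; the sketch below uses illustrative field names, since [2] does not prescribe an implementation:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class XYNode:
        # One node of the modified XY tree of [2]: whether the cut that
        # produced this region ran along a detected line (vs. white space),
        # the subregion coordinates, and its average gray level.
        cut_on_line: bool
        box: Tuple[int, int, int, int]   # (top, left, bottom, right)
        avg_gray: float
        children: List["XYNode"] = field(default_factory=list)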

In a follow-up work [1], the authors replaced the document decision tree and the logical classification based on local features with Hidden Tree Markov Model (HTMM) classification. The HTMM is an extension of the Hidden Markov Model that is able to learn probability distributions over labeled trees instead of the traditional sequential domain. During the training phase a separate HTMM is trained for each class. During the evaluation phase the segmentation of the input document is performed the same way as in [2]; each HTMM is then applied to the segmented tree, and the resulting class is the one corresponding to the model with the highest output probability. Compared to [2] the authors achieved a success rate as high as 99.28% and decreased the relative error by 34%. No performance data are provided.

An alternative approach to document classification is described in [3, 4, 5]. The input to the algorithm consists of segmented and deskewed document pages. Only text regions are considered; a local classification therefore has to be performed beforehand. The image is divided into a grid of fixed size where each cell is called a bin and roughly corresponds to a window used in [15]. Each bin is classified as a text bin if at least half of its area overlaps a single text block, or as a whitespace bin otherwise. A block in a row is defined as a maximal consecutive sequence of text bins, and the authors define several metrics for computing the distance between rows: edit distance, Manhattan distance, interval distance, and cluster distance. Edit distance was found to be the most accurate measure of row similarity, but it can also be very computationally expensive; interval and cluster distance are very accurate approximations of edit distance. Using the defined similarity metrics, the authors applied the k-means clustering algorithm to identify a cluster for each row. The identified clusters are then used as input to a Hidden Markov Model; to satisfy the sequential domain to which the HMM applies, the input is examined in a top-to-bottom fashion. The results show that the algorithm successfully classifies the input document when the top two output classes are considered, but when only the top class is considered, the accuracy is only around 70-80%. No performance data are provided.
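For illustration, rows can be encoded as strings of text ('T') and whitespace ('W') bins, and the edit distance computed by the standard dynamic program. This sketches the row metric only; the encoding alphabet is an assumption:

    def row_edit_distance(row_a, row_b):
        # Levenshtein distance between two row signatures such as
        # "TTTWWTTT" (bins: T = text, W = whitespace).
        m, n = len(row_a), len(row_b)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if row_a[i - 1] == row_b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + cost) # substitution
        return dist[m][n]

    print(row_edit_distance("TTTWWTTT", "TTTTWWTT"))  # -> 2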

8. Summary

In this paper we have discussed the classification of digitized paper documents. The set of applications described in section 3 allows us to conclude that this topic addresses a very manually intensive area of document processing and can result in considerable savings in both the human work and the cost associated with these applications. In section 4 we discussed the different levels at which document classification can be performed. By considering the human understanding of similarity in section 5, we were able to conclude that the physical structure of the document should contain enough information to allow us to successfully classify the document at most of the levels described in section 4, without deploying OCR to read the content of the document. In section 6 we discussed the techniques required to perform document image analysis and extract useful data from the document image. We considered both the preprocessing phase, when the data is cleansed using noise and skew removal, and the data extraction phase, when the physical layout structure of the document is determined using segmentation algorithms and additional characteristics of the document image and the information it conveys are extracted by direct examination of either the pixels of the image or the connected components identified during the segmentation phase. In section 7 we described document classification algorithms that utilize the data extracted using the techniques from section 6 to classify an input document image into one of a set of predetermined classes. We described algorithms that utilize the spatial structure, global and local features, or both. A wide set of classification techniques is utilized by these algorithms: decision trees and their extensions (DDT), self-organizing maps, hidden Markov models and their extensions (HTMM), as well as custom methods based on determining the overlap between segmented document images.

From the presented discussion it seems that there is a very good understanding of efficient techniques for examining document images and extracting useful information from them, but the area of document image classification is still in its infancy. We were not able to find a set of clear guidelines, or a clear understanding, of what data from the images are actually useful for document classification and how to map them to existing data mining classification techniques. I believe we will see more research in this area in the future. A clear performance comparison covering the various algorithms is also needed, since many algorithms require extensive preprocessing steps in order to obtain useful data to work with.


9. References

1. Michelangelo Diligenti, Paolo Frasconi, Marco Gori, “Hidden Tree Markov Models for Document Image Classification”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 25, no 4, pages 519-523, 2003. Available at http://citeseer.nj.nec.com/diligenti03hidden.html

2. E. Appiani, F. Cesarini, A.M. Colla, M. Diligenti, M. Gori, S. Marinai, G. Soda, “Automatic document classification and indexing in high-volume applications”, International Journal on Document Analysis and Recognition, vol 4, no 2, pages 69-83, 2001. Available at http://www.softlab.ntua.gr/facilities/public/AD/Text%20Categorization/Automatic%20document%20classification%20and%20indexing%20in%20high-volume%20applications.pdf

3. Jianying Hu and Ramanujan Kashi and Gordon Wilfong, “Comparison and Classification of Documents Based on Layout Similarity”, Information Retrieval, vol 2, no 2/3, pages 227-243, 2000. Available at http://citeseer.nj.nec.com/282828.html

4. Jianying Hu and Ramanujan Kashi and Gordon T. Wilfong, “Document image layout comparison and classification”, Proceedings of the International Conference on Document Analysis and Recognition, pages 285-288, 1999. Available at http://citeseer.nj.nec.com/397390.html

5. Jianying Hu and Ramanujan Kashi and Gordon T. Wilfong, “Document Classification Using Layout Analysis”, DEXA Workshop, pages 556-560, 1999, Available at http://citeseer.nj.nec.com/hu99document.html

6. Kevyn Collins-Thompson and Radoslav Nickolov, “A Clustering-Based Algorithm for Automatic Document Separation”, Proceedings of the SIGIR 2002 Workshop on Information Retrieval and OCR, Available at http://citeseer.nj.nec.com/collins-thompson02clusteringbased.html

7. Thomas Breuel, “High Performance Document Layout Analysis”, 2003 Symposium on Document Image Understanding Technology Greenbelt Marriott, Greenbelt Maryland, Available at http://citeseer.nj.nec.com/568589.html

8. Thomas M. Breuel, “An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis”, Proceedings of the ICDAR 2003, vol 1, pages 66-70, 2003. Available at http://citeseer.nj.nec.com/561114.html

9. Thomas M. Breuel, “Two Geometric Algorithms for Layout Analysis”, Proceedings of Document Analysis Systems V, 5th International Workshop, DAS 2002, vol 2423, pages 188-199, Available at http://citeseer.nj.nec.com/517770.html

10. David Doermann, “The Indexing and Retrieval of Document Images: A Survey”, Computer Vision and Image Understanding: CVIU, vol 70, no 3, pages 287-298, 1998. Available at http://citeseer.nj.nec.com/doermann98indexing.html


11. D. Doermann and H. Li and O. Kia, “The detection of duplicates in document image databases”, Proceedings of the International Conference on Document Analysis and Recognition, pages 314-318, 1997. Available at http://citeseer.nj.nec.com/doermann97detection.html

12. D. Doermann, E. Rivlin, and A. Rosenfeld, “The function of documents”, International Journal of Computer Vision, vol 16, pages 799–814, 1998. Available at http://citeseer.nj.nec.com/doermann96function.html

13. D. Doermann, J. Sauvola, H. Kauniskangas, C. Shin, M. Pietikainen, and A. Rosenfeld, “The development of a general framework for intelligent document image retrieval”, In Document Analysis Systems, pages 605-632, 1996. Available at http://citeseer.nj.nec.com/doermann96development.html

14. Christian K. Shin and David S. Doermann, “Classification of document page images based on visual similarity on layout structures”, Proceedings of the SPIE, vol. 3967, pages 182-190, 2000. Available at http://citeseer.nj.nec.com/shin00classification.html

15. Christian K. Shin, “The Roles of Document Structure in Document Image Retrieval and Classification”, PhD. Thesis, 2000. Available at http://lamp.cfar.umd.edu/pubs/Theses/Shin00.pdf

16. P. Duygulu, V. Atalay, “A Hierarchical Representation of Form Documents for Identification and Retrieval”, International Journal on Document Analysis and Recognition, vol 5, no 1, pages 17-27, 2002. Available at http://citeseer.nj.nec.com/502087.html

17. J. Liang, R. Rogers, R.M. Haralick, and I.T. Phillips. “UW-ISL Document Image Analysis Toolbox: An Experimental Environment”, Proceedings of the 4th International Conference on Document Analysis and Recognition, pages 984-988, 1997. Available at http://citeseer.nj.nec.com/liang97uwisl.html

18. Robert M. Haralick, “Document Image Understanding: Geometric and Logical Layout”, Proceedings of the CVPR94: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 385-390, 1994. Available at http://citeseer.nj.nec.com/haralick94document.html

19. J. Liang, I.T. Phillips, and R.M. Haralick, “Performance evaluation of document layout analysis on the UW data set”, Proceedings Document Recognition IV, pages 149-160, 1997. Available at http://citeseer.nj.nec.com/liang97performance.html

20. J. Liang, J. Ha, R.M. Haralick, and I.T. Phillips, “Document layout structure extraction using bounding boxes of different entities”, Proceedings of the 3rd IEEE Workshop on Applications of Computer Vision, pages 278-283, 1996, Available at http://citeseer.nj.nec.com/liang96document.html

21. Seong-Whan Lee and Dae-Seok Ryu, “Parameter-Free Geometric Document Layout Analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1240-1256, 2001. Available at http://citeseer.nj.nec.com/558908.html


22. Oleg Okun, Matti Pietikainen and Jaakko Sauvola, “Document skew estimation without angle range restriction”, International Journal of Document Analysis and Recognition, vol 2, pages 132-144, 1999. Available at http://www.mediateam.oulu.fi/publications/pdf/18.pdf

23. Oleg Okun, Matti Pietikainen and Jaakko Sauvola, “Robust skew estimation on low-resolution document images”, Proceedings of the 5th International Conference on Document Analysis and Recognition, Bangalore, India, pages 621-624, 1999. Available at http://citeseer.nj.nec.com/okun99robust.html

24. Song Mao, Azriel Rosenfeld, Tapas Kanungo, “Document Structure Analysis Algorithms: A Literature Survey”, 2003. Available at http://archive.nlm.nih.gov/pubs/mao/mao03.pdf

25. Song Mao and Tapas Kanungo, “Empirical Performance Evaluation Methodology and Its Application to Page Segmentation Algorithms”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, 2001. Available at http://lhncbc.nlm.nih.gov/lhc/docs/published/2001/pub2001008.pdf

26. S. Mao and T. Kanungo, “Empirical performance evaluation of page segmentation algorithms”, Proceedings of SPIE Conference on Document Recognition and Retrieval VII, vol 3967, pages 303-314, 2000. Available at http://citeseer.nj.nec.com/mao00empirical.html

27. Richard Zanibbi, Dorothea Blostein, James R. Cordy, “A Survey of Table Recognition: Models, Observations, Transformations, and Inferences”, International Journal of Document Analysis and Recognition, accepted November 2003. Available at http://www.cs.queensu.ca/home/zanibbi/files/IJDAR_Tables.pdf

28. George R. Thoma, Glenn Ford, “Automated data entry system: performance issues”, Proceedings of SPIE vol. 4670, 2002, Available at http://archive.nlm.nih.gov/pubs/thoma/spie2002mars/spie2002mars.pdf

29. K. Y. Wong, R. G. Casey, F. M. Wahl, “Document Analysis System”, IBM Journal of Research and Development, vol 26, no 6, pages 647-657, 1982. Available at http://www.research.ibm.com/journal/rd/266/ibmrd2606B.pdf

30. Zhixin Shi, Venu Govindaraju, “Skew Detection for Complex Document Images Using Fuzzy Runlength”, Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol 2, pages 715-720, 2003. Available at http://www.cedar.buffalo.edu/~zshi/Papers/skew.pdf

31. Thomas G. Kieninger, “Table structure recognition based on robust block segmentation”, Proceedings of the Document Recognition V, SPIE, vol 3305, pages 22-32, 1998. Available at http://citeseer.nj.nec.com/kieninger98table.html

32. History of paper, Available at http://www.paperonline.org/history/history_frame.html

33. Preston Chesser, “The Burning of the Library of Alexandria”, Available at http://www.ehistory.com/world/articles/ArticleView.cfm?AID=9


34. G. Nagy, “Twenty years of document image analysis in PAMI”, IEEE Transaction on Pattern Analysis and Machine Intelligence, vol 22, no 1, pages 38–62, 2000. Available at http://www.ecse.rpi.edu/homepages/nagy/PDF_files/Nagy-20yearsDIApami.pdf

35. Yuan Y. Tang, M. Cheriet, Jiming Liu, J. N. Said, Ching Y. Suen, “Document Analysis And Recognition By Computers”, Handbook of Pattern Recognition and Computer Vision (2nd Edition), 1999. Available at http://citeseer.nj.nec.com/tang99document.html

36. R. Cattoni, T. Coianiz, S. Messelodi, and C. M. Modena, “Geometric layout analysis techniques for document image understanding: a review”, Technical report, IRST, Trento, Italy, 1998. Available at http://citeseer.nj.nec.com/article/cattoni98geometric.html