Pink Panther: A Complete Environment for Ground-truthing and Benchmarking Document Page Segmentation 1

Berrin A. Yanikoglu a and Luc Vincent b

aIBM Almaden Research Center, 650 Harry Road, San Jose, CA [email protected]

bXerox Desktop Document Systems, 3400 Hillview Avenue, Palo Alto, CA [email protected]

We describe a new approach for the automatic evaluation of document page segmentation algorithms. Unlike techniques that rely on OCR output, our method is region-based: segmentation quality is assessed by comparing the segmentation output, described as a set of regions, to the corresponding ground-truth. Error maps are used to keep track of all the errors associated with each pixel, regardless of the document complexity. Misclassifications, splitting, and merging of regions are among the errors detected by the system. Each error can be weighted individually and the system can be customized to benchmark virtually any type of segmentation task.

1 Introduction

Page segmentation is the process by which a document page image is decomposed into its structural and logical units (regions or zones), such as images, paragraphs, headlines, tables, etc. This process, also referred to as zoning, layout analysis, or page decomposition, is critical for a variety of document image analysis applications. Once the regions are found (zoning) and their types identified (labeling), they may also need to be ordered (ordering) according to the natural reading order(s) of the page. Region ordering is sometimes considered part of page segmentation itself.

1 This work was done when the authors were at Xerox Desktop Document Systems, Peabody, MA.

Document recognition systems rely heavily on page segmentation in order to "understand" the structure of a document and ultimately be able to recompose it in a word processing environment. Segmentation provides the necessary coarse-level understanding of the document: discriminating between text and graphics, separating a column of text from an adjacent one, etc. Furthermore, reading machines for the blind need accurate page segmentation to be able to read the text in multi-column documents in a correct order. Even digital copiers and related software/hardware systems rely on document segmentation techniques to correctly identify the various types of regions present on a page (e.g., text, continuous tone images, halftones, line-art), and render them appropriately [1].

Preprint submitted to Elsevier Preprint 8 September 1997

Until recently, OCR accuracy was the only aspect of document recognition systems that was commonly benchmarked. However, nowadays sophisticated document recognition systems are handling very complex documents such as newspapers, magazines, and junk mail, and thus it is becoming increasingly important to be able to benchmark page segmentation as well. A segmentation benchmarking tool would be helpful in deciding between commercial systems for a given application, as well as enabling developers of page segmentation systems to easily test their algorithms.

Several different page segmentation algorithms have been developed in the past couple of decades, some described in the literature [2] and some proprietary [3]. Among the many different approaches to the problem are rule-based systems [4], use of connected component bounding-boxes [5], and analysis of background and white streams information [6-8]. Some of these algorithms are specifically designed to work on particular types of document, whereas others are meant to be completely general. Furthermore, some algorithms are designed for very specific subtasks of what is generally referred to as page segmentation: for example, some may simply detect the graphics in a document page image, while others may only care about the text. Some algorithms generate coarse segmentations (e.g., at the galley level) while others decompose pages down to the paragraph level, or even down to the line level. For some applications, a page segmentation algorithm may need to distinguish between halftones and continuous tones; for others, the distinction is irrelevant.

Faced with such a diversity of methods and such a wide variety of goals, a question arises: how should the quality of a given segmentation algorithm for a particular segmentation task be assessed? The accuracy of a page segmentation system is typically evaluated by running the system on a large number of document images and "eyeballing" the results, a tedious and subjective process. Benchmarking of document page segmentation is further complicated by the various ways segmentation mistakes can be handled. For instance, if the segmentation mistakes will be corrected manually (i.e., by re-zoning parts of the document), one may want to evaluate the segmentation performance in terms of how many regions need to be re-zoned to correct the segmentation. On the other hand, if the segmentation output is to be fed directly to the OCR system, one might be concerned with the number of images that are classified as text, the number of merged text columns, etc.

In this paper we describe a complete environment, called Pink Panther, for creating segmentation ground-truth files and benchmarking page segmentation algorithms. The performance of a given page segmentation system is evaluated by running the system on a large number of document images and comparing the output for each document to the corresponding, previously created ground-truth. We have developed a ground-truthing tool, called GroundsKeeper, for easily creating the ground-truth data, as explained in Section 3. Both GroundsKeeper and the benchmarking tool (Cluzo) can be tailored to a variety of ground-truthing and benchmarking needs, respectively.

The paper is organized as follows. In the next section, we describe the previous work in this area. Section 3 describes our ground-truthing methodology, the tool we developed to create segmentation ground-truth files, and the file format used to represent the ground-truth information. In Section 4, we describe the benchmarking process. Finally, we show how the system can be customized to a wide variety of applications.

2 Background and Previous Work

Two approaches have been proposed in the literature for automatic benchmarking of page segmentation. The first one, developed at the University of Nevada in Las Vegas (UNLV), is purely text-based [9-11], whereas the approach we have chosen is region-based [12,13]. These two approaches are briefly described below, and we explain why we believe our region-based approach offers significant advantages over text-based approaches.

2.1 Text-Based Approaches

The Information Science Research Institute (ISRI) of the University of Nevada in Las Vegas has designed an interesting technique to evaluate the quality of page segmentation algorithms working within OCR software packages. Briefly, this method works as follows:

First, page segmentation (including region ordering) and character recognition subsystems of an OCR system are applied to the document page, and the result is output as an ASCII string.

Then, using string matching algorithms [14], the numbers of insertions, deletions, and block moves necessary to convert this output into the ideal ground-truth string while minimizing the overall cost are determined. The overall cost is computed as $C = n_i \times c_i + n_m \times c_m$, where $c_i$ is the cost of an elementary insertion operation, $c_m$ is the cost of a block move, and $n_i$ and $n_m$ are the number of insertions and moves, respectively. The cost of a deletion is set to zero (for reasons explained below).

Finally, in order to find the portion of the overall cost that is due to segmentation mistakes alone, the cost of errors that are purely due to character recognition mistakes is subtracted from the overall cost, C. This cost is determined by running the OCR system on the same page, this time using manual segmentation.
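As a rough illustration of the cost arithmetic described above, the sketch below computes C from the insertion and move counts and subtracts the cost measured with manual segmentation. The function names, the example counts, and the numeric costs are hypothetical; the string-matching step that produces the counts [14] is not shown.

```python
def edit_cost(n_insertions, n_moves, c_insert=1.0, c_move=5.0):
    """Overall cost C = n_i * c_i + n_m * c_m (deletions cost zero)."""
    return n_insertions * c_insert + n_moves * c_move


def zoning_cost(auto_counts, manual_counts, c_insert=1.0, c_move=5.0):
    """Cost attributed to segmentation alone.

    auto_counts and manual_counts are (n_insertions, n_moves) pairs obtained
    with automatic and manual segmentation of the same page, respectively.
    """
    cost_auto = edit_cost(*auto_counts, c_insert=c_insert, c_move=c_move)
    cost_manual = edit_cost(*manual_counts, c_insert=c_insert, c_move=c_move)
    return cost_auto - cost_manual


# Hypothetical counts: 120 insertions and 4 block moves with automatic
# zoning, 100 insertions and no moves with manual zoning.
example = zoning_cost((120, 4), (100, 0))
```

The subtraction only isolates segmentation errors if character recognition accuracy does not itself depend on the segmentation.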

One of the main points in favor of this technique is that it is purely text-based, and therefore does not require the page segmentation subsystem to use any particular output format. In addition, although its underlying string-matching algorithms are rather elaborate, the overall approach is fairly straight-forward and the ground-truth files for use with this approach (text ground-truth) are very easy to create. Therefore, the UNLV zoning evaluation system has been well accepted by the document recognition community.

Nonetheless, this system has some limitations, listed below:

- The system can only deal with text regions: the measured accuracy only reflects the accuracy on segmenting text zones, and the segmentation of document images containing no text cannot be evaluated.

- The output is merely a set of numbers (number of insertions, block moves, and perhaps deletions). It therefore provides very little information on the types of mistakes that were actually made (whether regions were split or merged, images mistaken for text, or headline regions segmented less correctly than regular text regions, etc.). Such information would be very valuable in improving segmentation algorithms.

- Only one zone order (or at most a small set of zone orders) is considered correct. If a zoning algorithm produces a perfect segmentation, but does not output regions in one of the few chosen orders, it is penalized. This is too rigid, since in many cases, region order is not uniquely defined (see Section 3.1). For example, how should such text regions as captions, headers, footers, or insets be ordered? In an ideal benchmarking system, all possible reading orders should be considered equally correct.

- The current metric only takes insertions and block moves into account, while deletions are assumed to have zero cost. The reason for this is to avoid penalizing systems that recognize "too much", such as recognizing text inside images that does not appear in the ground-truth. This means that an algorithm will not be penalized for classifying noise regions as text, or worse, for detecting images as text.

- The method requires a zoning algorithm to be part of an OCR system in order to be able to benchmark it; hence, stand-alone segmentation algorithms cannot be benchmarked. Furthermore, the zoning subsystem of an OCR system cannot be evaluated on segmenting documents in languages that are not supported by that OCR system.

- In order to be able to "subtract" the character errors from the total cost, the character recognition performance needs to be independent of the segmentation performance. This is often not true: a "messy" segmentation generally results in poor character recognition performance (e.g., if the segmentation splits or joins two galleys, the hyphens are incorrectly paired and the OCR engine can lexically verify fewer words). Alternatively, a poor OCR engine might "hide" some of the segmentation errors when the errors overlap.

2.2 Proposed Region-Based Approach

For the reasons listed above, we have taken a region-based approach to benchmarking page segmentation: the segmentation quality is evaluated by comparing the segmentation output and the corresponding ground-truth, which are both ASCII files describing the regions on the page. A region is represented as an arbitrary polygon (the outline of a zone) together with various attributes, such as its type, subtype and parent zone, as explained in Section 3.1.

To find the correspondence between the segmentation and the pre-stored ground-truth, first the overlaps between the segmentation regions and the ground-truth regions are determined. This is done using the regions' ON-pixel contents instead of their polygonal representations: each region is modeled as the set of ON-pixels its associated polygon contains. This way, the detailed shape of region polygons is not taken into account, and insignificant differences between ground-truth and segmentation polygons are automatically ignored.

By analyzing the region correspondences, one can detect such segmentation problems as missed, merged, split or misclassified ground-truth regions, and extraneous segmentation regions. This overall scheme is efficiently implemented through the use of region maps and error maps, as described in Section 4. As shown later, this system is more complex and cumbersome to use than a text-based system, but offers greater flexibility and avoids most of the limitations of the previous approach.

3 Creation of Ground-Truth Files

A reasonably large number of document images and their segmentation ground-truth is needed to evaluate a segmentation algorithm. An X-Windows tool, called GroundsKeeper, was developed to speed up the creation of ground-truth files. This tool, described in Section 3.2, allows the user to view a document image, zone it with simple mouse clicks, and specify information about each zone, as well as the ordering relationships between zones. The zoning information can then be saved in a simple ASCII format, called RDIFF (Region Description Information File Format), as the ground-truth file.

3.1 Representation of Ground-Truth Information

The RDIFF format used to represent page segmentation information (ground-truth or segmentation) is a simple ASCII format containing some header information such as the name of the associated document image, the resolution, and the page size, followed by a list of region descriptions. Each region has an associated type (e.g., text, image) and subtype (e.g., headline, caption), may refer to a parent region, and may have an arbitrary number of attributes (e.g., reverse-video). Region shape itself is an arbitrary polygon, specified by the coordinates of its vertices. The parent zone information is used to specify hierarchical information. For instance, cells inside a table are zoned individually and point to the table (zoned as a whole) as their parent region. Similarly, any text inside a halftone is zoned separately and points to the halftone as its parent region.

Even though the RDIFF format defines a type and a subtype for a region, the type of a given subtype can be redefined for a given benchmarking task. For example, a logo is defined as an image subtype by default; however, if a particular application wanted to treat logos as text regions and evaluate the segmentation accordingly, it could do so by overriding the type of a logo for the benchmarking process (see Sections 3.2 and 4.1).

Some of the features of the RDIFFformat are as follows:

- It is a tag-value based format: each entry in the format starts with a tag, followed by the value of this tag. The advantage of this approach is its flexibility: an RDIFF reader typically only knows about the tags that are relevant to its purpose and ignores the rest. In addition, even as new tags are added with each new release of the format, the new RDIFF files remain backwards compatible with previously written RDIFF readers.

- It can easily be expanded. For instance, region shape is currently an arbitrary polygon. This is sufficient in most applications; however, if it became important to support regions with holes, or disconnected regions, new region representations (e.g., lists of polygons) could be added and a new tag could be used to specify the representation scheme chosen for each region (the default scheme being the single polygon).

- It imposes no restrictions on region locations: regions can overlap, be included in one another, etc.

An example of an RDIFF file is given in Figure 1.
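The tag-value design can be illustrated with a minimal reader that keeps only the tags it cares about and silently skips everything else. This is a sketch under assumptions, not the RDIFF reader used by the authors: the function name is hypothetical, only tags visible in the sample file are used, and the exact grammar of the format (polygon blocks in particular) is not handled.

```python
def read_rdiff_regions(text, wanted=("R_SUBTYPE", "R_PARENT", "R_NAME")):
    """Collect selected tags for each region; unknown tags are skipped."""
    regions = {}
    current = None
    in_regions = False
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # blank line
        tag = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if tag == "BEGIN_REGIONS":
            in_regions = True
        elif not in_regions:
            continue  # header and summary entries are ignored here
        elif tag == "R_NUMBER":
            current = int(value)
            regions[current] = {}
        elif tag in wanted and current is not None:
            regions[current][tag] = value
    return regions


sample = """BEGIN_REGIONS
R_NUMBER 1
R_SUBTYPE HALFTONE
R_PARENT 0
R_DRAW_TYPE Box
R_NAME Zone1
R_NUMBER 9
R_SUBTYPE CAPTION
R_ATTACHMENT 1
R_PARENT 0
R_NAME Zone9
"""
regions = read_rdiff_regions(sample)
```

Because unrecognized tags such as R_DRAW_TYPE are simply ignored, a reader written this way keeps working when later versions of the format add new tags.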

Partial Region Ordering

Note the R_ORDER_RULE at the end of the RDIFF file shown in Figure 1. This rule represents a set of ordering constraints between regions, where X < Y indicates that region X precedes region Y. This provides us with a partial ordering of the zones describing the necessary ordering relationships among the regions. As mentioned in Section 2.1, one complete ordering of all regions is too strict, and instead, a number of ordering constraints (or rules) should be used to represent all possible region orderings of a document. This way we can, for example, order regular text regions and figures in two separate orders, using multiple R_ORDER_RULE tags. Fig. 2 shows an example where several ordering constraints are used to order figures belonging to each column among themselves, and to enable either of the two possible reading orders of the last four regions. Partial ordering thus makes it possible to deal with ambiguous cases by specifying only the necessary constraints.

    BEGIN_IMAGE
    FILENAME /data/tif/h/ht21.tif
    IMAGE_WIDTH 2560
    IMAGE_HEIGHT 3300
    IMAGE_XRES 300
    IMAGE_YRES 300
    END_IMAGE
    BEGIN_SUMMARY
    INIT_FILE_VERSION 1.8 1995/04/18
    TOTAL_REGIONS 25
    TEXT_REGIONS 22
    IMAGE_REGIONS 3
    END_SUMMARY
    BEGIN_REGIONS
    R_NUMBER 1
    R_SUBTYPE HALFTONE
    R_ATTACHMENT 0
    R_PARENT 0
    R_DRAW_TYPE Box
    R_NAME Zone1
    R_ATTRIBUTES SHADED_BACKGROUND
    REGION_POLYGON 301 692 71 681
    ...
    R_NUMBER 9
    R_SUBTYPE CAPTION
    R_ATTACHMENT 1
    R_PARENT 0
    R_DRAW_TYPE Box
    R_NAME Zone9
    ...
    R_NUMBER 19
    R_SUBTYPE REG_TEXT
    R_ATTACHMENT 0
    R_PARENT 0
    R_DRAW_TYPE Constrained-Polygon
    R_NAME Zone19
    COUNT 6
    774 428
    982 428
    982 1086
    ...
    R_ORDER_RULE 18<19<20<21<22<23

Fig. 1. Sample RDIFF file.
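To make the partial-ordering idea concrete, the sketch below parses rules written in the `18<19<20` style of the sample file and checks whether a proposed reading order satisfies all of them. The function names are hypothetical, and ignoring regions that are absent from the reading order is an assumption of this sketch; the paper itself notes that region ordering accuracy is not currently benchmarked.

```python
def parse_order_rule(rule):
    """Turn '18<19<20' into precedence pairs [(18, 19), (19, 20)]."""
    ids = [int(tok) for tok in rule.split("<")]
    return list(zip(ids, ids[1:]))


def respects_rules(reading_order, rules):
    """True if every 'X precedes Y' constraint holds in reading_order."""
    position = {region: i for i, region in enumerate(reading_order)}
    for rule in rules:
        for before, after in parse_order_rule(rule):
            # Constraints on regions missing from the order are skipped.
            if before in position and after in position:
                if position[before] >= position[after]:
                    return False
    return True


# The last two constraints of the Fig. 2 example: regions 10..13 may be
# read as 10, 11, 12, 13 or as 10, 12, 11, 13.
rules = ["10<11<13", "10<12<13"]
order_a_ok = respects_rules([10, 11, 12, 13], rules)
order_b_ok = respects_rules([10, 12, 11, 13], rules)
order_c_ok = respects_rules([10, 13, 11, 12], rules)
```

Both admissible reading orders pass, while an order that places region 13 before region 11 is rejected.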

Even a partial ordering is not always sufficient to fully and unambiguously represent the relationships between zones of a complicated document. In some cases, two regions are not ordered with respect to each other but are logically associated (e.g., figures and their corresponding captions). In this case, the caption is "attached" to the image. Attachments are specified with the R_ATTACHMENT tag in the RDIFF language. The difference between attachments and orderings is that, for instance, even though a caption does not have to come before or after the figure it belongs to, if the figure is moved the caption should be moved with it. Currently, region ordering accuracy is not benchmarked.

Fig. 2. The order between regions in this sample page can be specified using five partial orders: R2 < R3, R6 < R7 < R8, R1 < R4 < R5 < R9 < R10, R10 < R11 < R13, and R10 < R12 < R13. Note that the last two constraints indicate that the regions R10 ... R13 can be read in one of two ways.

3.2 GroundsKeeper

To create test suites of segmentation ground-truth files in the RDIFF format, we developed an X-Windows based tool, called GroundsKeeper (see Fig. 3). GroundsKeeper allows a user to view a document image and draw zones of various types around the different page regions, using simple mouse clicks. The zones can be drawn using one of the following drawing methods: rectangles (sufficient for most cases), constrained polygons (polygons whose edges are vertical or horizontal), and arbitrary polygons (useful for complex layouts, or skewed pages).

After drawing a zone, one can label it with its type, subtype, parent zone, attached zones, and any number of attributes. Types, subtypes, and attributes shown in menu selections are defined in a start-up file: they are therefore completely customizable for any particular application. For instance, we use the start-up file shown in Fig. 4 for our segmentation ground-truthing. Note that we chose to create fairly detailed ground-truth files, since one can always use less information than these files contain.

Each new zone is given a unique identifying number, displayed on the screen, and can be given a name (derived from its number by default). Once created, zones can be deleted, moved, or modified. Furthermore, previously created ground-truth files can be read and updated.

Region ordering rules (sequences) are specified by clicking on regions, one after the other; a sequence is then displayed as a succession of arrows on the screen. Multiple region ordering rules can be created in any order by starting a new sequence each time.

By default, the image is shown with the regions and the sequences drawn over it, but one can also display only the regions, the regions and sequences, or the image alone. The ability to display the image and the regions at different zoom levels is also available. In fact, a good way to use GroundsKeeper is to first draw the regions coarsely at low resolution, and then zoom in and adjust them.

GroundsKeeper has a number of additional features designed to make the creation of RDIFF files easier, such as the inheritance of region information. For instance, when zoning multiple cells inside a table, the user needs to input the region subtype (cell) and the parent (the table) information only once. The cells zoned subsequently would inherit this information by default, significantly speeding up the zoning process. In fact, for most pages, ground-truthing time is around five minutes.

Using GroundsKeeper we have created half a dozen test suites of ground-truthed documents (magazines, flyers, newspapers, business letters, fax, tables), for a total of about 800 RDIFF files.

Fig. 3. GroundsKeeper window showing a document image together with some ground-truth regions created. Ordering constraints between regions are shown as arrows.

Creating segmentation ground-truth is not always straight-forward. A ground-truthing handbook was therefore written in which potentially ambiguous cases were described, in order to take as much subjectivity out of the ground-truthing process as possible. One of the decisions we have faced was the necessary level of detail; we have decided that, within limits, zones should be logically and structurally uniform, and that paragraphs were the smallest units of interest of regular text (i.e., we do not zone lines or characters separately). Figure 4 shows all the possible subtypes and attributes currently used in ground-truthing.

Furthermore, in order to remove the main source of ground-truthing ambiguity, which is deciding on the label (subtype) of a region, we have expanded GroundsKeeper to allow multiple region labels for a given region. However, the benchmarking algorithm currently uses only the first label of a region.

4 Our Segmentation Benchmarking Algorithm

The performance of a given page segmentation system is evaluated by running the system on a large number of document images, organized in different test suites (e.g., magazine articles, flyers, business letters). Then, the output of the segmentation algorithm for each document is compared to the corresponding, previously created ground-truth, and various types of segmentation errors, such as misclassifications of region types, missed pixels, and splitting and merging of regions, are detected.

As mentioned before, our approach is region-based: the segmentation output describing the regions on the page is compared to the corresponding ground-truth of the same form. A region-based approach was previously thought to be infeasible due to the lack of standardization of region representation schemes (e.g., boxes, polygons), which makes it difficult to find the correspondence between the ground-truth and the segmentation regions. This problem is handled by comparing the regions not by their shapes but by their ON-pixel contents.

Hence, two regions that enclose the same image area (e.g., a headline) are correctly considered identical regardless of how tightly they enclose the area or what region representation schemes are used.

In the rest of the paper, S and G refer to the set of segmentation and ground-truth regions, respectively, and we use the word "overlap" to denote an overlap in ON-pixel contents.

Region Maps

In order to efficiently deal with regions as sets of ON-pixels, the first step in our system is to create region maps, one for the ground-truth and another for the segmentation regions. Each map is a reduced-resolution representation of the original document image, in which each pixel is tagged to indicate all the region(s) it belongs to.

In building the region maps, one has the choice of allowing for region overlaps or not: regions can be rendered while keeping track of their overlaps as described below, or they can be rendered one after the other without worrying about overwriting previously rendered pixels. Apart from the global specification about how overlapping regions should be rendered, one can also define some region subtypes to be hollow, so that they are rendered to contain only those pixels that are not included in any other region (see Section 4.1).

To keep track of the overlaps, the system requires that region numbers (IDs) are all positive, and uses negative numbers to indicate the overlaps. Specifically, each ON-pixel is assigned the number of the region it belongs to; if this pixel belongs to more than one region, it is given a unique negative number that represents the particular set of regions that are overlapping at that pixel location. A hypothetical example is shown in Figure 5, where, for instance, label -4 corresponds to the overlap between regions 1 and 3 only, whereas label -6 corresponds to the overlap between regions 1, 3 and 4.

Region maps provide a compact and efficient way of handling arbitrarily complex documents where several regions might overlap.
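The labeling scheme described above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: regions are given directly as sets of ON-pixel coordinates rather than rendered from polygons, the label 0 for background pixels is an assumption, and the order in which negative labels are handed out is arbitrary.

```python
def build_region_map(width, height, regions):
    """Build a region map that keeps track of overlaps.

    regions: {positive id: iterable of (x, y) ON-pixels}.
    Returns (region_map, overlap_sets). region_map[y][x] is 0 for
    background (an assumption of this sketch), a positive region id, or a
    negative label; overlap_sets maps each negative label to the frozenset
    of region ids overlapping at the pixels carrying that label.
    """
    members = {}
    for rid, pixels in regions.items():
        if rid <= 0:
            raise ValueError("region ids must be positive")
        for x, y in pixels:
            members.setdefault((x, y), set()).add(rid)

    region_map = [[0] * width for _ in range(height)]
    label_of = {}  # frozenset of region ids -> negative label
    for (x, y) in sorted(members):
        ids = members[(x, y)]
        if len(ids) == 1:
            region_map[y][x] = next(iter(ids))
        else:
            key = frozenset(ids)
            if key not in label_of:
                label_of[key] = -(len(label_of) + 1)
            region_map[y][x] = label_of[key]
    overlap_sets = {label: ids for ids, label in label_of.items()}
    return region_map, overlap_sets


# Regions 1 and 3 share the middle pixel of a 3x1 image.
region_map, overlap_sets = build_region_map(
    3, 1, {1: [(0, 0), (1, 0)], 3: [(1, 0), (2, 0)]}
)
```

Each distinct set of overlapping regions receives its own negative label, so a single integer per pixel is enough to recover every region the pixel belongs to.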

Finding Region Correspondences

Using region maps, the amount of overlap between each segmentation region and each ground-truth region is first determined. This is done by scanning the two region maps, for each region pair in $S \times G$ whose bounding boxes overlap.

After finding the overlaps, the actual correspondences between the segmentation and ground-truth regions are determined. An overlap does not necessarily indicate a correspondence: a segmentation region overlapping with two ground-truth regions (one enclosed in the other) corresponds to only one of them. For instance, in Fig. 6.a, the dashed segmentation region corresponds to the enclosing image region (the text region is missed in the segmentation), whereas in Fig. 6.b, the segmentation region corresponds to the enclosed text inside the image (the image region is missed).

Since two ground-truth regions overlap only if one is enclosed in the other, deciding on which ground-truth region to assign the overlapping segmentation region is done by computing a simple match score between the overlapping ground-truth and the segmentation regions. For a ground-truth region g and the segmentation region s, the score is defined as the percentage of the ON-pixels of the ground-truth region covered by s minus the percentage of the ON-pixels of s outside of g, as:

$$\mathrm{match}(g, s) = \frac{\mathrm{overlap}(g, s)}{\mathrm{area}(g)} - \frac{\mathrm{outside}(g, s)}{\mathrm{area}(s)}$$

For instance, in Fig. 6.a, the segmentation region S1 fully covers both G1 and G2, but because it has a significant number of ON-pixels outside G2, its match score to G2 is low. On the other hand, in Fig. 6.b, S1 covers only a small portion of G1 while fully covering G2; since it does not contain any ON-pixels outside G1 or G2, its match score to G2 is high.
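A small worked example may help here. The sketch below evaluates the match score on pixel sets that mimic the two situations of Fig. 6: a ground-truth text region G2 enclosed in a ground-truth image region G1. The pixel counts are invented for illustration, and pixels are arranged on a single row only to keep the example short.

```python
def match_score(g_pixels, s_pixels):
    """match(g, s) = overlap(g, s) / area(g) - outside(g, s) / area(s)."""
    g, s = set(g_pixels), set(s_pixels)
    if not g or not s:
        raise ValueError("regions must contain at least one ON-pixel")
    overlap = len(g & s)   # ON-pixels of g covered by s
    outside = len(s - g)   # ON-pixels of s lying outside g
    return overlap / len(g) - outside / len(s)


# Hypothetical regions: G1 has 100 ON-pixels, G2 is the 10 pixels
# enclosed in G1 (so G1's pixel set contains G2's).
G1 = {(x, 0) for x in range(100)}
G2 = {(x, 0) for x in range(40, 50)}

S1_a = set(G1)  # as in Fig. 6.a: S1 covers everything
S1_b = set(G2)  # as in Fig. 6.b: S1 covers only the enclosed text

score_a_G1 = match_score(G1, S1_a)  # full cover, nothing outside
score_a_G2 = match_score(G2, S1_a)  # full cover, but 90% of S1 outside G2
score_b_G1 = match_score(G1, S1_b)  # only 10% of G1 covered
score_b_G2 = match_score(G2, S1_b)  # full cover, nothing outside
```

In the first case S1 is assigned to G1 and in the second to G2, which agrees with the correspondences described for Fig. 6.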

Region Alignments

When a ground-truth region corresponds to more than one segmentation region, the alignment of the segmentation regions is analyzed. Two regions are called vertically aligned if they do not overlap along the x-axis, otherwise the alignment is called horizontal alignment (see Figure 7). Similarly, when a segmentation region overlaps with more than one ground-truth region, the alignment of the ground-truth regions is analyzed.

From region overlaps and the alignment types, the occurrence of various segmentation errors (e.g., missed or horizontally merged ground-truth regions) can be detected. The image pixels affected by each error and the associated costs are then determined by further analysis, as explained below.

Detecting Segmentation Errors

After finding the region correspondences and analyzing their alignments, error analysis is performed to detect the occurrence of various kinds of segmentation errors, their locations, and the associated penalties. The segmentation errors that are currently detected are the following:

- noise region
- missed ground-truth region
- vertically split ground-truth region
- horizontally split ground-truth region
- vertically merged ground-truth region
- horizontally merged ground-truth region
- mislabeled pixels

A noise region is a segmentation region that does not overlap with any ground-truth regions: it is incorrectly detected as a valid region. Similarly, a missed region is a ground-truth region that does not overlap with any segmentation regions: it is incorrectly thought to be a noise (invalid) region. On the other hand, if no single segmentation region covers a given ground-truth region, that ground-truth region is called split. The alignment of the segmentation regions indicates whether the split is a horizontal or a vertical one. Conversely, if a segmentation region matches more than one ground-truth region, it is merging them. As above, the alignment of the ground-truth regions indicates whether the merge is a horizontal or a vertical one.
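The definitions above can be approximated with a short sketch that classifies regions from a list of (ground-truth, segmentation) correspondences. It is a simplification under stated assumptions: the correspondence list stands in for both the overlap test and the coverage test of the text, the horizontal/vertical distinction is omitted, and the function name and region numbers are hypothetical.

```python
def classify_regions(gt_ids, seg_ids, correspondences):
    """Group regions by error type.

    correspondences: iterable of (g, s) pairs judged to correspond.
    """
    segs_of = {g: set() for g in gt_ids}   # ground-truth -> segmentation
    gts_of = {s: set() for s in seg_ids}   # segmentation -> ground-truth
    for g, s in correspondences:
        segs_of[g].add(s)
        gts_of[s].add(g)
    return {
        # segmentation regions with no ground-truth counterpart
        "noise": sorted(s for s, gs in gts_of.items() if not gs),
        # ground-truth regions with no segmentation counterpart
        "missed": sorted(g for g, ss in segs_of.items() if not ss),
        # ground-truth regions shared among several segmentation regions
        "split": sorted(g for g, ss in segs_of.items() if len(ss) > 1),
        # segmentation regions spanning several ground-truth regions
        "merging": sorted(s for s, gs in gts_of.items() if len(gs) > 1),
    }


errors = classify_regions(
    gt_ids=[1, 2, 3, 4],
    seg_ids=[10, 11, 12, 13],
    correspondences=[(1, 10), (1, 11), (2, 12), (3, 12)],
)
```

Here ground-truth region 1 is split between segmentation regions 10 and 11, segmentation region 12 merges ground-truth regions 2 and 3, region 4 is missed, and region 13 is noise.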

The severity of these segmentation errors varies from application to application. For instance, if segmentation is to be used in the context of an OCR system, vertical merging of text regions is usually not a big mistake, especially if the merged regions appear consecutively in the reading order. On the other hand, horizontal merging of text regions should be strongly penalized. For halftone regions, vertical and horizontal merges are usually equally bad.

Estimating Error Costs

When a segmentation error is detected, the associated image area is analyzed (using the region maps) to determine the cost of the mistake. For an objective and fair evaluation of the segmentation, one should not flag an entire region as "bad" if it has only a few erroneous pixels. For instance, if a ground-truth region matches only one segmentation region but a few of its pixels are not covered, it would be too severe to mark the entire region as being split. In order to differentiate between the severity of a split involving few pixels (very common on regions where the boundaries are not well defined) and one that involves most of the region, the severity of the mistake needs to be assessed in terms of the number of ON-pixels involved. For instance, when a ground-truth region is merged, our system normally considers only the area that is actually merged (not the entire region) as erroneous, as shown in Fig. 8.

The exact area (shaded in Fig. 8) where the segmentation error occurs is found by scanning the two region maps. For instance, when a ground-truth region, g, is split, each line (row) is scanned to see if there is a corresponding segmentation region covering that entire line. If not, it means that the line is split and all the pixels on it are marked as such. Similarly, each line of a merging segmentation region is scanned to see whether there is a corresponding ground-truth region covering all the pixels on that line. If not, all the pixels on that line are marked as merged.

Alternatively, the user could penalize the entire region, regardless of the size of the actual error area. The method for computing error costs is indicated in a start-up file, as one of the following alternatives:

- unit cost: the cost of an erroneous region is 1 regardless of its size. This makes sense when segmentation is to be manually corrected, in which case the main thing that matters is the number of regions to be re-zoned, not their size. When the unit-cost method is used, the total cost is normalized by the number of ground-truth regions on the page.

- cost proportional to the height or size (ON-pixel count) of the erroneous area. If the segmentation output is to be fed directly to the OCR system, it is more appropriate to compute the cost in proportion to the height or size of the erroneous area. One could even penalize images by size and text by height. For height and size based methods, the total cost is normalized by the number of lines containing ON-pixels, and the number of ON-pixels on the page, respectively. The default error cost is the size of the erroneous area.
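The three normalizations listed above can be summarized in a few lines. The sketch assumes that the counts of erroneous regions, lines, and ON-pixels have already been obtained from the error analysis; the function name, the dictionary keys, and the example numbers are hypothetical, and error weights are left out.

```python
def page_cost(method, erroneous, page):
    """Normalized page cost for one of the three cost methods.

    erroneous and page are dicts with the keys:
      'regions': erroneous / total ground-truth regions,
      'lines':   erroneous / total lines containing ON-pixels,
      'pixels':  erroneous / total ON-pixels.
    """
    if method == "unit":
        return erroneous["regions"] / page["regions"]
    if method == "height":
        return erroneous["lines"] / page["lines"]
    if method == "size":
        return erroneous["pixels"] / page["pixels"]
    raise ValueError(f"unknown cost method: {method}")


# Invented counts for a single page.
erroneous = {"regions": 2, "lines": 30, "pixels": 1200}
page = {"regions": 25, "lines": 600, "pixels": 48000}

cost_unit = page_cost("unit", erroneous, page)
cost_height = page_cost("height", erroneous, page)
cost_size = page_cost("size", erroneous, page)
```

The same page can thus look quite different depending on the method: two bad regions out of 25 weigh more under unit cost than the small area they actually affect does under the size-based cost.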

The regions are currently analyzedalong the horizontal scan lines be-cause most English text is writtenin that direction (hence, horizontalsplits or merges are severe). In orderto handle documents that might con-tain vertically written text, it wouldbe necessary to also include the di-rection of writing inside a region inthe ground-truth �le and scan for er-rors along this direction. This will beaddressed in future versions.

Weighting and Combination of Errors

When a segmentation error is detected, how should it be weighted with respect to other errors? How should errors be combined if they affect the same ground-truth region? For example, if (part of) a ground-truth region is found to have been split twice by segmentation, should this be considered twice as bad? Similarly, if this region is found to have been both merged and split, should one of these errors have precedence over the other one, or should both errors be counted? The answers to all these questions depend on the type of segmentation being benchmarked, and can be specified in the start-up file used by our benchmarking tool (see next section).

To deal with all the possible ways one may want to weight and combine errors, we use a scheme based on error maps (same concept as the region maps). Using an error map, we can keep track of and penalize all the errors associated with a given pixel in the image (e.g., an erroneous split and an erroneous merge), or only the most serious one. The weight of an error, indicating the severity of the error, is specified in the start-up file. One can specify, for instance, the weight of mislabeling a text region with shaded background as halftone. Hence, in addition to indicating what the cost should be proportional to (unit, or by height or by size), the start-up file can indicate which errors to take into account (all or the most serious), and what the weight for a given error should be. The weights of the errors are used to determine whether a particular error should be penalized (positive weight) or not, and how severe it is with respect to the other errors.
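A small sketch may help make the two combination modes concrete. Here the error map is modeled as a 2D list holding, for each pixel, the set of error names recorded there; the error names, the weight values, and the function names are hypothetical and chosen only for illustration.

```python
def pixel_penalty(errors, weights, combine="all"):
    """Penalty for one pixel given the set of errors recorded on it.

    Only errors with a positive weight are penalized. With combine="all" the
    weights are summed; with combine="most_serious" only the largest counts.
    """
    active = [weights.get(e, 0) for e in errors if weights.get(e, 0) > 0]
    if not active:
        return 0
    return sum(active) if combine == "all" else max(active)


def error_map_cost(error_map, weights, combine="all"):
    """Sum the per-pixel penalties over the whole error map."""
    return sum(pixel_penalty(errs, weights, combine)
               for row in error_map for errs in row)


# Hypothetical weights: vertical merges are not penalized (weight 0).
weights = {"h_merge": 2, "h_split": 1, "v_merge": 0}
error_map = [[{"h_merge", "h_split"}, {"v_merge"}],
             [set(), {"h_split"}]]
cost_all = error_map_cost(error_map, weights, combine="all")
cost_worst = error_map_cost(error_map, weights, combine="most_serious")
```

The two modes differ only on pixels that carry more than one penalized error, such as the top-left pixel in this example.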

After all errors have been detected, the segmentation quality is characterized using the costs of each of the error types (e.g., noise, splits, merges) and the overall page cost. The system also reports the cost associated with each specific error (e.g., reverse-video text recognized as halftone) that has a non-zero weight in the start-up file. Hence, one can even vary the amount of information output by the system about the segmentation performance.

As an example, consider the set of ground-truth regions shown in Fig. 9: eight text regions and two image regions are present. The double-spaced text was zoned into five different paragraphs according to our ground-truthing guidelines, but for most applications, zoning it into two galleys would have been sufficient. Hence, in the start-up file, the cost of vertically merging text regions was set to zero. Fig. 10 shows the output of the automatic segmentation algorithm, which mistakenly combined two galleys into one. Furthermore, the text part of the logo was split and joined to the merged text galleys. The corresponding error map created by our benchmarking algorithm is shown in Fig. 11. The black pixels in the map show the horizontally merged text of the two galleys and the horizontally split pixels of the logo.

Note that we consider all the pixels (ON or OFF) between the first and last ON-pixels on a line as defining that line, and mark them as erroneous if the line is found to be erroneous. This not only increases the weight of the text areas with respect to the dense image areas, but also lessens the difference between the weights of boldface characters and regular ones.
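The line-extent rule can be sketched in a few lines. The helper names `line_extent` and `mark_line_erroneous` are hypothetical, and a line is modeled simply as a list of 0/1 pixels.

```python
def line_extent(row):
    """Return (first, last) column indices of ON-pixels on a line, or None."""
    on = [c for c, p in enumerate(row) if p]
    if not on:
        return None
    return on[0], on[-1]


def mark_line_erroneous(row):
    """Mask of pixels to mark when the line is erroneous.

    Every pixel between the first and last ON-pixel is marked, whether it is
    ON or OFF, so sparse text counts as much as dense text of the same width.
    """
    extent = line_extent(row)
    if extent is None:
        return [False] * len(row)
    first, last = extent
    return [first <= c <= last for c in range(len(row))]


row = [0, 1, 0, 0, 1, 1, 0]
mask = mark_line_erroneous(row)
marked = sum(mask)   # pixels marked as erroneous
on_count = sum(row)  # ON-pixels actually present on the line
```

Here five pixels are marked although only three are ON, which illustrates how the rule raises the relative weight of sparse text lines.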

The output of the benchmarking for the example shown in Figures 9 and 10 is given in Fig. 12. Zoning errors and labeling errors are listed separately, as they belong to different aspects of the segmentation process. The start-up file used to produce this output specified six different types of mislabeling (e.g., mislabeling text as image, confusion of image subtypes), but only one type of vertical merge, horizontal merge, etc. Also recall that the cost of vertical merges and splits was set to zero.

In this example, the overall percentage of bad pixels is found to be about 25% with the particular settings of the start-up file, when about two thirds of the document is merged. However, because the analysis was based in this case on region sizes (ON-pixel counts), the correct zoning of the halftone skewed the results favorably. One way to achieve an overall cost that would reflect our "eyeballing" judgement would be to base the analysis on region heights.

4.1 Customizing the Benchmarking

Just like GroundsKeeper, our benchmarking tool Cluzo can be customized to a particular application. The start-up file required by the program is decomposed into several sections:

- Listing of all the region types, subtypes, attributes etc. that can be encountered in RDIFF files.

- The type (text or image) of each region subtype. For instance, the type of the subtype LOGO can be set to IMAGE or TEXT, depending on the application.

- Description of region equivalencies: in this section, a user can specify, for instance, that subtype A is in fact equivalent to subtype B, and that subtype D is equivalent to subtype B as well. For example, if the user does not care about distinguishing between different subtypes of image regions (e.g., graphics, halftones, line-art) and text regions (e.g., captions, headlines, footers), the total number of region subtypes could be reduced to two. This is also useful in mapping the subtype names used in the segmentation output to those used in the ground-truth (e.g., TIME-STAMP to TIMESTAMP).

- Listing of regions to ignore. With this feature, it is possible to completely exclude certain region subtypes (e.g., frames) from the benchmarking process.

- Listing of hollow regions. The regions defined as hollow will be assigned only those pixels inside them that do not belong to other regions (e.g., the subtype FRAMEBOX, which is the subtype of frames around an image or text area, is declared as hollow, and therefore only contains those pixels that actually form the frame). This feature not only makes ground-truthing of hollow regions easy, but also extends the possible region shape to an arbitrary polygon with holes!

- Listing of all the different error types, together with their weights. This can be done for generic regions as well as for particular region types. For example, the start-up file we currently use contains the following:

    Weight_of Vertically_split
        REGION 0
        GRAPHICS 3
        HALFTONE 3

This specifies that the weight of a vertical split is 0, except when it involves a GRAPHICS region or a HALFTONE region, in which case the weight is 3.

A more complicated example is:

    Weight_of Mislabeled
        REGION 2
    Weight_of Mislabeled
        REG_TEXT with-attribute SHADED_BACKGROUND as HALFTONE 1

This indicates that the weight of misclassifying a (any) region is 2, and the weight of misclassifying a regular text region with shaded background as halftone is 1 (less severe). As mentioned before, the system will output the specific cost of misclassifying regular text regions with shaded background as halftone, in addition to the more general errors.

Weights specified later in the start-up file overwrite the previous ones; this makes it easy to specify the weights, from more general to more specific. For instance, in the above example, the mislabeling cost for the generic region (keyword "REGION") is specified first, and then the exceptions are indicated.

- What error costs should be proportional to (unit, height, size). By default, the cost of an error is the number of pixels involved in that error.

- Definition of the error combination method: when an area of the page is involved in multiple segmentation mistakes, should all these mistakes be taken into account, or should only the most serious mistake (the one with highest associated weight) prevail?
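The "later weights overwrite earlier ones" rule from the error-weight section above can be sketched as a last-match-wins lookup. This is only an illustration under stated assumptions: the `WeightRule` tuple, the `resolve_weight` function, and the default weight of 0 for unlisted errors are hypothetical, and the sketch does not parse the actual start-up file syntax. The rule values mirror the two examples quoted in the list.

```python
from typing import NamedTuple, Optional


class WeightRule(NamedTuple):
    error: str                        # e.g. "Vertically_split", "Mislabeled"
    region: str                       # region type, or "REGION" for any region
    weight: int
    attribute: Optional[str] = None   # required attribute, if any
    labeled_as: Optional[str] = None  # wrong label assigned, if relevant


# Rules in file order, from general to specific.
RULES = [
    WeightRule("Vertically_split", "REGION", 0),
    WeightRule("Vertically_split", "GRAPHICS", 3),
    WeightRule("Vertically_split", "HALFTONE", 3),
    WeightRule("Mislabeled", "REGION", 2),
    WeightRule("Mislabeled", "REG_TEXT", 1,
               attribute="SHADED_BACKGROUND", labeled_as="HALFTONE"),
]


def resolve_weight(rules, error, region, attributes=(), labeled_as=None, default=0):
    """Return the weight of the last rule that matches the given error."""
    weight = default
    for rule in rules:
        if rule.error != error:
            continue
        if rule.region not in ("REGION", region):
            continue
        if rule.attribute is not None and rule.attribute not in attributes:
            continue
        if rule.labeled_as is not None and rule.labeled_as != labeled_as:
            continue
        weight = rule.weight  # later matches overwrite earlier ones
    return weight


split_text = resolve_weight(RULES, "Vertically_split", "REG_TEXT")
split_graphics = resolve_weight(RULES, "Vertically_split", "GRAPHICS")
shaded_as_halftone = resolve_weight(
    RULES, "Mislabeled", "REG_TEXT",
    attributes={"SHADED_BACKGROUND"}, labeled_as="HALFTONE")
plain_as_halftone = resolve_weight(
    RULES, "Mislabeled", "REG_TEXT", labeled_as="HALFTONE")
```

Keeping the rules in file order and letting the last match win means the generic "REGION" entry acts as a fallback without any explicit priority scheme.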

5 Conclusions, Future Work

We described the Pink Panther automatic page segmentation benchmarking environment. Pink Panther consists of two separate parts: GroundsKeeper, an X-windows tool for creating segmentation ground-truth files, and Cluzo, the segmentation benchmarking system.

Our approach to benchmarking segmentation is region-based: segmentation quality is assessed by comparing the segmentation output, described as a set of regions, to the corresponding ground-truth. This approach enables us to benchmark segmentation performance directly, unlike text-based techniques that rely on the OCR output.

The benchmarking system can be used to benchmark various different segmentation tasks and is able to keep track of a large number of error types, providing very detailed information on the segmentation quality. In addition to reporting the location, type and severity of all the segmentation mistakes on the page, the system also computes an overall cost from the individual costs of the mistakes, according to the specifications in the start-up file. The relative weights of various error types, and how they should be combined, are indicated in the start-up file by the user. Another powerful aspect of the system is its ability to deal with multiple region orderings, which is important to properly handle complex document structures. The system has been developed with flexibility in mind, considering all the different ways it could be used in practice. This makes it a compelling alternative to text-based segmentation benchmarking approaches.

Note that by penalizing errors only by the number of pixels associated with that error (and not the whole region), small zoning mistakes in the ground-truth that might be invisible in low resolution become insignificant. Using partial region ordering also removes some of the potential ground-truthing ambiguity.

The system is now available on the DIMUND (Document Understanding and Character Recognition) list web site (http://documents.cfar.umd.edu).

In future versions of our benchmarking environment, we plan to handle documents where the direction of the writing can be vertical as well as horizontal, such as in Chinese and Japanese documents and in some captions and sidebars in English documents (often without upright letters).

Another significant extension to our benchmarking system is the capability of handling grayscale and color documents: we are currently adding this to GroundsKeeper and considering a few options for extending Cluzo to benchmark grayscale and color images as well. This will potentially open a whole new range of applications for our benchmarking beyond document analysis, such as in medical imaging.

References

[1] Ying-Wei Lin. Digital image processing in the Xerox docutech document processing system. In Luc Vincent and Theo Pavlidis, editors, SPIE/SPSE Vol. 2181, Document Recognition, San Jose CA, February 1994.

[2] M. Nadler. A survey of document segmentation and coding techniques. Comp. Vis., Graphics and Image Processing, 28:240-262, 1984.

[3] Luc Vincent. Page segmentation in Textbridge. Unpublished Document, Xerox Desktop Document Systems, 1993.

[4] J.L. Fisher, S.C. Hinds, and D.P. D'Amato. A rule-based system for document image segmentation. In Int. Conf. on Pattern Recognition, pages 567-572, Atlantic City, NJ, June 1990.

[5] Jaekyu Ha, Robert Haralick, and Ihsin Phillips. Document page decomposition by the bounding-box projection technique. In Int. Conf. on Document Analysis and Recognition, pages 1119-1122, Montreal, 1995.

[6] Theo Pavlidis and J. Zhou. Page segmentation by white streams. In Int. Conf. on Document Analysis and Recognition, pages 945-953, Saint-Malo, France, 1991.

[7] Henry S. Baird. Background structure in document images. In Advances in Structural and Syntactic Pattern Recognition, volume 5, pages 253-269. World Scientific, 1992.

[8] A. Antonacopoulos and R.T. Ritchings. Flexible page segmentation using the background. In 12th Int. Conf. on Pattern Recognition, pages 339-344, Jerusalem, October 1994.

[9] Stephen V. Rice, Junichi Kanai, and Thomas A. Nartker. A preliminary evaluation of automatic zoning. Technical report, ISRI, 1993.

[10] Junichi Kanai, Tom A. Nartker, Stephen V. Rice, and George Nagy. Performance metrics for document understanding systems. In Int. Conf. on Document Analysis and Recognition, pages 424-427, Tsukuba, Japan, October 1993. IEEE Computer Society Press.

[11] Junichi Kanai, Stephen V. Rice, Thomas Nartker, and George Nagy. Automatic evaluation of OCR zoning. IEEE Trans. Pattern Anal. Machine Intell., 17(1):86-90, January 1995.

[12] Sabine Randriamasy, Luc Vincent, and Ben Wittner. An automatic benchmarking scheme for page segmentation. In Luc Vincent and Theo Pavlidis, editors, SPIE/SPSE Vol. 2181, Document Recognition, San Jose CA, February 1994.

[13] Berrin A. Yanikoglu and Luc Vincent. Ground-truthing and benchmarking document page segmentation. In Int. Conf. on Document Analysis and Recognition, Montreal, August 1995.


[14] Esko Ukkonen. Algorithms for approximate string matching. Information and Control, 64:100-118, 1985.

[15] Stephen V. Rice, Junichi Kanai, and Thomas A. Nartker. The third annual test of OCR accuracy. Technical report, ISRI, 1994.

[16] Stephen V. Rice, Frank Jenkins, and Thomas A. Nartker. The fourth annual test of OCR accuracy. Technical report, ISRI, 1995.

[17] Sabine Randriamasy and Luc Vincent. Benchmarking page segmentation algorithms. In IEEE Int. Conf. on Computer Vision and Pattern Recog., Seattle, WA, June 1994.

[18] Luc Vincent and Berrin Yanikoglu. A complete environment for ground-truthing and benchmarking page segmentation algorithms. In Symposium on Document Image Understanding Technology, Bowie, MD, October 1995.

[19] Olivier Desforges and Dominique Barba. Segmentation of complex documents multilevel images: a robust and fast text bodies-headers detection and extraction scheme. In Int. Conf. on Document Analysis and Recognition, pages 770-773, Montreal, 1995.

[20] George Nagy and S. Seth. A prototype document analysis system for technical journals. IEEE Computer, 25(7):10-22, July 1992.

[21] Lawrence O'Gorman. The document spectrum for bottom-up layout analysis. IEEE Trans. Pattern Anal. Machine Intell., 15(10):1162-1173, October 1993.

[22] Theo Pavlidis and J. Zhou. Page segmentation and classification. Comp. Vis., Graph. Im. Proc.: Graphical Models and Image Processing, 54(6):484-496, November 1992.


ATTRIBUTE_NAME ANGLED

ATTRIBUTE_NAME BAR-CHART

ATTRIBUTE_NAME BULLETS

ATTRIBUTE_NAME CELL-TABLE

ATTRIBUTE_NAME CURVED-TEXT

ATTRIBUTE_NAME HAND-DRAWN

ATTRIBUTE_NAME ILLEGIBLE

ATTRIBUTE_NAME INCOMPLETE

ATTRIBUTE_NAME OUTLINED

ATTRIBUTE_NAME PIE-CHART

ATTRIBUTE_NAME REVERSE_VIDEO

ATTRIBUTE_NAME SHADED_BACKGROUND

ATTRIBUTE_NAME TAB-TABLE

ATTRIBUTE_NAME UNDERLINED

ATTRIBUTE_NAME UPSIDE-DOWN

ATTRIBUTE_NAME VERTICAL

REGION_TYPE_NAME CAPTION

REGION_TYPE_NAME CELL

REGION_TYPE_NAME DROPCAP

REGION_TYPE_NAME FOOTER

REGION_TYPE_NAME FOOTNOTE

REGION_TYPE_NAME FRAMEBOX

REGION_TYPE_NAME GRAPHICS

REGION_TYPE_NAME GRID

REGION_TYPE_NAME HALFTONE

REGION_TYPE_NAME HEADER

REGION_TYPE_NAME HEADLINE

REGION_TYPE_NAME HRULE

REGION_TYPE_NAME INSET

REGION_TYPE_NAME LINEART

REGION_TYPE_NAME LOGO

REGION_TYPE_NAME RAISEDCAP

REGION_TYPE_NAME REG_TEXT

REGION_TYPE_NAME SIDEBAR

REGION_TYPE_NAME SIGNATURE

REGION_TYPE_NAME SUBTITLE

REGION_TYPE_NAME TABLE

REGION_TYPE_NAME TABLE_COLUMN

REGION_TYPE_NAME TABLE_ROW

REGION_TYPE_NAME TIMESTAMP

REGION_TYPE_NAME UNKNOWN_TYPE

REGION_TYPE_NAME VRULE

Fig. 4. Example of GroundsKeeper start-up file specifying all possible region types and attributes.

[Figure: region map over ground-truth regions G1 to G4; the map values shown include 1, 3, 4 and the negative values -3 to -7.]

Fig. 5. Sample region map. Negative values correspond to unique overlaps.

[Figure: two panels, a) and b), each showing ground-truth regions G1 and G2 (TEXT) and a segmentation region S1.]

Fig. 6. Region correspondences. In a) the dashed segmentation region S1 corresponds to the enclosing image region G1, while in b) the segmentation region S1 corresponds to the enclosed text region, G2.

[Figure: two panels labeled Vertical and Horizontal.]

Fig. 7. Alignment types of regions (horizontal or vertical).


[Figure: outlines labeled "Segmentation region" and "Ground truth region", with a shaded error area.]

Fig. 8. When benchmarking the segmentation of an OCR system, only the pixels in the shaded area are found to be part of an erroneous horizontal merge of text regions.

[Figure: page layout with regions labeled TEXT and IMAGE.]

Fig. 9. Display of the ground-truth regions for a document image.

[Figure: page layout with regions labeled TEXT 0, TEXT 1 and IMAGE.]

Fig. 10. A segmentation result for the same image as in Fig. 9. Note the horizontal merging of text galleys.


Fig. 11. Error map corresponding to the segmentation result displayed in Fig. 10. Pixels involved in a merging error are shown in black.

Ground-truth file: bhng11.sgt
Segmentation file: bhng11.rdiff
Image file: bhng11.tif
ZONING:
Missed regions = 0.00
Noise regions = 0.00
Horizontal merges = 22.65
Horizontal splits = 0.94
Vertical merges = 0.00
Vertical splits = 0.00

Fig. 12. Sample benchmarking output.
