A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Università degli studi di Bari “Aldo Moro”, Dipartimento di Informatica. A Run Length Smoothing-Based Algorithm for non-Manhattan Document Segmentation. S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito. Via Orabona, 4 - 70126 Bari – Italy. {ferilli, esposito}@di.uniba.it, {fabio.leuzzi, fulvio.rotella}@uniba.it. L.A.C.A.M. http://lacam.di.uniba.it

description

Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A well-known approach to layout analysis in the literature is the Run Length Smoothing Algorithm (RLSA). Here we consider a variant of RLSA, called RLSO (short for “Run Length Smoothing with OR”), that uses the logical OR operator instead of AND and is particularly suited to identifying frames in non-Manhattan layouts. Like RLSA, RLSO relies on thresholds, but they are set according to different criteria than those used in RLSA. Since setting such thresholds is a hard and unnatural task for (even expert) users, and no single threshold can fit all documents, we developed a technique to automatically define such thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, which cover a significant range of real cases, showed that the approach is satisfactory for documents characterized by a uniform text font size.

Transcript of A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Page 1: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica

A Run Length Smoothing-Based Algorithm for non-Manhattan Document Segmentation

S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito
Via Orabona, 4 - 70126 Bari – Italy

{ferilli, esposito}@di.uniba.it
{fabio.leuzzi, fulvio.rotella}@uniba.it

L.A.C.A.M.
http://lacam.di.uniba.it

Page 2: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Introduction

● Automatic document processing a hot topic
― Layout analysis a fundamental step
● Identification of frames (relevant components in the document)
● Performance can determine quality and feasibility of the whole process

● Two different…
● Kinds of sources: Digitized (scanned) vs. Natively digital documents
● Categories of layouts: Manhattan vs. Non-Manhattan
● Types of algorithms: Top-down vs. Bottom-up

● Run Length Smoothing Algorithm
● Manhattan Layout

● Other works exploit or try to improve the RLSA by setting its parameters
● Many works on Manhattan layout
― Top-down strategies
● Fewer works on non-Manhattan layout
― Bottom-up strategies

● The Manhattan assumption holds for many typeset documents and simplifies document processing… BUT it cannot be assumed in general

Page 3: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO: Application to scanned images

RLSO (Run Length Smoothing with OR)

1) horizontal smoothing with threshold $t_h$, row by row

2) vertical smoothing with threshold $t_v$, column by column

● logical OR of the images obtained in steps 1 and 2

[Figure: smoothing example with $t_h = 5$ and $t_v = 4$; one panel is labeled (AND)]
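The two smoothing passes and their OR combination can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. It assumes a binary image stored as a NumPy array with 1 = black and 0 = white, fills a white run only when it is enclosed between two black pixels, and uses "length ≤ threshold" as the filling condition. The names `smooth_runs` and `rlso` are hypothetical.

```python
import numpy as np

def smooth_runs(line, threshold):
    """Fill white runs (0) of length <= threshold lying between two black pixels (1).

    Assumption: runs touching the border of the line are left untouched.
    """
    out = line.copy()
    black = np.flatnonzero(line)            # positions of black pixels
    for left, right in zip(black[:-1], black[1:]):
        gap = right - left - 1              # length of the white run in between
        if 0 < gap <= threshold:
            out[left + 1:right] = 1
    return out

def rlso(image, th, tv):
    """Horizontal pass (th), vertical pass (tv), then pixel-wise logical OR."""
    image = np.asarray(image, dtype=np.uint8)
    horizontal = np.array([smooth_runs(row, th) for row in image])
    vertical = np.array([smooth_runs(col, tv) for col in image.T]).T
    return horizontal | vertical

if __name__ == "__main__":
    page = [[1, 0, 0, 1],
            [0, 0, 0, 0],
            [1, 0, 0, 0]]
    print(rlso(page, th=2, tv=1))
```

Replacing `|` with `&` in the last line of `rlso` would give the AND combination that RLSO is contrasted with.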

Page 4: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO: Application to scanned images

Page 5: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO: Application to born-digital documents

● Set horizontal/vertical distance thresholds $t_h$ / $t_v$

● Build a frame for each basic block

● $H = \{(d_h, b', b'') \mid b'$ and $b''$ are horizontally adjacent basic blocks and $d_h$ is the horizontal distance between them$\}$

● For all $(d_{h,i}, b'_{h,i}, b''_{h,i}) \in H$ s.t. $d_{h,i} \le t_h$, merge the frames to which $b'_{h,i}$ and $b''_{h,i}$ belong

● $V = \{(d_v, b', b'') \mid b'$ and $b''$ are vertically adjacent basic blocks and $d_v$ is the vertical distance between them$\}$

● For all $(d_{v,i}, b'_{v,i}, b''_{v,i}) \in V$ s.t. $d_{v,i} \le t_v$, merge the frames to which $b'_{v,i}$ and $b''_{v,i}$ belong

[Figure legend: reference block, adjacent blocks, non-adjacent blocks, horizontal distance, vertical distance]
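The merging procedure for born-digital documents can be illustrated with a small union-find sketch. The slide does not define adjacency precisely, so the code below makes a simplifying assumption: two blocks count as horizontally (vertically) adjacent when their vertical (horizontal) extents overlap, without checking for other blocks lying in between. Blocks are hypothetical bounding boxes `(x0, y0, x1, y1)`, and `merge_blocks` is an illustrative name.

```python
from itertools import combinations

def overlap(a0, a1, b0, b1):
    """True when the intervals (a0, a1) and (b0, b1) share some extent."""
    return min(a1, b1) > max(a0, b0)

def merge_blocks(blocks, th, tv):
    """Group basic blocks into frames; returns lists of block indices."""
    parent = list(range(len(blocks)))       # one frame per basic block

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(blocks)), 2):
        ax0, ay0, ax1, ay1 = blocks[i]
        bx0, by0, bx1, by1 = blocks[j]
        dh = max(bx0 - ax1, ax0 - bx1)      # horizontal gap (negative if overlapping)
        dv = max(by0 - ay1, ay0 - by1)      # vertical gap (negative if overlapping)
        # Assumed adjacency test: overlap on the other axis, non-negative gap.
        h_close = overlap(ay0, ay1, by0, by1) and 0 <= dh <= th
        v_close = overlap(ax0, ax1, bx0, bx1) and 0 <= dv <= tv
        if h_close or v_close:
            parent[find(i)] = find(j)       # merge the two frames

    frames = {}
    for i in range(len(blocks)):
        frames.setdefault(find(i), []).append(i)
    return sorted(frames.values())

if __name__ == "__main__":
    blocks = [(0, 0, 10, 10), (12, 0, 20, 10), (40, 0, 50, 10), (0, 13, 10, 20)]
    print(merge_blocks(blocks, th=3, tv=3))
```

A fuller version would restrict the pairs to blocks with no other block between them, as the adjacent/non-adjacent distinction in the figure legend suggests.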

Page 6: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO: Application to born-digital documents

Page 7: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO

● Run Length Smoothing algorithms based on thresholds

― Hard to properly set manually (not a typical human activity)

― Heuristic approaches (ad hoc)

― Undermines the idea of automatic processing

― Fixed thresholds not suitable for documents with several different spacings

Automatic assessment of RLSO thresholds

Page 8: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO: Automatic threshold assessment

● Study of run length behavior

― Histogram very irregular
● Peaks = most frequent spacings
● Peak clusters = equally spaced components
― Hard to exploit by automatic techniques

― Cumulative histograms more regular: $H'(i) = \sum_{j \ge i} H(j)$
― Bar b = number of runs of length greater than or equal to b
● Monotonically decreasing
― Flat zones = lengths for which no runs are present

● Scaled down to 10%
― Reduces variability

Figure 1. A fragment of a scientific paper.
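As a rough illustration of the cumulative histogram, the following Python sketch computes $H'(i) = \sum_{j \ge i} H(j)$ from a list of run lengths. It assumes run lengths are collected as the white gaps between consecutive black pixels of a line (1 = black). The function names are hypothetical, and the 10% scaling step is not covered here.

```python
import numpy as np

def white_run_lengths(line):
    """Lengths of the white runs enclosed between black pixels (1) in a line."""
    black = np.flatnonzero(line)
    gaps = np.diff(black) - 1
    return gaps[gaps > 0]

def cumulative_histogram(run_lengths):
    """Return H', where H'[i] is the number of runs with length >= i."""
    run_lengths = np.asarray(run_lengths, dtype=int)
    hist = np.bincount(run_lengths)         # H(j): number of runs of length j
    return hist[::-1].cumsum()[::-1]        # suffix sums give H'(i)

if __name__ == "__main__":
    lengths = [1, 1, 2, 5]
    print(cumulative_histogram(lengths))
```

In this toy input no run has length 3 or 4, so consecutive bars of the cumulative histogram are equal there: this is the kind of flat zone the slide refers to.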

Page 9: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

RLSO: Automatic threshold assessment

● Select threshold on flat zones
― Derivative a good indicator
● Slope = 0
● Discrete approximation on bar b (formula not preserved in the transcript)
― Tolerance possible
● Slope = –30
― Skip starting and trailing flat zones
● Starting zone = missing small run lengths
● Trailing zone = merge whole content

● Iteration of technique on previously smoothed image
― Finds progressively more spaced components

Figures 1-a and 1-b: successive applications of RLSO with automatic threshold assessment on Figure 1.
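One possible reading of the flat-zone selection rule is sketched below in Python. Several details are assumptions, since the slide's discrete slope formula is not available: the slope at bar b is taken as H'(b+1) − H'(b), a bar is "flat" when this slope is no steeper than a tolerance, and the threshold returned is the first flat bar after the starting flat zone and before the trailing one. The name `select_threshold` and its `tolerance` parameter are hypothetical.

```python
import numpy as np

def select_threshold(cum_hist, tolerance=0):
    """Pick the first interior flat bar of a cumulative histogram, or None.

    Assumed slope: slope[b] = H'(b+1) - H'(b), which is never positive.
    With tolerance 0, a flat bar b means that no run has length exactly b.
    """
    cum_hist = np.asarray(cum_hist, dtype=float)
    slope = np.diff(cum_hist)
    flat = slope >= -tolerance

    start = 0                               # skip the starting flat zone
    while start < len(flat) and flat[start]:
        start += 1

    end = len(flat)                         # skip the trailing flat zone
    while end > start and flat[end - 1]:
        end -= 1

    for b in range(start, end):
        if flat[b]:
            return b
    return None

if __name__ == "__main__":
    # Cumulative histogram of the run lengths [1, 1, 2, 5, 5, 9]
    print(select_threshold([6, 6, 4, 3, 3, 3, 1, 1, 1, 1]))
```

To mimic the iteration described on the slide, the image smoothed with this threshold would be analyzed again to obtain the next, larger threshold. When to stop is left open here, as it is listed under future work.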

Page 10: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Sample Evaluation

Page 11: A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Conclusions

● RLSO (Run Length Smoothing with OR) identifies runs of white pixels in the document image and fills them with black pixels whenever they are shorter than a given threshold

– Both Manhattan and Non-Manhattan Layout

– Version for natively digital documents

● Automatic thresholding effective on documents having

– single character size

– different spacings

● Good baseline towards more complex documents

– different character sizes

– graphics

● Current and future Work

– Stop criterion for iteration

– Clustering based on positioning and spacing