A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation
-
Upload
university-of-bari-italy -
Category
Technology
-
view
1.096 -
download
0
description
Transcript of A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation
Università degli studi di Bari “Aldo Moro”Dipartimento di Informatica
A Run Length Smoothing-Based AlgorithmA Run Length Smoothing-Based Algorithmfor non-Manhattan Document Segmentationfor non-Manhattan Document Segmentation
S. Ferilli, F. Leuzzi, F. Rotella, F. EspositoS. Ferilli, F. Leuzzi, F. Rotella, F. EspositoVia Orabona, 4 - 70126 Bari – ItalyVia Orabona, 4 - 70126 Bari – Italy
{ferilli, esposito}@di.uniba.it{ferilli, esposito}@di.uniba.it{fabio.leuzzi, fulvio.rotella}@uniba.it{fabio.leuzzi, fulvio.rotella}@uniba.itL.A.C.A.M.
http://lacam.di.uniba.it
IntroductionIntroduction● Automatic document processing a hot topic
― Layout analysis a fundamental step● Identification of frames (relevant components in the document)● Performance can determine quality and feasibility of the whole process
● Two different…● Kinds of sources: Digitized (scanned) vs. Natively digital documents● Categories of layouts: Manhattan vs. Non-Manhattan● Types of algorithms: Top-down vs. Bottom-up
● Run Length Smoothing Algorithm● Manhattan Layout
● Other works exploit or try to improve the RLSA by setting its parameters● Many works on Manhattan layout
― Top-down strategies● Less works on non-Manhattan layout
― Bottom-up strategies
● The Manhattan assumption holds for many typeset documents, simplifies document processing…BUT cannot be assumed in general
RLSO RLSO Application to scanned imagesApplication to scanned images
RLSO RLSO (Run Length Smoothing with OR)(Run Length Smoothing with OR)
2) vertical smoothingvertical smoothing with threshold tv, column by column
● llooggiiccaall OORR of the images obtained in steps 1 and 2
1) horizontal smoothinghorizontal smoothing with threshold th, row by row
th = 5
tv = 4
(AND)
RLSO RLSO Application to scanned imagesApplication to scanned images
?
RLSO RLSO Application to born-digital documentsApplication to born-digital documents
● Set horizontal/vertical distance thresholds th/t
v
● build a frame for each basic block
● H ={(dh, b’, b’’) | b’ and b’’ are horizontally adjacenthorizontally adjacent basic blocks
and dh is the horizontal distance between them}
● for all (dh,1
, b’h,1
, b’’h,1
) ∈ H s.t. dh,1
≤ th merge the frames to which b’
h,1, b’’
h,1
belong
● V = {(dv, b’, b’’) | b’ and b’’ are vertically adjacentvertically adjacent basic blocks
and dv is the vertical distance between them}
● for all (dv,1
, b’h,1
, b’’h,1
) ∈ V s.t. dv,1
≤ tv merge the frames to which b’
h,1, b’’
h,1 belong
Reference blockReference blockAdjacent blocksAdjacent blocks
Non-adjacent blocksNon-adjacent blocksHorizontal distanceHorizontal distance
Vertical distanceVertical distance
RLSO RLSO Application to born-digital documentsApplication to born-digital documents
● Run Length Smoothing algorithms based on thresholds
― Hard to properly set manually (Not typical human activity)
― Heuristic approaches (Ad hoc)
― Tampers the idea of automatic processing
― Fixed thresholds not suitable to documents with several different
spacings
Automatic assessment of RLSO thresholds
RLSO RLSO
RLSO RLSO Automatic threshold assessment Automatic threshold assessment
● Study of Run Lengths behavior
― Histogram very irregular● Peaks = most frequent spacings● Peak clusters = equally spaced
components― Hard to exploit by automatic
techniques
― Cumulative histograms more regular― Bar b = runs larger or equal than
b● Monotonically decreasing
― Flat zones = lengths for which no runs are present
● Scaled down to 10%― Reduces variability
H’(i) = ∑j≥ i
H(j)
Figure 1.
a fragment of scientific paper
● Select threshold on flat zones― Derivative a good indicator
● Slope = 0● Discrete approximation on bar
b:― Tolerance possible
● Slope = – 30― Skip starting and trailing flat
zones● Starting zone = missing small
run lengths● Trailing zone = merge whole
content
● Iteration of technique on previously smoothed image― Finds progressively more
spaced components
b
(Figure 1-a/1-b) successive application of RLSO with automatic threshold assessment on Figure 1.
Figure 1-a.
Figure 1-b.
RLSO RLSO Automatic threshold assessment Automatic threshold assessment
Sample EvaluationSample Evaluation
ConclusionsConclusions● RLSO (Run Length Smoothing with OR) identifies runs of white pixel in the
document image and fill them with black pixels whenever they are shorter than a
given threshold
– Both Manhattan and Non-Manhattan Layout
– Version for natively digital documents
● Automatic thresholding effective on documents having
– single character size
– different spacings
● Good baseline towards more complex documents
– different character sizes
– graphics
● Current and future Work
– Stop criterion for iteration
– Clustering based on positioning and spacing