Language-Independent Text Line Extraction from Historical Document Images First International...
2
Motivation
Historical handwritten manuscripts are valuable cultural heritage
Providing insights into both tangible and intangible cultural aspects from the past
Efforts to understand, manipulate and archive historical manuscripts
Digitization increases accessibility and allows automatic processing
*Courtesy: - wadod.com - Genizah Project
3
Outline
Background
Challenges
Seam Carving
Text line representation by seams
Energy Map
Seam Generation
Experimental Results
Summary
6
Connectivity & Components
4-Neighborhood, 8-Neighborhood
We can define 4- or 8-paths depending on the type of connectivity specified.
A set of pixels S is a Connected Component if for each pixel pair (x1,y1) ∈ S and (x2,y2) ∈ S there is a path between them such that every two successive pixels in the path are in S and are X-neighbors (X = 4, 8).
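The definition above can be illustrated with a small labeling routine. This is a minimal sketch, not code from the talk: the function name `label_components` and the breadth-first strategy are assumptions chosen for illustration, and the image is assumed to be a list of rows with foreground pixels equal to 1.

```python
from collections import deque

# Neighbor offsets (row, col) for the two connectivity types.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def label_components(image, connectivity=8):
    """Label foreground pixels (value 1) with component ids 1, 2, ...

    Hypothetical helper: returns (labels, number_of_components).
    """
    offsets = N8 if connectivity == 8 else N4
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c] != 0:
                continue
            # Start a new component and flood it through X-neighbors.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Two pixels that touch only diagonally.
diagonal = [[1, 0],
            [0, 1]]
print(label_components(diagonal, 4)[1])  # 2 components under 4-connectivity
print(label_components(diagonal, 8)[1])  # 1 component under 8-connectivity
```

The diagonal example shows why the choice of X matters: the same two pixels form one component or two depending on the connectivity.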
8
Distances
Given two points P = (u,v) and Q = (x,y):
Euclidean Distance: $d_e(P,Q) = \sqrt{(x-u)^2 + (y-v)^2}$
City Block Distance: $d_4(P,Q) = |x-u| + |y-v|$
Chessboard Distance: $d_8(P,Q) = \max(|x-u|, |y-v|)$
Example: P = (1,8), Q = (4,1):
$d_e = \sqrt{3^2 + (-7)^2} \approx 7.6$; $d_4 = 10$; $d_8 = 7$
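The three metrics are easy to check numerically. The sketch below uses hypothetical function names and recomputes the slide's example for P = (1,8) and Q = (4,1).

```python
import math


def euclidean(p, q):
    """Straight-line distance between points p and q."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def city_block(p, q):
    """d_4: sum of the absolute coordinate differences."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


def chessboard(p, q):
    """d_8: largest absolute coordinate difference."""
    return max(abs(q[0] - p[0]), abs(q[1] - p[1]))


P, Q = (1, 8), (4, 1)
print(round(euclidean(P, Q), 1), city_block(P, Q), chessboard(P, Q))  # 7.6 10 7
```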
9
Distance transform
Given a set of pixels S, calculate the distance of all other pixels to S
The pixels in the set S are considered reference pixels
Let P be a pixel of the image and S the reference set. We scan the image using a pre-defined connectivity:
First pass: consider the green pixels (N1)
11
Distance transform – (cont’d)
Binary Representation (1 = reference pixels)
0 0 0 0 1 1 1 0 0 0
0 0 0 0 1 1 1 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 0 1 1 1 0 0 0 0
0 0 0 1 1 1 0 0 0 0
0 0 0 1 1 1 0 0 0 0
0 0 1 1 1 1 0 0 0 0
0 0 1 1 1 0 0 0 0 0
Distance transform (chessboard metric)
3 2 2 1 0 0 0 1 2 3
3 2 1 1 0 0 0 1 2 3
3 2 1 0 0 0 0 1 2 3
3 2 1 0 0 0 1 1 2 3
2 2 1 0 0 0 1 2 2 3
2 1 1 0 0 0 1 2 3 3
2 1 0 0 0 0 1 2 3 4
2 1 0 0 0 1 1 2 3 4
Alef letter (Arabic): printed and handwritten
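A two-pass scan of this kind can be sketched as follows. This is an illustrative implementation of the classic forward/backward raster scan with unit cost for all eight neighbors, which yields the chessboard distance; the function name and the choice of neighbor masks are assumptions, since the slides only name the first pass. The input is the binary Alef pattern from this slide.

```python
def chessboard_distance_transform(binary):
    """Distance of every pixel to the nearest reference pixel (value 1)."""
    rows, cols = len(binary), len(binary[0])
    inf = rows + cols  # larger than any chessboard distance in the image
    dist = [[0 if binary[r][c] == 1 else inf for c in range(cols)]
            for r in range(rows)]
    # Forward pass: top-left to bottom-right, using already visited neighbors.
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dist[r][c] = min(dist[r][c], dist[nr][nc] + 1)
    # Backward pass: bottom-right to top-left, using the mirrored neighbors.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            for dr, dc in ((0, 1), (1, 1), (1, 0), (1, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dist[r][c] = min(dist[r][c], dist[nr][nc] + 1)
    return dist


alef = [
    [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
]
print(chessboard_distance_transform(alef)[0])  # [3, 2, 2, 1, 0, 0, 0, 1, 2, 3]
```

Each pixel is visited twice regardless of image content, so the cost is linear in the number of pixels.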
12
Sign Distance transform
3 2 2 1 0 0 0 1 2 3
3 2 1 1 0 -1 0 1 2 3
3 2 1 0 -1 -1 0 1 2 3
3 2 1 0 -1 0 1 1 2 3
2 2 1 0 -1 0 1 2 2 3
2 1 1 0 -1 0 1 2 3 3
2 1 0 -1 -1 0 1 2 3 4
2 1 0 0 0 1 1 2 3 4
Sign distance transform (chessboard metric)
Alef letter: printed and handwritten
13
Sign Distance transform – (cont’d)
Sign Distance transform (SDT)
Original Document Image
The brighter the color the larger the distance from reference pixels
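One way to obtain such a signed map is to run an unsigned distance transform twice, once toward the ink and once toward the background. The sketch below is an assumption-laden illustration rather than the authors' implementation: the helper names are hypothetical, the chessboard metric is used on both sides, boundary ink pixels are assigned 0, and the image border is not treated as background, so values inside components may differ from those on the slides.

```python
def distance_to(mask):
    """Two-pass chessboard distance from every cell to the nearest True cell."""
    rows, cols = len(mask), len(mask[0])
    inf = rows + cols
    d = [[0 if mask[r][c] else inf for c in range(cols)] for r in range(rows)]
    passes = (
        (range(rows), range(cols), ((0, -1), (-1, -1), (-1, 0), (-1, 1))),
        (range(rows - 1, -1, -1), range(cols - 1, -1, -1),
         ((0, 1), (1, 1), (1, 0), (1, -1))),
    )
    for row_order, col_order, offsets in passes:
        for r in row_order:
            for c in col_order:
                for dr, dc in offsets:
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        d[r][c] = min(d[r][c], d[nr][nc] + 1)
    return d


def signed_distance_transform(binary):
    """Positive outside ink, 0 on the ink boundary, negative in the ink interior."""
    ink = [[v == 1 for v in row] for row in binary]
    paper = [[not v for v in row] for row in ink]
    to_ink = distance_to(ink)
    to_paper = distance_to(paper)
    return [[-(to_paper[r][c] - 1) if ink[r][c] else to_ink[r][c]
             for c in range(len(binary[0]))]
            for r in range(len(binary))]


block = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
for row in signed_distance_transform(block):
    print(row)
```

For the 3 x 3 block, only the center pixel is interior, so it is the only negative entry.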
14
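Gradient
A gray-scale image I is defined as a two-dimensional function: I(x,y) = gray level
The gradient of the image ($\nabla I$) is given by the formula:
$\nabla I = (I_x, I_y) = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)$
Where:
$\frac{\partial I}{\partial x}$ is the derivative of the image in the horizontal direction
$\frac{\partial I}{\partial y}$ is the derivative of the image in the vertical direction
The magnitude of the gradient is defined by:
$|\nabla I| = \sqrt{I_x^2 + I_y^2}$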
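The gradient magnitude can be approximated on a discrete image with finite differences. The sketch below assumes central differences with clamping at the image border; both choices, and the function name, are illustrative assumptions, since the slides do not specify a discretization.

```python
import math


def gradient_magnitude(image):
    """Approximate |grad I| at every pixel of a gray-scale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    mag = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Central differences; indices are clamped at the border.
            ix = (image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]) / 2.0
            iy = (image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]) / 2.0
            mag[y][x] = math.hypot(ix, iy)
    return mag


# A vertical edge: dark on the left, bright on the right.
edge = [[0, 0, 10, 10] for _ in range(3)]
print(gradient_magnitude(edge)[1])  # [0.0, 5.0, 5.0, 0.0]
```

The response is large only next to the edge, which is why the gradient image works as an energy function for seam carving.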
16
Background
Pre-Processing: De-noising, Binarization
Segmentation: Page Layout Analysis, Text-line and word Segmentation
Indexation and Recognition
Original *Courtesy: Islamic manuscript, Leipzig University Library, Germany
17
Text-line Extraction
Assigning the same color to each text line
Original Manuscript
Processed Manuscript
*Courtesy: Juma Al-majid Center for Culture and Heritage, Dubai.
يــ ث ت ب
حـ خـ جـ
19
Challenges
Historical handwritten documents pose different challenges than machine-printed ones:
Different slope (within the same line)
Delayed strokes
Overlapping components
Looser layout format
Line proximity
Multi-oriented lines
Touching components
A 19th-century master's thesis – SAAB Medical Library, American University of Beirut
21
Seam Carving
Content-aware image resizing
Original Image, Calculated seams, Resized
An energy function defines an energy value for each pixel
A seam is an optimal 8-connected path of low-energy pixels
Gradient Image
22
Seam Carving – (cont’d)
Let I be an n × m image. Define a vertical seam to be:
$s = \{s_i\}_{i=1}^{n} = \{(i, x(i))\}_{i=1}^{n}, \quad \text{s.t. } \forall i,\ |x(i) - x(i-1)| \le K$
where x is a mapping x : [1, . . . ,n] → [1, . . . ,m].
A seam contains one, and only one, pixel in each row of the image; otherwise a distorted image might be obtained.
The pixels of the path of a seam will therefore be:
$I_s = \{I(s_i)\}_{i=1}^{n} = \{I(i, x(i))\}_{i=1}^{n}$
One can change the value of K in the constraint and get either a simple column for K = 0, or even a completely disconnected set of pixels for larger K.
23
Seam Carving – (cont’d)
Given an energy function e, the cost of a seam is:
$E(s) = E(I_s) = \sum_{i=1}^{n} e(I(s_i)) = \sum_{i=1}^{n} e(I(i, x(i)))$
We look for the optimal seam s* that minimizes this cost:
$s^* = \arg\min_s E(s) = \arg\min_s \sum_{i=1}^{n} e(I(s_i))$
The optimal seam can be found using dynamic programming
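The dynamic-programming search for the optimal vertical seam can be sketched as below. This is a minimal illustration under stated assumptions: the energy is given directly as a list of rows (rather than computed from an image), the function name is hypothetical, and the parameter `k` plays the role of K in the seam constraint.

```python
def optimal_vertical_seam(energy, k=1):
    """Return (seam, cost): seam[i] is the column chosen in row i."""
    n, m = len(energy), len(energy[0])
    # cost[i][j] = minimal cost of any seam ending at pixel (i, j).
    cost = [row[:] for row in energy]
    for i in range(1, n):
        for j in range(m):
            lo, hi = max(0, j - k), min(m - 1, j + k)
            cost[i][j] = energy[i][j] + min(cost[i - 1][lo:hi + 1])
    # Backtrack from the cheapest entry of the last row.
    j = min(range(m), key=lambda c: cost[n - 1][c])
    total = cost[n - 1][j]
    seam = [j]
    for i in range(n - 1, 0, -1):
        lo, hi = max(0, j - k), min(m - 1, j + k)
        j = min(range(lo, hi + 1), key=lambda c: cost[i - 1][c])
        seam.append(j)
    seam.reverse()
    return seam, total


energy = [
    [1, 9, 9],
    [9, 1, 9],
    [9, 9, 1],
]
print(optimal_vertical_seam(energy, k=1))  # ([0, 1, 2], 3)
print(optimal_vertical_seam(energy, k=0))  # ([0, 0, 0], 19)
```

With k = 0 the seam degenerates to a straight column, matching the remark about the constraint on the previous slide.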
25
Text line representation by seams
Human perception of text lines
Tracks text lines by ink concentration and in-between line spaces
Two types of seams have been defined
*Courtesy: Wadod Center for manuscripts.
26
Text line representation by seams – (cont’d)
Original Document Image
Processed
The medial seam crosses the text area of a text line.
A separating seam is a path that passes between two consecutive text lines.
*Courtesy: Wadod Center for manuscripts.
Medial Seam
Separating Seam
Seam Seed
28
Energy Map
We use the Sign distance transform (SDT) as an energy map
In the SDT, pixel values are assigned according to their distance from the nearest reference pixel
Recall that distance values are negative inside connected components and positive in between them
Intuition: local minima and maxima determine the medial and separating seams, respectively
*Courtesy: Wadod Center for manuscripts
Sign Distance Transform (SDT)
Original Document Image
30
Seam Generation – (cont’d)
The SDT is traversed horizontally to compute a cumulative energy map, the Seam Map, for all possible connected seams. For each entry (i,j):
$map(i,j) = 2\,SDT(i,j) + \min_{l=-1}^{1} \left( w_l \cdot map(i+l, j-1) \right)$
Sign distance transform
Left-to-right pass
Right-to-left pass
The SDT is traversed with two passes to enhance text line patterns
The resulting two maps are bi-linearly interpolated
Interpolated map
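The recurrence for the seam map can be sketched as follows. Several details are assumptions made only for illustration: the function names are hypothetical, the weights w_l default to 1 because the slides do not give their values, the first column is initialized to 2·SDT(i,0), and the right-to-left pass is obtained by mirroring the input. The interpolation of the two maps is not shown, since the slides do not describe it in detail.

```python
def seam_map(sdt, weights=(1.0, 1.0, 1.0)):
    """Left-to-right cumulative map; weights are (w_-1, w_0, w_1)."""
    rows, cols = len(sdt), len(sdt[0])
    acc = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        acc[i][0] = 2.0 * sdt[i][0]  # assumed base case for the first column
    for j in range(1, cols):
        for i in range(rows):
            # Best weighted predecessor among rows i-1, i, i+1 of column j-1.
            best = min(weights[l + 1] * acc[i + l][j - 1]
                       for l in (-1, 0, 1) if 0 <= i + l < rows)
            acc[i][j] = 2.0 * sdt[i][j] + best
    return acc


def seam_map_right_to_left(sdt, weights=(1.0, 1.0, 1.0)):
    """Same recurrence, traversed from the last column to the first."""
    mirrored = [row[::-1] for row in sdt]
    return [row[::-1] for row in seam_map(mirrored, weights)]


# Toy SDT: one text line (negative values) between two gaps (positive values).
toy_sdt = [
    [1, 1, 1],
    [-1, -1, -1],
    [1, 1, 1],
]
for row in seam_map(toy_sdt):
    print(row)
```

In the toy example the middle row accumulates the most negative values, which is the pattern the medial seam later follows.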
31
Seam Generation – (cont’d)
The minimal entry of the last column is detected.
Backtrack from the minimal entry to find the medial seam.
Original Document Image
Seam Map – One pass
Seam Map – Two passes
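The backtracking step can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, the cumulative map is given as a list of rows, and predecessors are compared without weights (equivalent to unit weights in the recurrence).

```python
def backtrack_medial_seam(acc):
    """Return seam[j] = row index of the medial seam in column j."""
    rows, cols = len(acc), len(acc[0])
    # Start at the minimal entry of the last column.
    i = min(range(rows), key=lambda r: acc[r][cols - 1])
    seam = [i]
    for j in range(cols - 1, 0, -1):
        # Move to the cheapest of the (up to) three connected predecessors.
        candidates = [r for r in (i - 1, i, i + 1) if 0 <= r < rows]
        i = min(candidates, key=lambda r: acc[r][j - 1])
        seam.append(i)
    seam.reverse()
    return seam


cumulative = [
    [2.0, 0.0, -2.0],
    [-2.0, -4.0, -6.0],
    [2.0, 0.0, -2.0],
]
print(backtrack_medial_seam(cumulative))  # [1, 1, 1]
```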
33
Seam Generation – (cont’d)
Then why are separating seams needed?
Avoid recalculation of energy and seam maps after each line extraction
Avoid additional strokes classification (post processing)
34
Seam Generation – (cont’d)
Separating seams define the boundaries of text lines
Generated with respect to the medial seam of the corresponding text line
Grown from seam seeds toward the two sides of the image guided by the SDT
35
Seam Generation – (cont’d)
A seam fragment is a connected group of pixels defined as the closest local maxima along the vertical direction
Seam Map
Seam fragments with low priority are discarded
A seed candidate set is constructed
Medial Seam
Sign Distance Transform
The seed that generates the optimal (maximal cost) seam is chosen
36
Seam Generation – (cont’d)
The separating seams may diverge from the medial seam due to the fork of ridges
Before After
A spring force anchored at the medial seam guides the separating seams
37
Touching/Overlapping Components
Usually, crossing overlapping components is avoided gracefully
Touching components are split too, but not necessarily in the optimal position
Processed
Processed
39
Experimental Results
Dataset | Description | Language | Lines | Overlapping Components
Wadod | Wadod Center for Manuscripts | Arabic and Spanish | 1050 | 516
Al-Majid | Al-Majid Center for Culture and Heritage, Dubai | Arabic | 900 | 258
AUB | American University of Beirut | English | 420 | 485
Congress Library | Congress Library | English | 150 | 317
Total | | | 2520 | 1576
40
Experimental Results – (cont’d)
Table 1: correctness of text line extraction
Dataset | Correctness (%): Line | Lower | Upper | Medial
Wadod | 98 | 97 | 97 | 99
Al-Majid | 97 | 97 | 96 | 98
AUB | 95 | 94 | 95 | 96
Congress Library | 94.25 | 94 | 93 | 95
Table 2: crossed components
Dataset | Overlapping Components | Stroke Crossing (%)
Wadod | 516 | 9
Al-Majid | 258 | 2
AUB | 485 | 9
Congress Library | 317 | 10
43
Summary
Language-independent approach
Dynamic programming was used to find text lines
Saves energy map re-computing after text line extraction
Post processing steps are avoided
Crossing overlapping components was avoided in most cases
More research is still needed to split touching components optimally