Querying a Large Corpus of
Historical Handwritten Manuscripts
Using Word-Spotting Algorithms
Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project
Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University
Minerva 2015,
Jerusalem
The Problem: find all occurrences of a
given query-word in all the manuscripts
of the corpus (arbitrary language, arbitrary script)
Example:
The Cairo Genizah Corpus
360,000 fragments; Hebrew characters; Hebrew and Arabic languages
The query: בראשית
The catch:
The software can search only
manuscripts that have been
transcribed into electronic form!
Usually, however, most of the manuscripts
are never transcribed!
In the Genizah case:
480,000 images are available
only 40,000 (8%) have been transcribed!
OCR does not work well
for handwritten historical documents
אהבתי כי ישמע יהוה את
קולי תחנוני כי הטה אוזנו לי
ובימי אקרא אפפוני חבלי מות
ומצרי שאול מצאוני צרה
ויגון אמצא ובשם יהוה
אקרא אנה יהוה מלטה
נפשי חנון יהוה וצדיק ואלוהינו
מרחם שומר פתאים יהוה
דלותי ולי יהושיע שובי נפשי
למנוחיכי כי יהוה גמל עליכי
כי חלצת נפשי ממות את עיני
מדמעה את רגלי מדחי
אתהלך לפני יהוה בארצות
החיים האמנתי כי אדבר אני
אדזבעיכישעידודארוליעחנוניכי
דסראזנויוביסיארראאוניחבליש
תומצרישאולצאוניצדוגוןאמצאו
בשםידוארראאנאידודלטכשינון
ידודוצדידואדינוסרחסשוערתאי
סיזוזדלייייליידושיעשובינשילסנ
וחיכיכיידודגמלעיכיכיחלצתנשי
ממועאעעיניסדסעדאערגליאעד
לךלניידודבארדחייפדאסנעיכיא
דבראניגליאעדל
OCR Transcription
Word-spotting
Given one (or more) image(s) of a query word,
find all occurrences of similar images in the
corpus collection of manuscript images
Query: (image of the query word)
3. Patch Normalization
Normalizing every patch into a standard grid
of 8,960 pixels (20 × 7 = 140 cells of 8 × 8 = 64 pixels each)
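The normalization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the grid is 20 cells wide and 7 cells high (the slide does not state the orientation) and uses plain nearest-neighbour resampling, and the names `normalize_patch` and `split_into_cells` are hypothetical.

```python
import numpy as np

CELL = 8                    # pixels per cell side (from the slide)
CELLS_W, CELLS_H = 20, 7    # assumed orientation: 20 cells wide, 7 cells high
GRID_H, GRID_W = CELLS_H * CELL, CELLS_W * CELL   # 56 x 160 = 8,960 pixels

def normalize_patch(patch):
    """Resize a 2-D grayscale patch to GRID_H x GRID_W (nearest neighbour)."""
    h, w = patch.shape
    rows = np.arange(GRID_H) * h // GRID_H   # source row for each target row
    cols = np.arange(GRID_W) * w // GRID_W   # source column for each target column
    return patch[np.ix_(rows, cols)]

def split_into_cells(grid):
    """View the normalized grid as (7, 20, 8, 8): one 8x8 block per cell."""
    return grid.reshape(CELLS_H, CELL, CELLS_W, CELL).swapaxes(1, 2)

# Toy input: a random "patch" of arbitrary size stands in for a word image.
patch = np.random.default_rng(0).random((37, 113))
grid = normalize_patch(patch)
cells = split_into_cells(grid)
```

Whatever the size of the input patch, every patch ends up with the same 140-cell layout, which is what makes the later descriptor vectors comparable.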
4. Image descriptors for every patch
Constructing, for every patch,
an image-descriptor vector of
12,460 real numbers:
140 cells × (31 + 58) = 12,460
(31 features of the HOG, Histogram of Oriented Gradients, vector per cell)
(58 features of the LBP, Local Binary Patterns, vector per cell)
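The layout of the descriptor vector can be sketched as below. Only the dimensions (140 cells, 31 + 58 features per cell) come from the slide. The functions `hog_cell` and `lbp_cell` are hypothetical placeholders that return zeros, and the order in which cells and features are concatenated is an assumption.

```python
import numpy as np

N_CELLS = 20 * 7            # 140 cells per normalized patch (from the slide)
HOG_DIM, LBP_DIM = 31, 58   # per-cell feature counts (from the slide)

def hog_cell(cell):
    # Placeholder only: a real version would return 31 HOG features for the cell.
    return np.zeros(HOG_DIM)

def lbp_cell(cell):
    # Placeholder only: a real version would return 58 LBP features for the cell.
    return np.zeros(LBP_DIM)

def describe(cells):
    """Concatenate per-cell HOG and LBP features into one patch descriptor.

    cells: array of shape (7, 20, 8, 8), one 8x8 pixel block per cell.
    """
    per_cell = [np.concatenate([hog_cell(c), lbp_cell(c)])
                for c in cells.reshape(-1, 8, 8)]
    return np.concatenate(per_cell)

descriptor = describe(np.zeros((7, 20, 8, 8)))
```

With real feature extractors in place of the placeholders, the output length stays 140 × 89 = 12,460.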
5. Dimension Reduction
An M × 12,460 matrix (one row per patch, Patch 1 to Patch M)
is reduced by PCA (Principal Component Analysis)
to an M × 1000 matrix.
M = total number of patches in all images of the corpus
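A minimal PCA sketch, assuming the textbook formulation via a singular value decomposition of the centered data. The slides do not say how the projection is computed at corpus scale, so this shows only the idea on toy sizes (200 × 50 reduced to 10 columns instead of M × 12,460 reduced to 1000). The names `pca_fit` and `pca_transform` are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean and the top-k principal directions of the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_transform(X, mean, components):
    """Project rows of X onto the principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # toy stand-in for the M x 12,460 matrix
mean, components = pca_fit(X, k=10)
Z = pca_transform(X, mean, components)    # toy stand-in for the M x 1000 matrix
```

The same `mean` and `components` must be applied to the query descriptor, so that the query and the corpus patches live in the same reduced space.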
6. Similarity Computation
Computing an efficient similarity measure between
the reduced query vector and
the reduced vectors of all patches of all
images in the corpus
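The query patch vector (1000 values) is compared with each of the
M rows (Patch 1 to Patch M, 1000 values each) of the dataset matrix.
The result is a vector of M values, where entry i is the
similarity of the query patch to patch number i.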
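The slides do not name the similarity measure. As one plausible choice, the sketch below assumes cosine similarity, which reduces the whole comparison to a single matrix-vector product. The function name `cosine_scores` and the toy sizes are illustrative.

```python
import numpy as np

def cosine_scores(query, dataset):
    """Cosine similarity of one query vector against every row of dataset."""
    q = query / np.linalg.norm(query)
    d = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    return d @ q    # one score per patch, shape (M,)

rng = np.random.default_rng(1)
dataset = rng.normal(size=(500, 1000))                # toy M x 1000 matrix
query = dataset[42] + 0.01 * rng.normal(size=1000)    # noisy copy of patch 42
scores = cosine_scores(query, dataset)
```

In practice the row normalization of the dataset would be done once, off-line, so that each query costs only the product `d @ q`.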
7. Result
Sort the results by decreasing similarity
and display the patches with the best
similarity to the query
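When only the best few patches are displayed, a full sort of all M scores is not needed. The sketch below, with the hypothetical helper `top_k`, first selects the k largest scores and then sorts only those. This is one possible design, not necessarily what the authors did.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest scores, ordered by decreasing similarity."""
    k = min(k, len(scores))
    # argpartition picks the k largest without sorting the whole array.
    idx = np.argpartition(-scores, k - 1)[:k]
    # Only the k selected scores are then fully sorted.
    return idx[np.argsort(-scores[idx])]

scores = np.array([0.10, 0.95, 0.40, 0.80, 0.05])
best = top_k(scores, 3)    # indices of the patches to display first
```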
Two Tests
1. George Washington – Handwritten
2. Lord Byron – Printed
20 pages, about 5000 words each

                          Test 1     Test 2
Precision                 50%        91%
Single query              0.08 sec   0.03 sec
Pre-processing per page   46 sec     3 sec
Current Problems
1. Efficiently building (off-line, in terms
of space and time) compact image-
descriptors for all patches from all
(half-a-million) images.
2. Building an efficient (on-line) system
for comparing the query vector to all
(100 million?) patches’ vectors
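A rough storage estimate shows why both problems matter. The sketch assumes 4-byte (float32) values and takes the slide's tentative figure of 100 million patches at face value. The function `storage_gib` is illustrative.

```python
def storage_gib(n_vectors, dim, bytes_per_value=4):
    """Size in GiB of n_vectors dense vectors of length dim (float32 assumed)."""
    return n_vectors * dim * bytes_per_value / 2**30

raw = storage_gib(100_000_000, 12_460)      # full HOG+LBP descriptors
reduced = storage_gib(100_000_000, 1_000)   # after PCA to 1000 dimensions
```

Under these assumptions the reduced vectors need about 373 GiB and the unreduced descriptors about 4.5 TiB, so even after PCA the on-line comparison cannot simply hold everything in the memory of an ordinary machine.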
When solved and implemented
it will offer
new horizons
to the study of large corpora
of historical documents