D01 choueka dershowitz_word_spotting_algorithm

Querying a Large Corpus of

Historical Handwritten Manuscripts

Using Word-Spotting Algorithms

Yaacov Choueka, Adiel ben-Shalom

The Friedberg Genizah Project

Nachum Dershowitz, Lior Wolf, Adi Silberfenig

School of Computer Science, Tel Aviv University

Minerva 2015, Jerusalem

The Problem: find all occurrences of a

given query-word in all the manuscripts

of the corpus (arbitrary language, arbitrary script)

Example:

The Cairo Genizah Corpus

360,000 fragments

Hebrew characters, Hebrew and Arabic languages

The query: בראשית (Hebrew for “In the beginning”)

Simple Solution: full-text search

KWIC Output
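A keyword-in-context (KWIC) listing over transcribed text can be sketched as below. This is only an illustration of the output format, not the project's search software; the function name `kwic`, the `width` parameter and the English sample sentence are all assumptions made for the example.

```python
def kwic(text, query, width=2):
    """Return (left context, keyword, right context) for every hit of `query`.

    `width` is the number of context words kept on each side (assumed default).
    """
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        if word == query:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits


# Hypothetical transcribed text; a real corpus would hold many documents.
sample = "in the beginning God created the heaven and the earth"
for left, word, right in kwic(sample, "the"):
    print(f"{left:>20} [{word}] {right}")
```

Each hit is shown with a few words on either side, which is what makes a KWIC listing quick to scan. It only works on text that already exists in electronic form.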

The catch:

The software can search only

manuscripts that have been

transcribed into electronic form!

Usually, however, most of the manuscripts

are never transcribed!

In the Genizah case:

480,000 images are available

only 40,000 (8%) have been transcribed!

OCR does not work well

for handwritten historical documents

אהבתי כי ישמע יהוה את

קולי תחנוני כי הטה אוזנו לי

ובימי אקרא אפפוני חבלי מות

ומצרי שאול מצאוני צרה

ויגון אמצא ובשם יהוה

אקרא אנה יהוה מלטה

נפשי חנון יהוה וצדיק ואלוהינו

מרחם שומר פתאים יהוה

דלותי ולי יהושיע שובי נפשי

למנוחיכי כי יהוה גמל עליכי

כי חלצת נפשי ממות את עיני

מדמעה את רגלי מדחי

אתהלך לפני יהוה בארצות

החיים האמנתי כי אדבר אני

אדזבעיכישעידודארוליעחנוניכי

דסראזנויוביסיארראאוניחבליש

תומצרישאולצאוניצדוגוןאמצאו

בשםידוארראאנאידודלטכשינון

ידודוצדידואדינוסרחסשוערתאי

סיזוזדלייייליידושיעשובינשילסנ

וחיכיכיידודגמלעיכיכיחלצתנשי

ממועאעעיניסדסעדאערגליאעד

לךלניידודבארדחייפדאסנעיכיא

דבראניגליאעדל

OCR Transcription

Search for the image

of the query word

(and not for its text)

The word-spotting approach:

Given one (or more)

image(s)

of a query word,

find all occurrences of

similar images in the

corpus collection of

manuscripts’ images

Word-spotting

Query:

1. Binarization
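The slides do not say which binarization method is used. As one possible sketch, the block below applies a global Otsu threshold to a grayscale image stored as nested lists; the choice of Otsu, the function names and the tiny sample page are assumptions for illustration only.

```python
def otsu_threshold(gray):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    count_dark = 0   # number of pixels with value <= t
    sum_dark = 0     # sum of the values of those pixels
    for t in range(256):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_light = total - count_dark
        if count_dark == 0 or count_light == 0:
            continue
        mean_dark = sum_dark / count_dark
        mean_light = (sum_all - sum_dark) / count_light
        var_between = count_dark * count_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map dark pixels (ink) to 1 and light pixels (background) to 0."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]


# Hypothetical 3x4 grayscale page: low values are ink, high values are parchment.
page = [
    [250, 245, 30, 240],
    [248, 25, 20, 251],
    [252, 249, 35, 247],
]
print(binarize(page))
```

A single global threshold is the simplest choice. Stained or faded manuscripts may need a locally adaptive method instead.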

2. Extracting Word-Candidates

(“Patches”) From a Manuscript’s Image
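How the patches are extracted is not described on the slide. One simple stand-in, assumed here purely for illustration, is to take the bounding box of every 4-connected ink region in the binarized image. A real system would still have to merge letters into word-sized candidates.

```python
from collections import deque


def component_boxes(binary):
    """Return inclusive boxes (top, left, bottom, right) of 4-connected ink regions."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical binarized image with two separate ink regions.
ink = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(component_boxes(ink))
```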

3. Patch Normalization

Normalizing every patch into a standard grid

of 8960 pixels (20*7 cells of 8*8 pixels each)
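The grid arithmetic can be checked with a small sketch. It assumes the 20 cells run across the word and the 7 cells run down, giving a 160 by 56 pixel patch, and it uses nearest-neighbour resampling. The slide states neither the orientation nor the interpolation method, so both are assumptions.

```python
CELL = 8                    # each cell is 8*8 pixels (from the slide)
CELLS_X, CELLS_Y = 20, 7    # assumed orientation: 20 cells across, 7 cells down
WIDTH, HEIGHT = CELLS_X * CELL, CELLS_Y * CELL   # 160 by 56 pixels


def normalize_patch(patch, width=WIDTH, height=HEIGHT):
    """Resize a 2-D list of pixels to height rows by width columns (nearest neighbour)."""
    src_h, src_w = len(patch), len(patch[0])
    return [
        [patch[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2x2 patch, stretched to the standard grid.
patch = [[0, 1], [1, 0]]
grid = normalize_patch(patch)
print(len(grid), len(grid[0]), len(grid) * len(grid[0]))
```

Whatever the original size of a word candidate, every patch ends up with the same 8960 pixels, so all descriptors computed later have the same length.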

4. Image descriptors for every patch

Constructing, for every patch

an image-descriptor vector of

12,460 real numbers

140 cells * (31+58)=12,460

(31 features of HOG vector)

(58 features of LBP vector)
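The descriptor length follows directly from the grid in step 3. The short check below also counts the "uniform" 8-neighbour LBP patterns (circular bit patterns with at most two 0/1 transitions), of which there are exactly 58. Reading the slide's 58 LBP features as these uniform patterns is an assumption, since the slide does not say which LBP variant is used.

```python
def is_uniform(pattern, bits=8):
    """True if the circular bit pattern has at most two 0/1 transitions."""
    transitions = 0
    for i in range(bits):
        current_bit = (pattern >> i) & 1
        next_bit = (pattern >> ((i + 1) % bits)) & 1
        if current_bit != next_bit:
            transitions += 1
    return transitions <= 2


# Count uniform patterns among all 256 possible 8-bit LBP codes.
uniform_patterns = sum(1 for p in range(256) if is_uniform(p))

CELLS = 20 * 7       # grid from step 3
HOG_PER_CELL = 31    # from the slide
LBP_PER_CELL = 58    # from the slide
descriptor_length = CELLS * (HOG_PER_CELL + LBP_PER_CELL)

print(uniform_patterns)
print(descriptor_length)
```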

5. Dimension Reduction

A matrix of M rows and 12,460 columns (one row per patch: Patch 1, Patch 2, Patch 3, ..., Patch M)

M = Total Number of Patches in all images of the corpus

is reduced by PCA (Principal Component Analysis)

to a matrix of M rows and 1000 columns (one row per patch)
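To make the reduction step concrete, here is a toy PCA in plain Python that finds the leading principal directions by power iteration and projects each row onto them. It only illustrates the idea on a 3-dimensional example. At the scale described in the slides one would use an optimized numerical library, and every function name here is an assumption.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance(rows, means):
    """Sample covariance matrix of the rows (d x d, as nested lists)."""
    n, d = len(rows), len(means)
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_components(cov, k, iters=200):
    """Top-k eigenvectors of a symmetric matrix: power iteration with deflation.

    Starts from the all-ones direction, so it assumes that direction is not
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    cov = [row[:] for row in cov]          # work on a copy
    components = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        for i in range(d):                 # remove the direction just found
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return components


def project(rows, means, components):
    """Coordinates of each mean-centered row along each component."""
    return [[sum((r[j] - means[j]) * c[j] for j in range(len(c)))
             for c in components] for r in rows]


# Hypothetical data: 4 "patches" with 3 features, all lying on one line.
patches = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
means = column_means(patches)
components = top_components(covariance(patches, means), k=1)
reduced = project(patches, means, components)
print(components[0])
print(reduced)
```

Here three columns shrink to one without losing information because the toy data is one-dimensional. Going from 12,460 to 1000 columns keeps only the directions of largest variance.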

6. Similarity Computation

Computing an efficient

similarity measure

between

the query-reduced vector

and

the reduced vectors

of all patches of all

images in the corpus

Dataset: M rows of 1000 values each (Patch 1, Patch 2, Patch 3, ..., Patch M)

Query: Query Patch 1, a single row of 1000 values

Result: M values; value number i is the similarity of the Query Patch to Patch number i
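The slides do not name the similarity measure. As a sketch, the block below assumes cosine similarity between the reduced query vector and every reduced patch vector; the function names and the 3-dimensional toy vectors (standing in for 1000-dimensional ones) are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarities(query, dataset):
    """Result vector: entry i is the similarity of the query to patch number i."""
    return [cosine(query, row) for row in dataset]


# Hypothetical reduced vectors for three patches and one query.
dataset = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query = [2.0, 0.0, 0.0]
scores = similarities(query, dataset)
print(scores)
```

If all patch vectors are normalized to unit length off-line, the whole result vector becomes a single matrix-vector product, which is one way the on-line step could be kept fast.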

7. Result

Sort the results by decreasing similarity

and display the patches with the best

similarity to the query
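Because only the best few patches are displayed, the full result vector need not be sorted. The sketch below picks the top k with Python's `heapq.nlargest`; the helper name `top_matches` and the sample scores are made up for illustration.

```python
import heapq


def top_matches(scores, k):
    """Return the k best (patch_index, score) pairs, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Hypothetical similarity scores for five patches.
scores = [0.12, 0.93, 0.40, 0.88, 0.05]
print(top_matches(scores, 3))
```

For M scores and a small k this takes roughly M log k steps rather than the M log M of a full sort, which matters when M is in the millions.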

Two Tests

1. George Washington – Handwritten

2. Lord Byron – Printed

20 pages, about 5000 words each

                          1. Handwritten    2. Printed
Precision                 50%               91%
Single query              0.08 sec          0.03 sec
Pre-processing per page   46 sec            3 sec

Current Problems

1. Efficiently building (off-line, in terms

of space and time) compact image-

descriptors for all patches from all

(half-a-million) images.

2. Building an efficient (on-line) system

for comparing the query vector to all

(100 million?) patches’ vectors
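A back-of-the-envelope estimate shows why both problems are hard. The figures for images, patches and dimensions come from the slides. The 4 bytes per value, the reuse of the 46 seconds per page from the handwritten test, and the brute-force comparison of the query against every patch are assumptions.

```python
IMAGES = 500_000             # "half-a-million" images (from the slide)
PATCHES = 100_000_000        # "100 million?" patches (the slide's own guess)
RAW_DIMS = 12_460            # descriptor length before PCA
REDUCED_DIMS = 1_000         # descriptor length after PCA
BYTES_PER_VALUE = 4          # assumption: 32-bit floats
SECONDS_PER_PAGE = 46        # assumption: handwritten test timing carries over

raw_bytes = PATCHES * RAW_DIMS * BYTES_PER_VALUE
reduced_bytes = PATCHES * REDUCED_DIMS * BYTES_PER_VALUE
preprocess_days = IMAGES * SECONDS_PER_PAGE / 86_400
multiply_adds_per_query = PATCHES * REDUCED_DIMS   # brute-force dot products

print(f"raw descriptors:      {raw_bytes / 1e12:.2f} TB")
print(f"reduced descriptors:  {reduced_bytes / 1e12:.2f} TB")
print(f"pre-processing:       {preprocess_days:.0f} days on one machine")
print(f"multiply-adds/query:  {multiply_adds_per_query:.1e}")
```

Under these assumptions the reduced descriptors alone need about 0.4 TB, pre-processing takes the better part of a year on a single machine, and every query touches about 10^11 values, which suggests that off-line parallelism and some form of indexing or compression would be needed.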

When solved and implemented

it will offer

new horizons

to the study of large corpora

of historical documents

Thank You
