From Web Documents to Old Books: Works in Progress in Graphics Recognition. Mathieu Delalandre, Meeting of Document Analysis Group, Computer Vision Center, Barcelona, Spain, Thursday 23rd November 2006


From Web Documents to Old Books

Works in Progress in Graphics Recognition

Mathieu Delalandre

Meeting of Document Analysis Group

Computer Vision Center, Barcelona, Spain

Thursday 23rd November 2006

Plan

• Short CV
• Vector Graphics Indexing and Retrieval
• Dropcap Image Retrieval

[Figure: laboratories and cities: LITIS (Rouen), CVC (Barcelona), SCSIT (Nottingham), L3i (La Rochelle)]

Short CV

Personal Information

Mathieu Delalandre, 32 years old

Academic Degrees

1995-1998 Lic.Sc in Electronics, Rouen University, France

1998-2001 M.Sc in Industrial Computing, Rouen University, France

Research Periods

Length      Position   Laboratory   Subject
6 months    Master     LITIS        symbol recognition
3 ½ years   PhD        LITIS        drawing understanding
5 months    Post-doc   SCSIT        vector graphics indexing
13 months   Post-doc   L3i          dropcap image retrieval
2 months    Contract   LITIS        performance evaluation
3 years     Post-doc   CVC          …

Plan

• Short CV
• Vector Graphics Indexing and Retrieval
• Dropcap Image Retrieval

<rect x="400" y="100" width="400" height="200" fill="yellow" stroke="navy" stroke-width="10" />

What are vector graphics?

Bitmap vs vector graphics

Vector graphics are more accurate and lighter
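A minimal sketch, assuming Python and its standard `xml.etree.ElementTree` module: the yellow rectangle shown above reduces to a handful of numeric attributes, which is why vector data is light and exact. The helper name `rect_bbox` is hypothetical and the stroke width is ignored.

```python
import xml.etree.ElementTree as ET

# The <rect> element from the slide, as a standalone XML fragment
SVG_RECT = ('<rect x="400" y="100" width="400" height="200" '
            'fill="yellow" stroke="navy" stroke-width="10" />')

def rect_bbox(svg_fragment):
    """Return (x1, y1, x2, y2) for a single SVG <rect> element (stroke ignored)."""
    node = ET.fromstring(svg_fragment)
    x, y = float(node.get("x", 0)), float(node.get("y", 0))
    w, h = float(node.get("width")), float(node.get("height"))
    return (x, y, x + w, y + h)

bbox = rect_bbox(SVG_RECT)
```

The whole shape is recovered from four numbers, whatever the display resolution.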

Known vector graphics formats

• AI (Adobe Illustrator)
• SVG (Scalable Vector Graphics)
• WMF (Windows Metafile)
• EPS (Encapsulated PostScript)
• DXF (AutoCAD)
• ClipArt

[Figure: examples: WMF pen, EPS plane, Clipart cheese]

Vector graphics are growing on the Web

2001 SVG 1.0

2004 SVG widely used: structured documents [Mong’03], geographic maps [Chen’04], technical drawings [Kang’04]

2005 Powerful editors (Inkscape, Webdraw, …)

2006 Internet Explorer and Mozilla Firefox support SVG

Applications of vector graphics

1982 Computer Aided Design (DXF ‘1982’)

1985 Office software (PS ‘1985’, CGM ‘1987’, WMF ‘1993’)

1996 Web (PNG ‘1996’, SVG ‘2001’, …)

Vector Graphics Indexing and Retrieval

System overview [Doer’98] [Tom’03]

Looks like a pattern recognition approach

[Figure: documents (Doc 1, Doc 2, Doc 3) go through Features Extraction into an Index used for Retrieval. Content adaptation: graphics objects (square, junction, line, adjacency, inclusion) described with Model 1, Model 2 and Model 3. Structured index: indexed objects organised in Level 1, Level 2 and Level 3, with pattern frequencies and ranked patterns]

Our key ideas

[Figure: Doc 1, Doc 2, Doc 3 → Features Extraction (content adaptation) → structured index → Retrieval]

Content adaptation: the indexing process must be adapted to the document content

Structured index: we can improve results by structuring the index

Vector Graphics Indexing and Retrieval

Our approach

Before retrieval, we need to extract features

[Figure: a drawing with regions R1 and R3. You see 5, you have 9]

set of objects → parsing and break-up → set of lines → filtering, then junction detection → set of broken lines

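What are the difficulties?

[Figure: two overlapping rectangles forming the regions R1, R2 and R3]

<rect x="400" y="100" width="400" height="200" fill="blue" />
<rect x="650" y="200" width="400" height="200" fill="yellow" />

How to get R2? We need a break-up

How to speed up the process? Sorting the bounding boxes

$x_{j1} \le x_i \le x_{j2} \;\wedge\; y_{j1} \le y_i \le y_{j2}$

[Figure: two bounding boxes with x-coordinates x21, x11, x12, x22 and y-coordinates y21, y11, y12, y22]

We need a clean-up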
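A minimal Python sketch of the break-up idea, under two assumptions that are not stated on the slide: R2 is taken to be the overlap of the two rectangles, and "sorting the bounding box" is read as sorting boxes by their left edge so that most pairs are never tested. The function names are hypothetical.

```python
def intersection(a, b):
    """Overlap of two boxes (x1, y1, x2, y2), or None if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def overlapping_pairs(boxes):
    """Sort boxes by left edge, then test only boxes whose x-ranges can still overlap."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if boxes[j][0] >= boxes[i][2]:
                break  # every later box starts further right: no overlap possible
            if intersection(boxes[i], boxes[j]) is not None:
                pairs.append((i, j))
    return pairs

# The two rectangles from the slide, as (x1, y1, x2, y2), plus a distant third box
blue = (400, 100, 800, 300)
yellow = (650, 200, 1050, 400)
r2 = intersection(blue, yellow)
pairs = overlapping_pairs([blue, yellow, (2000, 0, 2100, 50)])
```

With the sort, the distant third box is discarded after a single comparison instead of a full intersection test against every other box.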

Vector Graphics Indexing and Retrieval

Our approach (next)

Line graph building
• Polyline: while the edge is 2-connected
• Junction: if the node is 3-connected

Region detection [Wen’01]
• Polygon: from a starting vector, repeatedly take the nearest vector (1, 2, 3)

Adjacency and inclusion
• Adjacency: common vector
• Inclusion: included bounding box

Result example

[Figure: result with gravity centers and adjacency, line and inclusion links]

Time processing on the ‘Mikado’ database

Working on graphs takes time

Using vectorial data
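The line graph building step can be sketched as follows in Python. This is only an illustration under assumptions: segments are given as pairs of end points, a junction is any node with three or more neighbours, and a polyline is a chain passing only through 2-connected nodes. Closed loops made only of 2-connected nodes are not handled, and the function names are hypothetical.

```python
from collections import defaultdict

def build_graph(segments):
    """Adjacency map: end point -> set of neighbouring end points."""
    graph = defaultdict(set)
    for p, q in segments:
        graph[p].add(q)
        graph[q].add(p)
    return graph

def junctions(graph):
    """Nodes connected to three or more neighbours."""
    return {p for p, nbrs in graph.items() if len(nbrs) >= 3}

def polylines(graph):
    """Chains of segments that pass only through 2-connected nodes."""
    ends = {p for p, nbrs in graph.items() if len(nbrs) != 2}
    seen, chains = set(), []
    for start in ends:
        for nxt in graph[start]:
            if (start, nxt) in seen:
                continue  # this chain was already walked from its other end
            chain, prev, cur = [start, nxt], start, nxt
            seen.add((start, nxt))
            while len(graph[cur]) == 2:
                a, b = graph[cur]
                prev, cur = cur, (b if a == prev else a)
                chain.append(cur)
            seen.add((cur, prev))
            chains.append(chain)
    return chains

# A "T" shape: one horizontal line with a vertical branch at (2, 0)
segments = [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((2, 0), (3, 0)),
            ((2, 0), (2, 1)), ((2, 1), (2, 2))]
graph = build_graph(segments)
found_junctions = junctions(graph)
found_polylines = polylines(graph)
```

On the "T" shape the sketch finds one junction and three polylines that meet at it.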

Vector Graphics Indexing and Retrieval

[Figure: Doc 1, Doc 2, Doc 3 → Features Extraction → Retrieval, with a performance evaluation against the ground truths GT 1, GT 2, GT 3]

Should we work on the retrieval engine now? How would we evaluate the retrieval results afterwards? We must work on performance evaluation first

How to get the ground truth?

Producing ground truth from existing documents takes time, so we must produce synthetic documents

Our key idea

Producing true-life documents needs much knowledge and is hard to do with a computer

We can produce ‘crazy’ but well-formed documents, which is sufficient for performance evaluation purposes

Synthetic document production

[Figure: production rules combining 1-connected and 2-connected primitives (marked - and ++), with cardinalities 0-n and 1, leading to a ‘crazy’ but well-formed drawing]

Vector Graphics Indexing and Retrieval

Inputs: graphical objects and low-level primitives

General rules
• object number
• document size
• object choice: probability distribution, rotation and scale range, position constraints, overlapped or not
• …

Domain rules
• must be connected
• must be adjacent
• must be included
• can include
• …

Noise rules
• to scale a line
• to break a line
• to move a line
• …

[Figure: production process in three stages labelled I, II and III]

While:
(1) Insert a new object while the object number has not been reached
(2) Move other objects if (1) cannot be done
(3) Exit if (1) and (2) cannot be done, then run (4) and (5)
(4) Move objects according to the domain rules
(5) Delete the oldest isolated objects (‘cycle number’)
(6) Add noise to the low-level primitives composing the objects

Outputs: vector graphics and ground truth

In progress: rotate and scale, rotate and distort, scale and overlap
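A much-reduced Python sketch of this production loop, covering only the insertion step (1) and the noise step (6). Everything here is an assumption made for illustration: objects are plain axis-aligned boxes, the only general rule is "no overlap", and the noise rule moves each coordinate by a small random amount. The function names and parameter values are hypothetical.

```python
import random

def overlaps(a, b):
    """True if two boxes (x1, y1, x2, y2) share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def generate(n_objects, doc_size, rng, max_tries=1000, noise=2.0):
    """Place non-overlapping boxes at random; return (noisy drawing, ground truth)."""
    truth = []
    tries = 0
    while len(truth) < n_objects and tries < max_tries:
        tries += 1
        w, h = rng.uniform(20, 80), rng.uniform(20, 80)
        x, y = rng.uniform(0, doc_size - w), rng.uniform(0, doc_size - h)
        box = (x, y, x + w, y + h)
        if not any(overlaps(box, other) for other in truth):
            truth.append(box)  # general rule (assumed): objects must not overlap
    # noise rule (assumed): move each coordinate slightly; the ground truth stays clean
    drawing = [tuple(c + rng.uniform(-noise, noise) for c in box) for box in truth]
    return drawing, truth

drawing, truth = generate(5, 500, random.Random(0))
```

Because the ground truth is recorded before the noise is applied, each noisy document comes with its exact description at no extra cost.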

Vector Graphics Indexing and Retrieval

Works done
• Fast graph building from vector graphics
• Production of first synthetic documents

Works in progress …
• To produce more complex synthetic documents …
• To work on model selection …
• To work on index structuring …

About project dot-line
04/05 SCSIT Post-doc
02/06 IRCSET Application, A. Winstanley (NCG, Dublin University)
04/06 Eureka Meeting eConnector, HP Lab
06/06 ANVAR Application, informal agreement
11/06 EPEIRES contract
2007 To visit A. Winstanley (NCG, Dublin University); to make contact with M. Fonseca (IST, Lisbon University)
2008 JM Ogier plans to set up a European project

Plan

• Short CV
• Vector Graphics Indexing and Retrieval
• Dropcap Image Retrieval

Dropcap Image Retrieval

Which part and kind of graphics in old books?

Old books of the 15th and 16th centuries

[Figure: pages from Alciati (1511), Bartolomeo (1534) and Laurens (1621), with a figure, a dropcap and headlines marked]

CESR Database

Books      46
Pages      1385
Graphics   4755 (3.4 per page)

Foreground pixels [Jour’05]: 63% textual, 37% graphical

Graphics type: 41% dropcap, 59% others

[Figure: Old Graphics, number of graphics per book (books 1 to 46, 0 to 700 graphics)]

Dropcap Image Retrieval

What interests historians in these images?

(1) Wood plug tracking

[Figure: wood plug (bottom view). Plugs are exchanged and copied between printing houses, e.g. Vascosan 1555 and Marnef 1576; periods shown: 1497-1507, 1511-1542, 1531-1548, 1555-1578]

Retrieve similar printings

[Figure: Plug 1, Plug 2, Plug 3 and their printings (Printing 1, Printing 2)]

Real-time process or not? Why?

We cannot index all images for legal (ownership) reasons; a real-time process will allow queries with images provided by other digital libraries

[Figure: queries sent to the DB, results returned]

(2) User-driven historical metadata acquisition

[Figure: metadata files produced without retrieval and with retrieval]

With retrieval: faster, fewer errors

Dropcap Image Retrieval

What are the main difficulties?

• Noise
• Offset
• Complexity
• Accuracy
• Scalability: several hundred classes, several thousand images

Which descriptor to use?

[Figure: descriptors placed between fast and local on one side, complex and global on the other]

To scalar [Loncaric’98]: Hough, Radon, Zernike, Hu, Fourier. Scale and orientation invariant, fast, local (character, symbol, digit). Not adapted to our problem

To image [Gesu’99]: template matching, Hausdorff distance. Not scale or orientation invariant, slow, global (scene). More adapted but too complex

[Figure: Query → Compression → Centering and Comparison → Filtering → ranked results R1, R2, R3, computed against the Image Database]

Our key idea

To use a compressed representation of the images

We have started to work with our images, but the file formats are so different

Dropcap Image Retrieval

Compression → Centering and Comparison → Filtering

Why?

Digitization problems [Lawrence’00]
• Several image providers
• Several digitization tools
• Long process
• Human supervised
• Complex post-processing platform
• …

Dropcap Image Retrieval

Compression → Centering and Comparison → Filtering

Before working on the retrieval engine, historians need tools to improve the quality of their databases

Our key idea

To develop an engine (QUEID) working on image metadata to detect digitization problems and to secure the retrieval system

Diagnostic mode

[Figure: Base → QUEID (query, charts, analysis) → Expertise on Format]

(1) Software setting
(2) Image exchange
(3) Prototype software

Our database

Files         2038
Size          279.7 Mp
Model         gray
Formats       Tiff
Compression   Uncompressed
Resolutions   250 to 350

Filtering mode

[Figure: Base → QUEID Engine (Parameters) → accepted / rejected]

Dropcap Image Retrieval

Compression → Centering and Comparison → Filtering

Our key idea

To use a Run Length Encoding (RLE) of the image

Which kind of RLE?

[Figure: an image with its foreground RLE, background RLE and both]

RLE of both foreground and background seems more adapted

Compression results

$t_c = 1 - \frac{n_{run}}{n_{pixel}}$, with $n_{run} \in [1, n_{pixel}]$ and $t_c \in [0, 1[$

[Figure: compression rate per dropcap (dropcaps 1 to 2001, rate from 0.7 to 1); marked values 0.75, 0.88 and 0.95]
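A minimal Python sketch of a run-length encoding of both foreground and background, with the compression rate computed as one minus the number of runs over the number of pixels. The row-by-row encoding and the function names are assumptions made for illustration; the engine behind these slides may encode runs differently.

```python
def rle_row(row):
    """Encode one binary row as [value, length] runs (foreground and background)."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return runs

def compression_rate(image):
    """1 - n_run / n_pixel, computed over all rows of a binary image."""
    n_pixel = sum(len(row) for row in image)
    n_run = sum(len(rle_row(row)) for row in image)
    return 1 - n_run / n_pixel

image = [
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
rate = compression_rate(image)  # 7 runs for 24 pixels
```

The rate approaches 1 for images with long uniform runs and drops to 0 when every pixel differs from its neighbour.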

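Centering

Foreground histograms: $g_{1,2,\dots,k}$ and $h_{1,2,\dots,l}$

Offset: the shift $j$ minimising the distance $d_{x,y}$ between $g_i$ and $h_{i+j}$, summed over $i = 1 \dots k$

Comparison

[Figure: line (y) of image 1 and line (y+dy) of image 2, with run ends x1 and x2 compared against a reference position through an x stack]

while x2 is behind x1, handle image 2
while x1 is behind x2, handle image 1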
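The two steps can be sketched in Python as follows. This is an illustration under assumptions: the centering offset is taken as the shift minimising a sum of absolute differences between the two histograms (the slide's exact distance is not recoverable), and the comparison walks both run lists at once, always advancing the image whose current run ends first. The function names are hypothetical.

```python
def best_offset(g, h):
    """Shift j of the shorter histogram g inside h that minimises the L1 difference."""
    shifts = range(len(h) - len(g) + 1)
    return min(shifts, key=lambda j: sum(abs(a - b) for a, b in zip(g, h[j:])))

def diff_runs(runs1, runs2):
    """Count differing pixels between two rows given as (value, length) runs."""
    i = j = 0
    x1, x2 = runs1[0][1], runs2[0][1]  # end positions of the current runs
    pos = diff = 0
    while True:
        end = min(x1, x2)
        if runs1[i][0] != runs2[j][0]:
            diff += end - pos  # the whole shared stretch differs
        pos = end
        if x1 == end:  # image 1's run is finished: move to its next run
            i += 1
            if i == len(runs1):
                break
            x1 += runs1[i][1]
        if x2 == end:  # image 2's run is finished: move to its next run
            j += 1
            if j == len(runs2):
                break
            x2 += runs2[j][1]
    return diff

offset = best_offset([0, 3, 5, 3, 0], [0, 0, 0, 3, 5, 3, 0, 0])
# Rows 00001100 and 00111100 as runs
difference = diff_runs([(0, 4), (1, 2), (0, 2)], [(0, 2), (1, 4), (0, 2)])
```

The comparison cost depends on the number of runs rather than the number of pixels, which is the point of working on the compressed representation.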

Comparison

Time results: Raster vs RLE

[Figure: raster sizes per dropcap (dropcaps 1 to 2001), size in k.pixel]

Raster   Size (k.pixel)   Time (s)
Max      600.8            903.62
Mean     137.7            337.06
Min      7.74             176.67

[Figure: run sizes per dropcap (dropcaps 1 to 2001), size in k.run]

RLE      Size (k.run)     Time (s)
Max      87.8             137.06
Mean     15.5             41.68
Min      1.1              22.32

[Equation: query time $t_r$ obtained from the per-image comparison times $t_i$, $i = 1 \dots n$, between the query image and the image database]

Dropcap Image Retrieval

Compression → Centering and Comparison → Filtering

To solve the offset problems we must use a centering step before the comparison

We can do it easily by comparing foreground histograms

Mean query time of 40 s: how can we reduce it further without using lossy compression and losing accuracy?

$d = \frac{|u_1 - u_2| + |v_1 - v_2|}{\max(u_1, u_2) + \max(v_1, v_2)}$

Level 1: image sizes
Level 2: black, white pixels
Level 3: RLE comparison
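A minimal Python sketch of the level-based selection, assuming that each level compares a pair of scalars (u, v) with a normalised distance and that candidates above a threshold are dropped before the costly RLE comparison. The exact form of the distance, the threshold value and all names are assumptions made for illustration.

```python
def scalar_distance(a, b):
    """Normalised distance between two (u, v) pairs, in [0, 1] (assumed form)."""
    (u1, v1), (u2, v2) = a, b
    denom = max(u1, u2) + max(v1, v2)
    return (abs(u1 - u2) + abs(v1 - v2)) / denom if denom else 0.0

def select(query, database, threshold):
    """Keep candidates passing level 1 (sizes), then level 2 (black/white counts)."""
    kept = []
    for name, feats in database.items():
        if scalar_distance(query["size"], feats["size"]) > threshold:
            continue  # level 1: width and height too different
        if scalar_distance(query["density"], feats["density"]) > threshold:
            continue  # level 2: black and white pixel counts too different
        kept.append(name)  # only these go on to level 3, the RLE comparison
    return kept

query = {"size": (100, 100), "density": (4000, 6000)}
database = {
    "a": {"size": (102, 98), "density": (4100, 5896)},
    "b": {"size": (200, 100), "density": (8000, 12000)},
    "c": {"size": (100, 100), "density": (8000, 2000)},
}
selected = select(query, database, threshold=0.1)
```

The cheap levels run first, so most images are rejected on a few scalar comparisons and never reach the RLE comparison.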

Our first system

Dropcap Image Retrieval

How to process the distance curve?

Using a basic clustering algorithm: ‘elbow criteria’

[Figure: distance curve, distance (0 to 0.8) per dropcap (1 to 1993), with two successive measures labelled 1 and 2 and a cluster threshold]

if 1 - 2 < 0: push x, cluster
while 1 - 2 < 0: next
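One simple reading of an elbow criterion, sketched in Python: sort the distances and cut the curve at its largest jump, keeping what lies before it as the cluster of closest images. This is an assumption chosen for illustration, not necessarily the rule used on the slide, and the function name and sample values are hypothetical.

```python
def elbow_cut(distances):
    """Return the sorted distances lying before the largest jump in the curve."""
    curve = sorted(distances)
    if len(curve) < 2:
        return curve
    gaps = [b - a for a, b in zip(curve, curve[1:])]
    cut = gaps.index(max(gaps)) + 1  # position just after the largest jump
    return curve[:cut]

# Made-up distances: three close matches, then clearly different images
cluster = elbow_cut([0.5, 0.05, 0.45, 0.08, 0.52, 0.07])
```

No fixed distance threshold is needed: the cut adapts to each query's own distance curve.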

Our key idea

To use a system approach with different levels of operators (from the fastest to the most accurate) to select the images to compare

[Figure: the query goes through the 1st level, then the 2nd level; axes: speed and depth]

Selection results

[Figure: selection (%) per dropcap (dropcaps 1 to 1941, 0% to 100%), for sizes and densities]

Selection %
Max    59%
Mean   24%
Min    4%

From 4% to 59%: how to reduce the variability? Working on a better selection criterion seems ambiguous

Dropcap Image Retrieval

Our key idea

To add an intermediate operator between scalar and image data

[Figure: Base and HMI connected to the retrieve engine (control, display, retrieve); driven labelling produces Labels and the benchmarks Bench1, Bench2, … to produce]

Example of query result

[Figure: query dropcap followed by ranked results with distances 0.1947, 0.2517, 0.3485, 0.3616, 0.3819, 0.4064, 0.4109, 0.4209; results are marked "same plug", then "next plug"]

First results seem good, but how can we get the ground truth and evaluate our system?

Dropcap Image Retrieval

Our key idea

To use our engine to produce a benchmark database

Works done
• QUEID to filter and analyse the image database
• Comparison speed-up using two features: RLE compression and system approach

Works in progress …
• To add operators to improve the system
• To extend our system to produce a benchmark database

About project dot-line
09/05 MADONNE Post-doc
06/06 1st CESR Technical Meeting
09/06 ANAGRAM Workshop (Fribourg)
10/06 2nd CESR Technical Meeting
10/06 NaviDoMass agreement
2007 GDR-JC Project (LMA, LI, CreSTIC, LITIS, CVC)
To put the system online on the CESR website
Old graphics working group (Glasgow, Tours …)

Dropcap Image Retrieval

Bibliography

1. J. Mong and D. Brailsford. Using SVG as the rendering model for structured and graphically complex web material. In Symposium on Document Engineering (DocEng), pages 88-91, 2003.

2. Y. Chen, J. Gong, W. Jia, and Q. Zhang. XML-based spatial data interoperability on the internet. In Conference of International Society for Photogrammetry and Remote Sensing and Spatial Information Sciences (ISPRS), pages 167-201, 2004.

3. J. Kang, B. Lho, J. Kim, and Y. Kim. XML-based vector graphics: Application for web-based design automation. In International Conference on Computing in Civil and Building Engineering (ICCCBE), pages 170-178, 2004.

4. M. Weindorf. Structure based interpretation of unstructured vector maps. In Workshop on Graphics Recognition (GREC), volume 2390 of Lecture Notes in Computer Science (LNCS), pages 190-199, 2002.

5. N. Journet, R. Mullot, J. Ramel, and V. Eglin. Ancient printed documents indexation: a new approach. In International Conference on Advances in Pattern Recognition (ICAPR), volume 3686 of Lecture Notes in Computer Science (LNCS), pages 513-522, 2005.

6. V. D. Gesu and V. Starovoitov. Distance based function for image comparison. Pattern Recognition Letters (PRL), 20(2):207-214, 1999.

7. S. Loncaric. A survey of shape analysis techniques. Pattern Recognition (PR), 31(8):983-1001, 1998.

8. G. Lawrence et al. Risk management of digital information: A file format investigation. RLG DigiNews, 8(4), 2000.