
Document Analysis: Introduction

Prof. Rolf Ingold, University of Fribourg

Master course, spring semester 2008

© Prof. Rolf Ingold

Outline

- Introduction: definition and aims
- Applications overview
- Methodologies
- Possibilities & limits
- Experience of the DIVA research group
- Course content and structure


What is a document?

Data = abstract binary representation of any kind of information to be stored, transmitted or processed by computers

Information = data associated with an implicit or explicit interpretation

Document = piece of information that can be perceived and interpreted by humans

To be perceived, documents have to be rendered
- displayed
- projected on screens
- printed
- played on speakers
- …


Taxonomy of documents

Documents may be
- Synthetic (structured) or captured (unstructured)
- Static (non temporal, printable) or dynamic (temporal)
- Viewable, audible or tactile

[Figure: taxonomy grid crossing synthetic vs. captured data with static vs. dynamic documents; entries: graphics, text (printed), images, off-line handwriting, on-line handwriting, animation, video, audio, speech (synthetic)]


What is document analysis?

Document analysis aims at extracting symbolic information
- text (words, expressions, continuous text)
- graphics (vector graphics, shapes, symbols)
- layout structures
- logical structures
- numeric data
- writer / speaker identities, ...

from different captured sources
- images (scanned, camera based, synthesized)
- video
- on-line handwriting
- sound


Importance of document structures

Document = Content + Structures

Structures convey abstract high level information

They are revealed by styles


Structural document analysis

<DOCUMENT id=… >
  <BODY>
    <TITLE>A Master / Slave Monitor … Network</TITLE>
    <AUTHORS>
      <AUTH>D.Jacobson</AUTH>
      <AUTH>M. Shafiq</AUTH>
    </AUTHORS>
    <ABSTRACT><P>M…

Document analysis = image analysis of static documents to extract content and structures

Document analysis is applicable to
- captured images (from scanner, camera)
- synthetic images of electronic documents, available in unstructured or poorly structured form


Analysis of Electronic Documents

- Most electronic documents are unstructured or poorly structured
- Document understanding can be seen as a reverse-engineering task using a fixed-layout document format (such as PDF or XPS) as a pivot format



Visual Audio Processing Chain

Visual Audio aims at recovering sound from old records by image analysis


Usefulness of document analysis

Extracting information from captured documents is useful in different contexts
- to avoid cumbersome keyboarding
- to capture information remotely
- to study the document’s content
- to categorize, classify and index digitized documents
  - for digital libraries
  - culture preservation
- to reuse document chunks
- to reedit and restyle an existing document
- to extract information for integrated applications
  - office automation
  - database management
  - information systems
- to perform multimodal alignment


Typical applications of document analysis

Commercial products are available for
- Text reading (OCR products)
- Office automation (mail reading and dispatching)
- Form processing (for dedicated applications)

More specialized products
- Postal address reading
- Check reading and processing


Form processing

Performance of form processing depends
- on form complexity
- on form variability

Fields are located easily
- if their positions are fixed
- when using different colors

Content recognition is hard for several reasons
- degraded images
- approximate positioning of symbols
- variability of handwriting


Check reading

Check reading can be automated at >90%
- difficulties: textured background, variability of writing
- facilitating factors: fixed vocabulary, redundancy (legal & courtesy amount), availability of contextual information (client database)

[Figure: sample check with labeled fields: legal amount, payee name, MICR, date, courtesy amount, signature; from <www.a2ia.com>]
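The redundancy between the legal amount (in words) and the courtesy amount (in digits) can be exploited by keeping only values on which both recognizers agree. The following is a minimal sketch under stated assumptions: the function name `reconcile_amounts`, the candidate dictionaries and the confidence values are hypothetical, and real check readers are likely to combine the two sources in more elaborate ways.

```python
def reconcile_amounts(courtesy_candidates, legal_candidates):
    """Pick the amount supported by both recognizers (hypothetical helper).

    Each argument maps a candidate amount (in cents) to a recognizer
    confidence in [0, 1]. Returns the shared amount with the highest
    combined confidence, or None if the two candidate lists are disjoint.
    """
    shared = set(courtesy_candidates) & set(legal_candidates)
    if not shared:
        return None  # no agreement: the check would go to manual keying
    return max(shared,
               key=lambda amount: courtesy_candidates[amount] * legal_candidates[amount])


# Illustrative values only: the digit recognizer hesitates between 1 and 7.
courtesy = {12500: 0.6, 72500: 0.4}
legal = {72500: 0.7, 92500: 0.3}
agreed = reconcile_amounts(courtesy, legal)
```

In this example the courtesy amount alone would favor 12500, but only 72500 is confirmed by the legal amount, which is the kind of correction that redundancy makes possible.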


Table of contents recognition

Aim: to extract information from the table of contents (TOC)
- to index journals
- to associate titles and authors with page numbers

Advantages
- Very precise goal
- Regular layout for a given journal

Difficulties
- Complex layout
- Great variability when considering journals universally


Analysis of historical documents

Aim: to extract information to index historical documents

Challenges
- degradations
- irregular layout
- rich typography, ornaments
- old scripts (no OCR)

Possible approach
- word spotting


Logical & physical document structures

Logical document structures
- Reflect the author’s point of view
- Independent of presentation
- Composed of application dependent logical entities
  - Chapters, sections
- Specific to the application and document class

Physical document structures
- Reflect the editor’s point of view
- Composed of a hierarchy of physical entities
  - Text blocks, text lines and tokens
  - Graphical primitives
- Universal and independent of the document class


Document processing cycle

[Figure: cycle linking logical document, physical document, paper document and document image through formatting, printing, rendering, digitizing, and analysis and recognition]

Document analysis can be considered as the reverse of formatting


Relation between logical and physical structure

[Figure: logical structure and physical structure, linked by formatting (with styles) in one direction and analysis in the other; further labels: edit, print, display]

Document formatting is straightforward ...

But document analysis is a nontrivial task that generally cannot be fully automated


Processing chain

[Figure: processing chain from image to structured document; stages: preprocessing, segmentation, OCR, OFR, layout analysis, post-analysis, document understanding; intermediate results: blocks, simple text, fonts]


Pre-processing

Pre-processing aims at preparing the document image for further analysis; it includes
- Brightness / contrast enhancement
- Noise removal
- Skew / aberration correction
- Binarization / color clustering
- Shape smoothing
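As an illustration of the binarization step, the sketch below uses Otsu's global thresholding, which is one common choice and not necessarily the method taught in this course. It assumes an 8-bit grayscale image stored as a list of rows; the function names are chosen for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # no bright pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map dark pixels (<= threshold) to 1 (ink) and bright pixels to 0 (background)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny synthetic image: dark ink (10) on a bright background (200).
image = [[10, 200], [200, 10]]
threshold = otsu_threshold([p for row in image for p in row])
binary = binarize(image, threshold)
```

A global threshold of this kind tends to fail on textured or unevenly lit backgrounds, where locally adaptive methods are usually preferred.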


Segmentation

Document segmentation aims at splitting the image into regions of interest; it includes
- Page segmentation into blocks
- Text, graphics and images separation
- Hairlines and frames detection
- Text block segmentation into text lines, words and characters
- In form processing, field separation
- Graphics segmentation into vectors and symbols
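Text block segmentation into text lines can be sketched with a horizontal projection profile: count the ink pixels in each row and cut wherever the count drops to zero. This is a minimal sketch, assuming a deskewed binary image (1 = ink, 0 = background) with clear white gaps between lines; the function name is chosen for this example.

```python
def segment_lines(binary):
    """Return inclusive (top, bottom) row ranges of consecutive rows containing ink."""
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom border
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
text_lines = segment_lines(page)
```

Applying the same idea to the columns of each line (a vertical projection) gives a first approximation of word and character boundaries.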


Optical Character Recognition (OCR)

OCR aims at extracting character codes (ASCII) from text images

OCR was one of the earliest computer vision applications
- Early patents were filed in the 1910s, 30 years before the computer age!

OCR deals with many situations
- Isolated characters vs. complete words or phrases
- Different character classes (digits, uppercase letters, full text, …)
- Restricted or open vocabulary
- Machine printed vs. handwritten text
- Different languages (with various diacritics) and different scripts (Latin, Greek, Hebrew, Arabic, Farsi, various Asian scripts, …)
- Imperfect image quality (low resolution, textured background, distortions, noise, …)


Text recognition related problems

Text analysis must also consider other aspects

In case of printed text
- Font recognition (family, size and style)
- Font categorization (with/without serifs, fixed vs. proportional font)

In case of handwritten text
- Writer identification or verification
- Writer classification


Layout analysis

Layout analysis aims at extracting physical structures of documents; it consists of
- locating, delimiting and identifying
  - text blocks
  - graphics
  - tables
  - formulas
  - handwritten text fields
  - annotations
- associating figures and captions
- locating and delimiting headers and footers
- recovering the reading order (of multicolumn documents)
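Recovering the reading order of a multicolumn page can be sketched as follows: group blocks into columns by horizontal overlap, read columns from left to right, and read each column from top to bottom. This is a simplified sketch, assuming a Manhattan layout with no block spanning several columns; the `Block` class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def reading_order(blocks):
    """Order blocks column by column (left to right), then top to bottom."""
    columns = []  # each column: {"right": right edge, "blocks": [...]}
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and block.x < columns[-1]["right"]:
            # Overlaps the current column horizontally: same column.
            columns[-1]["blocks"].append(block)
            columns[-1]["right"] = max(columns[-1]["right"], block.x + block.width)
        else:
            columns.append({"right": block.x + block.width, "blocks": [block]})
    ordered = []
    for column in columns:
        ordered.extend(sorted(column["blocks"], key=lambda b: b.y))
    return ordered


# Two-column page given in arbitrary order.
page_blocks = [
    Block("D", 120, 60, 100, 50),
    Block("B", 0, 60, 100, 50),
    Block("C", 120, 0, 100, 50),
    Block("A", 0, 0, 100, 50),
]
order = [b.name for b in reading_order(page_blocks)]
```

Titles or figures that span several columns break the overlap heuristic, which is one reason reading order recovery is listed as a task of its own.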


Example: layout modeling of scientific journals


Optical Font Recognition (OFR)

OFR aims at identifying the fonts used

OFR is useful
- for improving OCR accuracy, by using dedicated classifiers to distinguish “O” and “0”, “I” and “1”, …
- for assigning logical labels, for logical structure recognition

Two strategies may be applied for OFR
- A priori OFR (without considering the content)
- A posteriori OFR (when the content is supposed to be known)


Document structure recognition

Document structure recognition (also referred to as document understanding) is the first step towards document interpretation

Document understanding deals with
- Logical labeling
- Logical structure recognition

Two levels of granularity are considered
- macro-structure analysis: labeling paragraphs / blocks
- micro-structure analysis: labeling words / strings

Document structure recognition is still considered an open issue
- There is no universal approach
- Solutions exist for dedicated document classes (museum notices, checks, tables of contents, scientific papers, newspapers, …)


Two Levels of Structural Document Analysis

Physical structure analysis (also layout analysis)
- to locate and identify text blocks, graphics, tables, formulas, handwritten text fields, annotations, …
- to recover the reading order

Logical structure analysis (also document understanding)
- to assign a hierarchy of logical labels
- first step towards interpretation


Use Case: Intelligent Newspaper Indexing

Full text indexing is not adequate for complex documents

The following items have to be identified
- headlines
- editorial
- articles (with title, author & function, summary, content, links, ...)
- captions (associated with images)
- readers’ letters
- advertisements
- ...


Use case: Understanding Museum Notices

[Figure: three museum notices annotated with the logical labels Group Vedette; Area Title (Principal Title, End of the title); Area Address / Date (Address, Date); Area Collection; Group Cote. From A. Belaïd, LORIA-CNRS Nancy]


Possibilities and limits of DA

Layout analysis is considered as almost solved for printed documents
- It can be achieved generically
- Problems remain for textured backgrounds and degraded documents (historical & handwritten documents)

Document understanding is much less mature
- Solutions are application dependent
- Application of specific knowledge is needed (document models)


Need for Document Recognition Models

There is no universal approach!

Document recognition systems must be tuned
- for specific applications
- for specific document classes

Contextual information is required

Models provide information like
- generic document structures (DTD or XML-schema)
- geometrical and typographical attributes (style information)
- semantic information (keywords, dictionaries, databases, ...)
- statistical information


Content of document models

Generic structure
- Document Type Definition (DTD) or XML-schema

Style information
- Absolute or relative positioning
- Typographical attributes & formatting rules

Semantics (if available)
- Linguistic information, keywords
- Application specific ontology

Probabilistic information
- Frequencies of items or sequences, co-occurrences
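The probabilistic part of a document model can be illustrated by estimating, from labeled training documents, how often one logical label follows another. The sketch below is a hypothetical example: the label names and the function `label_bigram_probabilities` are assumptions, and the estimate is a plain relative frequency without smoothing.

```python
from collections import Counter


def label_bigram_probabilities(sequences):
    """Estimate P(next label | current label) from sequences of logical labels."""
    pair_counts = Counter()
    first_counts = Counter()
    for labels in sequences:
        for current, following in zip(labels, labels[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Illustrative ground truth: label sequences of two documents.
training = [
    ["title", "authors", "abstract"],
    ["title", "abstract"],
]
probabilities = label_bigram_probabilities(training)
```

Such frequencies can help rank competing labelings of a block during logical labeling, provided enough ground-truthed documents are available.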


Trouble with document models

Document models are hard to produce and to maintain
- implicit models (hard coded in the application) => hard to modify, adapt, extend
- explicit models, written in a formal language => cumbersome to produce, needs high expertise
- abstract models, learned automatically => needs a lot of training data (with ground-truth!)

Need for more flexible tools:
- assisted environments with friendly user interfaces
- recognition improving with use
- models are learned incrementally


Pattern Based Document Understanding (2-CREM) [Robaday 03]

Configurations consist of
- Set of vertices
  - Labeled (type)
  - Attributed (pos, typo, ...)
- Edges between vertices
  - Labeled (neighborhood relation)
  - Attributed (geom, ...)

Model consists of
- Extraction rules
- For each class
  - Attribute selector
  - List of patterns

[Figure: data flow between document image, extraction, configuration, classification and model (rules, patterns, selector); further label: id]
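A configuration as described above is a labeled, attributed graph. The sketch below shows one possible in-memory representation; it is not the actual 2-CREM implementation, and all class names, fields and attribute values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                       # type of the physical entity
    attributes: dict = field(default_factory=dict)   # e.g. pos, typo


@dataclass
class Edge:
    source: int                                      # index of the source vertex
    target: int                                      # index of the target vertex
    label: str                                       # neighborhood relation
    attributes: dict = field(default_factory=dict)   # e.g. geom


@dataclass
class Configuration:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_vertex(self, label, **attributes):
        self.vertices.append(Vertex(label, attributes))
        return len(self.vertices) - 1

    def add_edge(self, source, target, label, **attributes):
        self.edges.append(Edge(source, target, label, attributes))

    def neighbors(self, index, label):
        """Indices of vertices reached from `index` by edges with the given label."""
        return [e.target for e in self.edges
                if e.source == index and e.label == label]


# Hypothetical configuration: a bold block located above an italic block.
config = Configuration()
title = config.add_vertex("text_block", pos=(50, 40), typo="bold 14pt")
author = config.add_vertex("text_block", pos=(50, 80), typo="italic 10pt")
config.add_edge(title, author, "above", geom={"distance": 40})
```

A classifier could then compare such a configuration against the list of patterns stored for each class, using the attributes picked by that class's selector.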


Performance evaluation

Performance evaluation is an important issue
- to compare algorithms
- to estimate correction costs of real applications

Ground-truthed databases are required
- cost reduction by document analysis tools (bootstrap)
- synthetic data as alternative
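One common way to compare recognizers against ground truth, and to estimate correction costs, is the character error rate: the edit distance between the recognized text and the reference, divided by the reference length. The sketch below is one such measure among several; it assumes a non-empty reference string, and the function names are chosen for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (reference must be non-empty)."""
    return edit_distance(reference, hypothesis) / len(reference)


# Typical OCR confusion: "m" recognized as "rn".
cer = character_error_rate("Document", "Docurnent")
```

The edit distance also approximates the number of keystrokes an operator needs to correct the output, which links the measure to correction costs.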


List of Lessons

1. Introduction to document analysis and recognition

2. Document image processing

3. Fundamentals of pattern recognition I

4. Fundamentals of pattern recognition II

5. Printed text recognition

6. Font recognition

7. Layout analysis and segmentation

8. Logical structure analysis

9. Graphics recognition

10. Handwriting recognition

11. Reverse engineering of documents

12. Multimodal applications


Conclusion on document analysis

Document analysis is useful for many applications
- Commercial systems solve some of them

Advanced document analysis prototypes are developed in many research labs around the world

No universal document analysis system is on the way

User assisted approaches may be a good trade-off for midsize applications

Structural document analysis will not disappear with exclusive electronic document handling (paperless office)


Organization of the course

Professor: Rolf Ingold, <[email protected]>, Pérolles-2, B421, 026 300 84 66

A: Jean-Luc Bloechle, <[email protected]>, Pérolles-2, B440, 026 300 92 94

Course: Tuesday, 09:15-10:00 & 10:15-11:00

Exercise: Wednesday, 11:15-12:00
- requirements: 2/3 of series returned, 1/2 considered satisfactory

Homework: estimated at 4-6 hours a week

Website: http://diuf.unifr.ch/diva/web/

Examination:
- oral, 20 minutes (alternatively written, 120 min)
- after spring semester (June 2008) or summer (August-September 2008)

Credits: 5 ECTS