Duplicate detection for quality assurance of document image collections

Reinhold Huber-Mörk (1), Alexander Schindler (1,2), Sven Schlarb (3)

(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology

(2) Department of Software Technology and Interactive Systems, Vienna University of Technology

(3) Department for Research and Development, Austrian National Library


Reinhold Huber-Mörk (AIT Austrian Institute of Technology) presented a computer vision based method for quality assurance of scanned content at iPRES 2012 in Toronto. In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, pp. 136-143. ISBN 978-0-9917997-0-1

Transcript of Duplicate detection for quality assurance of document image collections

Page 1: Duplicate detection for quality assurance of document image collections


Page 2: Duplicate detection for quality assurance of document image collections

Overview

Digital preservation & quality assurance

Digital image preservation workflows

Image duplicate detection

Keypoints and feature descriptors in Computer Vision

Bag of visual words

Results on a real-world data set

2 22.11.2012

Page 3: Duplicate detection for quality assurance of document image collections

SCAPE project and quality assurance

SCAlable Preservation Environments, EU FP7

Preservation Components:

- improve and extend existing tools,

- develop new ones where necessary,

- apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation


Page 4: Duplicate detection for quality assurance of document image collections

Quality assurance in image preservation

Comparison of image content

- automatic image processing workflows (e.g. format conversion)

- reacquisition of images

Duplicate detection

- within a single collection (filtering)

- between collections (merging, comparison)

Solutions:

- page segmentation + OCR

- feature based approaches


Page 5: Duplicate detection for quality assurance of document image collections

Book scan sequence with duplicates


Page 6: Duplicate detection for quality assurance of document image collections

Duplicate detection workflow


Page 7: Duplicate detection for quality assurance of document image collections

Keypoint detection and description (1)

Keypoints are detected at salient image regions

A keypoint is described by a descriptor (= vector of features)

Scale-Invariant Feature Transform - SIFT (Lowe, 2004)


[Figure: two plots, x-axis 0 to 120, y-axis 0 to 0.2]

Page 8: Duplicate detection for quality assurance of document image collections

Keypoint detection and description (2)

Invariance w.r.t. color/tone transformation

Invariance w.r.t. rotation, scaling or translation
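The tone invariance claimed here can be illustrated with a toy descriptor. The sketch below is not SIFT; it is a simplified, hypothetical gradient-orientation histogram that borrows two ideas SIFT also relies on: image gradients cancel a constant brightness offset, and normalising the vector to unit length cancels a global contrast factor. The patch values and the function name are made up for illustration.

```python
import math

def orientation_histogram(patch, bins=8):
    """Unit-length histogram of gradient orientations, weighted by gradient magnitude."""
    hist = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences: a constant brightness offset cancels out here.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += magnitude
    # Normalising to unit length removes a global contrast factor.
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

# Synthetic 8x8 grey-value patch and a tone-transformed copy (contrast x2, brightness +30).
patch = [[(3 * x + 5 * y + x * y) % 17 for x in range(8)] for y in range(8)]
brighter = [[2.0 * v + 30.0 for v in row] for row in patch]

h1 = orientation_histogram(patch)
h2 = orientation_histogram(brighter)
max_difference = max(abs(a - b) for a, b in zip(h1, h2))
print(max_difference)  # practically zero: the descriptor ignores the tone change
```

Invariance to rotation and scale needs the additional steps of the real detector (scale-space extrema, dominant orientation), which this sketch leaves out.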


[Figure: four plots, x-axis 0 to 120, y-axis 0 to 0.2]

Page 9: Duplicate detection for quality assurance of document image collections

Keypoint detection and description (3)

All detections (ordered by scale)


Page 10: Duplicate detection for quality assurance of document image collections

Duplicate detection workflow


Page 11: Duplicate detection for quality assurance of document image collections

Bag of visual words (1)

Bag of words model in text information retrieval:

Document 1: “Peter likes to read books. Paul likes too”

Document 2: “Peter also likes to read poems”

Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]

Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]

Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]

Visual analogy: bag of visual words or bag of features

Document | Image

Document made of words | Image made of descriptors

Bag of words | Bag of clustered descriptors = visual words

Word occurrence histogram | Visual word histogram / “fingerprint”
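The text example on this slide can be reproduced in a few lines. This is a minimal sketch: the tokenizer (a simple word regex, case-sensitive) and the function names are assumptions chosen for illustration, and the vocabulary is kept in order of first appearance so that it matches the bag shown on the slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into words, dropping punctuation (case-sensitive)."""
    return re.findall(r"\w+", text)

def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def histogram(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

bag = build_vocabulary([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)

print(bag)    # ['Peter', 'likes', 'to', 'read', 'books', 'Paul', 'too', 'also', 'poems']
print(hist1)  # [1, 2, 1, 1, 1, 1, 1, 0, 0]
print(hist2)  # [1, 1, 1, 1, 0, 0, 0, 1, 1]
```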


Page 12: Duplicate detection for quality assurance of document image collections


Bag of visual words (2)

Page 13: Duplicate detection for quality assurance of document image collections

Bag of visual words (3)

[Figure: panels labelled Visual word #104, Visual word #15, Visual word #221, Visual word #312, Visual word #424, Visual word #250]
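How descriptors turn into a visual word histogram can be sketched as follows. The sketch assumes a codebook of cluster centres is already available (for instance from clustering the descriptors of many pages); the 2-D vectors, the codebook and the function names are made up for illustration, whereas real SIFT descriptors have 128 dimensions.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the cluster centre (visual word) closest to the descriptor."""
    distances = [math.dist(descriptor, centre) for centre in codebook]
    return distances.index(min(distances))

def visual_word_histogram(descriptors, codebook):
    """Normalised count of visual words in one image: the image 'fingerprint'."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Hypothetical codebook with three visual words and four descriptors from one page.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

fingerprint = visual_word_histogram(descriptors, codebook)
print(fingerprint)  # [0.25, 0.5, 0.25]
```

Normalising by the number of descriptors keeps fingerprints comparable between pages with many and few keypoints.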


Page 14: Duplicate detection for quality assurance of document image collections

Duplicate detection workflow


Page 15: Duplicate detection for quality assurance of document image collections

Image comparison / duplicate detection schemes

Comparison of visual histograms – tf (“term frequency”) score

Inverse document frequency – idf

Spatial verification – sv (detailed image comparison)
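The difference between the tf and tf/idf scores can be sketched with the textbook formulation from text retrieval. The transcript does not give the formulas used in the presentation, so the idf weight log(N / n_w) and the cosine similarity below are assumptions, and the three page histograms are made up.

```python
import math

def idf_weights(histograms):
    """idf per visual word: log(number of images / number of images containing the word)."""
    n_images = len(histograms)
    weights = []
    for w in range(len(histograms[0])):
        containing = sum(1 for h in histograms if h[w] > 0)
        weights.append(math.log(n_images / containing) if containing else 0.0)
    return weights

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tfidf_similarity(h1, h2, weights):
    """Cosine similarity after weighting each visual word by its idf."""
    return cosine([a * w for a, w in zip(h1, weights)],
                  [b * w for b, w in zip(h2, weights)])

pages = [
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.3, 0.2, 0.0],  # same visual words as page 0: duplicate candidate
    [0.5, 0.0, 0.0, 0.5],  # shares only visual word 0, which occurs on every page
]
weights = idf_weights(pages)

tf_score = cosine(pages[0], pages[2])
tfidf_score = tfidf_similarity(pages[0], pages[2], weights)
duplicate_score = tfidf_similarity(pages[0], pages[1], weights)
print(tf_score, tfidf_score, duplicate_score)
```

In this toy data the plain tf score between pages 0 and 2 is well above zero only because of a visual word that every page contains; the idf weight of that word is log(3/3) = 0, so the tf/idf score drops to zero while the true duplicate pair keeps a score of about 1.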


[Figure: three histograms, x-axis 0 to 500, y-axis scaled by 10^-3]

Page 16: Duplicate detection for quality assurance of document image collections

Spatial verification (1)

Bag of visual words maintains no (or limited) spatial information

Spatial verification:

1. Ranking of most similar images in a shortlist

2. Direct matching of descriptors for pairs of images

3. Overlaying of images

4. Estimation of similarity
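Step 2, direct matching of descriptors for a pair of images, can be sketched as a nearest-neighbour search. The ratio test used to reject ambiguous matches follows Lowe (2004); whether the presented system uses it is an assumption, and the 2-D descriptors are made up for illustration.

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    A match is kept only if the best distance is clearly smaller than the
    second-best distance (ratio test); ambiguous matches are dropped.
    """
    matches = []
    for i, d in enumerate(desc_a):
        ranked = sorted((math.dist(d, e), j) for j, e in enumerate(desc_b))
        if len(ranked) >= 2 and ranked[0][0] < ratio * ranked[1][0]:
            matches.append((i, ranked[0][1]))
    return matches

# Hypothetical descriptors of two pages from the shortlist.
page_a = [(0.0, 0.0), (5.0, 5.0), (2.6, 2.45)]
page_b = [(5.1, 4.9), (0.1, 0.0), (2.0, 8.0)]

matches = match_descriptors(page_a, page_b)
print(matches)  # [(0, 1), (1, 0)]
```

The third descriptor of page_a lies halfway between two candidates in page_b, so it is rejected as ambiguous. The keypoint positions of the surviving matches would then feed the estimation of the transformation used for overlaying the images.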


Page 17: Duplicate detection for quality assurance of document image collections

Spatial verification (2)


[Figure: spatial verification steps for one image pair]

Pair of possible duplicates

Descriptor matching

Estimation of affine transformation

Image overlay

Similarity estimation

Similarity measure: mean structural similarity (MSSIM)
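The final similarity estimation can be sketched with a simplified MSSIM. The SSIM formula and the constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 follow the standard definition; the simplifications are assumptions made for brevity: non-overlapping square windows with uniform weights instead of Gaussian-weighted sliding windows, and two already overlaid grey-value images of equal size given as nested lists.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally sized lists of grey values."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2))

def mssim(img_a, img_b, window=4):
    """Mean SSIM over non-overlapping window x window blocks."""
    rows, cols = len(img_a), len(img_a[0])
    scores = []
    for top in range(0, rows - window + 1, window):
        for left in range(0, cols - window + 1, window):
            block = [(r, c) for r in range(top, top + window)
                     for c in range(left, left + window)]
            scores.append(ssim_window([img_a[r][c] for r, c in block],
                                      [img_b[r][c] for r, c in block]))
    return sum(scores) / len(scores)

# Synthetic 8x8 test image, an identical copy, and an inverted copy.
image = [[(37 * x + 91 * y) % 256 for x in range(8)] for y in range(8)]
inverted = [[255 - v for v in row] for row in image]

same_score = mssim(image, image)
inverted_score = mssim(image, inverted)
print(same_score, inverted_score)
```

Identical images score 1, while the inverted copy scores far lower because its local structure is anti-correlated with the original.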

Page 18: Duplicate detection for quality assurance of document image collections

Duplicate detection (1)

Pairwise comparison for a collection of N pages

[Figure: max(Da) plotted against image index a (0 to 500), y-axis 0 to 1]
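The quantity plotted over the image index can be sketched as follows. The transcript does not define Da; the sketch assumes Da is the set of similarity scores between page a and every other page of the collection, so max(Da) is the score of the best matching other page. Histogram intersection is used as a stand-in similarity, and the page histograms are made up.

```python
def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def max_similarity_per_page(histograms, similarity):
    """For every page a, the highest similarity to any other page (assumed max(Da))."""
    result = []
    for a, hist_a in enumerate(histograms):
        scores = [similarity(hist_a, hist_b)
                  for b, hist_b in enumerate(histograms) if b != a]
        result.append(max(scores))
    return result

# Hypothetical collection of N = 4 pages; pages 0 and 1 are duplicates.
pages = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 0.8],
    [0.1, 0.0, 0.9],
]
max_d = max_similarity_per_page(pages, histogram_intersection)
print(max_d)
```

The loop performs N(N-1) comparisons, which is why a cheap histogram score is computed for all pairs while more expensive checks are reserved for a shortlist.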

Page 19: Duplicate detection for quality assurance of document image collections

Duplicate detection (2)

Robust outlier detection


[Figure: max(Da) plotted against image index a (0 to 500), y-axis 0 to 1; annotated index ranges: a=12..15, a=22..25, a=106,107, a=108,109, a=188..197, a=198..207]
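One common way to make such outlier detection robust is a threshold built from the median and the median absolute deviation (MAD), which are barely affected by the outliers themselves. The transcript does not say which robust estimator the authors use, so this rule, the factor k and the score values are assumptions for illustration only.

```python
import statistics

def robust_outliers(scores, k=3.0):
    """Indices of scores above median + k * scaled MAD, plus the threshold used."""
    median = statistics.median(scores)
    mad = statistics.median([abs(s - median) for s in scores])
    # 1.4826 scales the MAD to the standard deviation of normally distributed data.
    threshold = median + k * 1.4826 * mad
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return flagged, threshold

# Hypothetical max-similarity scores for ten pages; pages 4 and 7 look like duplicates.
scores = [0.20, 0.22, 0.19, 0.21, 0.95, 0.23, 0.18, 0.97, 0.20, 0.21]
flagged, threshold = robust_outliers(scores)
print(flagged)  # [4, 7]
```

A threshold based on mean and standard deviation would be pulled upwards by the duplicates, which is the situation a robust estimate avoids.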

Page 20: Duplicate detection for quality assurance of document image collections

Comparison of duplicate detection schemes


a) Visual histogram comparison - tf

b) tf and inv. document frequency - tf/idf

c) tf and spatial verification – tf/sv

Page 21: Duplicate detection for quality assurance of document image collections

Results

Manual vs. automatic detection

59 books, 34805 pages

53 books correctly processed

53/59 ≈ 90% correct

69 of 75 duplicate runs detected

69/75 ≈ 92% correct

Missing detections due to heavily mixed content


Page 22: Duplicate detection for quality assurance of document image collections

Conclusion and outlook

Workflows for duplicate detection for complex documents

Keypoint detection and description = purely image based

Bag of visual words provides fast matching

Spatial verification applied to shortlist

Robust thresholding scheme for duplicate identification

Evaluation at Austrian National Library

Integration on SCAPE platform for scalable preservation


Page 23: Duplicate detection for quality assurance of document image collections

AIT Austrian Institute of Technology your ingenious partner [email protected]