Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Impact Analysis of OCR Quality on Research Tasks in Digital Archives Myriam C. Traub, Jacco van Ossenbruggen, Lynda Hardman Centrum Wiskunde & Informatica, Amsterdam

Transcript of Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Page 1: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Myriam C. Traub, Jacco van Ossenbruggen, Lynda Hardman

Centrum Wiskunde & Informatica, Amsterdam

Page 2: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Context

✤ Research in collaboration with the National Library of the Netherlands (KB)

✤ Digital newspaper archive:

✤ 10 million pages covering 1618 to 1995

✤ approx. 1,200 newspaper titles

✤ Available data: scanned image of the page, OCRed text and metadata records


Page 3: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Interviews

✤ Aim:

✤ Find out what types of research tasks scholars perform on digital archives

✤ Identify which quantitative / distant reading tasks are not (sufficiently) supported

✤ Participants: scholars with experience in performing historical research on digital archives


Page 4: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Categorization of research tasks

T1 find the first mention of a concept

T2 find a subset with relevant documents

T3 investigate quantitative results over time

T3.a compare quantitative results for two terms

T3.b compare quantitative results from two corpora

T4 tasks using external tools on archive data

Page 5: Impact Analysis of OCR Quality on Research Tasks in Digital Archives


I mostly use digital archives for exploration of a topic, selecting material for close reading (T1, T2) or external processing (T4).

OCR quality in digital archives / libraries is partly very bad.

I cannot quantify its impact on my research tasks.

I would not trust quantitative analyses (T3.a, T3.b) based on this data sufficiently to use them in publications.

Page 6: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Literature

✤ OCR quality is addressed from the perspective of the collection owner/OCR software developer

✤ Usability studies for digital libraries

✤ Robustness of search engines towards OCR errors

✤ Error removal in post-processing, either systematically or manually ("intellectually")


Page 7: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Two different perspectives of quality evaluation

We care about average performance on representative subsets for generic cases.

I care about actual performance on my non-representative subset for my specific query.

Page 8: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Use case

✤ Aims:

✤ Study the impact of OCR quality on research tasks in detail

✤ Identify starting points for workarounds and/or further research

✤ Tasks T1 - T3


Page 9: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

T1: Finding the first mention

✤ Key requirement: recall

✤ 100% recall is unrealistic

✤ Aim: Find out how a scholar can assess the reliability of results


Page 10: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

First mention of … in the OCRed newspaper archive of the KB?

1618: earliest document
1642: “Amsterdam”

Page 11: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Understanding potential sources of bias and errors

Pipeline: scanning → pre-processing → OCR → post-processing → ingestion

✤ many details difficult to reconstruct

✤ essential to understand overall impact

Page 12: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

First mention of … in the OCRed newspaper archive of the KB?

1618: earliest document
1624: “Amfterdam”
1642: “Amsterdam”

Page 13: Impact Analysis of OCR Quality on Research Tasks in Digital Archives


OCR confidence values useful?

✤ Available for all items in the collection: page, word, character

✤ Only for highest ranked words / characters, other candidates missing

✤ The missing lower-ranked candidates would be required to estimate recall.


Page 14: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Confusion table

✤ Applied frequent OCR confusions to query

✤ 23 alternative spellings, but none of them yielded an earlier mention

✤ Problem: long tail

Variant    Earliest mention (DD-MM-YYYY)
Amstcrdam  16-01-1743
Amstordam  01-08-1772
Amsttrdam  04-08-1705
Amslerdam  12-12-1673
Amslcrdam  20-06-1797
Amslordam  29-06-1813
Amsltrdam  13-04-1810
Amscerdam  17-10-1753
Amsccrdam  16-02-1816
Amscordam  01-11-1813
Amsctrdam  16-06-1823
Amfterdam  already found
Amftcrdam  17-08-1644
Amftordam  31-01-1749
Amfttrdam  26-11-1675
Amflerdam  03-03-1629
Amflcrdam  01-03-1663
Amflordam  05-03-1723
Amfltrdam  01-09-1672
Amfcerdam  22-04-1700
Amfccrdam  27-11-1742
Amfcordam  -
Amfctrdam  09-10-1880

correct  confused
s        f
n        u
e        c
n        a
t        l
t        c
h        b
l        i
e        o
e        t

full table available online: http://persistent-identifier.org/?identifier=urn:nbn:nl:ui:18-23429
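
The query expansion described on this slide can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: it assumes each confusion pair from the excerpt above may be applied independently at every matching character position, and the names `CONFUSIONS` and `spelling_variants` are hypothetical. Only the ten pairs shown on the slide are used, not the full table available online.

```python
from itertools import product

# (correct, confused) pairs as listed in the slide's excerpt of the confusion table.
CONFUSIONS = [
    ("s", "f"), ("n", "u"), ("e", "c"), ("n", "a"), ("t", "l"),
    ("t", "c"), ("h", "b"), ("l", "i"), ("e", "o"), ("e", "t"),
]


def spelling_variants(term, pairs=CONFUSIONS):
    """Return all misspellings of `term` reachable by applying the confusion pairs.

    Assumption: every character is substituted independently of the others.
    """
    # Map each correct character to the characters it may be confused with.
    mapping = {}
    for correct, confused in pairs:
        mapping.setdefault(correct, []).append(confused)

    # For every position: the original character plus its possible confusions.
    options = [[ch] + mapping.get(ch, []) for ch in term]

    variants = {"".join(combo) for combo in product(*options)}
    variants.discard(term)  # keep only the alternative spellings
    return sorted(variants)


variants = spelling_variants("Amsterdam")
print(len(variants), variants[:3])
```

For “Amsterdam” the ten pairs yield 23 variants, the same number as reported on the slide. Rarer confusions from the long tail, and errors that insert or drop characters, are not covered by this approach.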

Page 15: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

First mention of … in the OCRed newspaper archive of the KB?

1618: earliest document
1619: “Amsterstam”
1624: “Amfterdam”
1642: “Amsterdam”

Page 16: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Update! Corrections for 17th century newspapers were crowdsourced!

1618: earliest document
1619: “Amsterstam”
1620: “Amsterdam” (new, after the corrections)
1624: “Amfterdam”
1642: “Amsterdam”

Page 17: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

… but why not 1619?

Page 18: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Task | Confusion Matrix | OCR Confidence Values | Alternative Confidence Values
available: | sample only | full corpus | not available
T1 | find all queries for x, impractical | estimated precision, not helpful | improve recall
T2 | as above | estimated precision, requires improved UI | improve recall
T3 | pattern summarized over set of alternative queries | estimates of corrected precision | estimates of corrected recall
T3.a | warn for different susceptibility to errors | as above, warn for different distribution of confidence values | as above
T3.b | as above | as above | as above
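
One way to read the T3 entry for the confusion matrix, "pattern summarized over set of alternative queries", is to sum the hit counts per year over all spelling variants of a term. The sketch below assumes this reading; `aggregate_counts` is a hypothetical helper, and the counts in `toy_hits` are made up for illustration and are not taken from the archive.

```python
from collections import Counter


def aggregate_counts(hits_per_variant):
    """Sum per-year hit counts over all spelling variants of one term."""
    total = Counter()
    for yearly_counts in hits_per_variant.values():
        total.update(yearly_counts)  # adds the counts year by year
    return dict(sorted(total.items()))


# Hypothetical hit counts per year for two query variants (illustration only).
toy_hits = {
    "Amsterdam": {1642: 3, 1643: 5},
    "Amfterdam": {1624: 1, 1642: 2},
}

combined = aggregate_counts(toy_hits)
print(combined)
```

For T3.a the same aggregation would be run for both terms. The two terms may still differ in how many confusable characters they contain, which is the susceptibility the table warns about.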


Page 19: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

No silver bullet

✤ we propose novel strategies that solve part of the problem:

✤ critical attitude (awareness and better support)

✤ transparency (provenance, open source, documentation, …)

✤ alternative quality metrics (taking research context into account)


Page 20: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Conclusions

Problems

✤ Scholars see OCR quality as a serious problem, but cannot assess its impact

✤ OCR technology is unlikely to be perfect

✤ OCR errors are reported in terms of averages measured over representative samples

✤ Impact on a specific research task cannot be assessed based on average error metrics

Start of solutions

✤ Impact of OCR is different for different research tasks, so these tasks need to be made explicit

✤ OCR errors are often assumed to be random, but they are often partly systematic

✤ Tool pipelines and their limitations need to be transparent & better documented

Page 21: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Translate the established tradition of source criticism to the digital world and create a new tradition of tool criticism to systematically identify and explain technology-induced bias.

#toolcrit