Post on 07-Oct-2020

Supported by: EU Cultural Heritage:

http://www.transcriptorium.eu

tranScriptorium aims to develop innovative, efficient and cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using modern, holistic Handwritten Text Recognition technology.

tranScriptorium will turn Handwritten Text Recognition (HTR) into a mature technology by addressing the following objectives:

1 Enhancing HTR technology for efficient transcription

Departing from state-of-the-art HTR approaches, the project will capitalize on interactive-predictive techniques for effective and user-friendly computer-assisted transcription.

2 Bringing the HTR technology to users

Expected users of the HTR technology belong mainly to two groups: a) individual researchers with experience in transcribing handwritten documents who are interested in transcribing specific documents; b) volunteers who collaborate in large transcription projects.

3 Integrating the HTR results in public web portals

The HTR technology will support the digitization of handwritten materials. The outcomes of the project's tools will be attached to the published handwritten document images. This includes not only full, correct transcriptions, but also partially correct transcriptions and other kinds of automatically produced metadata that are useful for indexing and searching.
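The interactive-predictive idea in objective 1 can be illustrated with a minimal sketch. It assumes, hypothetically, that the recogniser exposes a ranked n-best list of word sequences for a text line; the function names, the sample sentence and the simulated user are illustrative only and are not taken from the project's tools. Each time the user fixes the first wrong word, the validated prefix grows and the system proposes the best-ranked hypothesis that agrees with it.

```python
from typing import List, Sequence, Tuple


def best_completion(nbest: Sequence[List[str]], prefix: List[str]) -> List[str]:
    """Return the highest-ranked hypothesis that starts with the validated prefix.

    If no hypothesis agrees with the prefix, fall back to the prefix alone.
    """
    for hyp in nbest:
        if hyp[:len(prefix)] == prefix:
            return list(hyp)
    return list(prefix)


def interactive_transcribe(nbest: Sequence[List[str]],
                           reference: List[str]) -> Tuple[List[str], int]:
    """Simulate a user who corrects the first wrong word until the line is right.

    `reference` stands in for what the user knows to be the correct text.
    Returns the final transcript and the number of user corrections.
    """
    prefix: List[str] = []
    corrections = 0
    while True:
        hyp = best_completion(nbest, prefix)
        # Advance past every word the system already got right.
        i = len(prefix)
        while i < len(reference) and i < len(hyp) and hyp[i] == reference[i]:
            i += 1
        if i == len(reference):
            if len(hyp) > len(reference):
                corrections += 1  # user deletes the spurious tail
            return list(reference), corrections
        # User types the correct word at position i; the prefix is now validated.
        prefix = reference[:i + 1]
        corrections += 1


# Hypothetical n-best list for one text line, best hypothesis first.
nbest = [
    "the greatest happyness of the greatest number".split(),
    "the greatest happiness of the greatest number".split(),
    "the greatest happiness of the greater number".split(),
]
reference = "the greatest happiness of the greatest number".split()

transcript, corrections = interactive_transcribe(nbest, reference)
print(" ".join(transcript), "| corrections:", corrections)
```

In this toy setting a single correction is enough, because the second hypothesis already agrees with the corrected prefix. A real system would re-decode the rest of the line conditioned on the prefix rather than filter a fixed list.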

Project coordinator: Joan Andreu Sanchez (jandreu@prhlt.upv.es)

Project no.: 600707
Start date: 1 January 2013
End date: 31 December 2015

Research groups and institutions:

• Pattern Recognition and Human Language Technology Research Center (PRHLT) from Universitat Politecnica de Valencia (Spain)

Principal researcher: Enrique Vidal (evidal@prhlt.upv.es)

• Department of German Language and Literature Studies (UIBK) from University of Innsbruck (Austria)

Principal researcher: Gunter Muhlberger (guenter.muehlberger@uibk.ac.at)

• Computational Intelligence Laboratory (CIL) from National Center for Scientific Research “Demokritos” (Greece)

Principal researcher: Basilis Gatos (bgat@iit.demokritos.gr)

• Centre for Digital Humanities (UCLDH) from University College London (United Kingdom)

Principal researcher: Philip Schofield (p.schofield@ucl.ac.uk)

• Institute for Dutch Lexicology (INL) (Netherlands)

Principal researcher: Katrien Depuydt (Katrien.Depuydt@inl.nl)

• Digital Archives & Repositories Department (ULCC) from University of London Computer Centre (United Kingdom)

Principal researcher: Richard Davis (r.davis@ulcc.ac.uk)

Collections

[Sample document images, labelled by language: English, Dutch, German, Spanish]

Research results

• Document Image Analysis: pre-processing, image enhancement, layout analysis, skew correction, line detection.

• Interactive Handwritten Text Recognition: the user and the system interact to obtain the correct transcript.

• Key Word Spotting: query by string and query by sample.

• Linguistic Resources: language models, lexicon, abbreviations.
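As an illustration of query-by-string key word spotting, the following minimal sketch searches a small index of recognition hypotheses. It assumes, hypothetically, that each text line has already been reduced to candidate words with a confidence score between 0 and 1; the index layout, the line identifiers, the scores and the function `query_by_string` are illustrative assumptions, not the project's actual KWS tool.

```python
from typing import Dict, List, Tuple

# Hypothetical index: line identifier -> candidate words with confidence scores.
LineIndex = Dict[str, List[Tuple[str, float]]]


def query_by_string(index: LineIndex, query: str,
                    threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (line_id, score) pairs where the query word is likely present.

    The score of a line is the highest confidence among its candidates that
    match the query (case-insensitive). Lines scoring below `threshold` are
    dropped, and the remaining hits are sorted by decreasing score.
    """
    q = query.lower()
    hits = []
    for line_id, candidates in index.items():
        score = max((conf for word, conf in candidates if word.lower() == q),
                    default=0.0)
        if score >= threshold:
            hits.append((line_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up example data.
index: LineIndex = {
    "page1_line1": [("law", 0.9), ("low", 0.1)],
    "page1_line2": [("law", 0.3), ("saw", 0.7)],
    "page2_line5": [("Law", 0.6)],
}

for line_id, score in query_by_string(index, "law"):
    print(line_id, score)
```

Raising the threshold trades recall for precision, which is why partially correct recognition output can still be useful for indexing and search. Query by sample would instead compare an example word image against word images in the collection and is not covered by this sketch.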

Data and tool results

Datasets and ground truth (GT):

• English. Bentham: XVIII/XIX centuries, 80 000 pages, several hands. GT: 800 pages.

• Dutch. Four books: XV century, 2 000 pages, several hands. GT: 200 pages.

• German. Several collections: from XVI to XX centuries, 32 000 pages, several hands. GT: 200 pages.

• Spanish. Plantas: XVII century, 7 000 pages, one hand. GT: 1 000 pages. Esposalles: from XV to XX centuries, 291 books, several hands. GT: 2 books.

Tools: DIA, interactive HTR and KWS tools.
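To put the ground-truth figures in proportion, the short sketch below computes the share of each collection that has ground truth. The totals are copied from the dataset list above; the labels and the function `gt_coverage` are illustrative. Note that Esposalles is counted in books, not pages, so its percentage is not directly comparable with the others.

```python
# (total size, ground-truth size) per collection, as given in the dataset list.
# All entries are in pages except Esposalles, which is in books.
datasets = {
    "Bentham (English)": (80_000, 800),
    "Four books (Dutch)": (2_000, 200),
    "Several collections (German)": (32_000, 200),
    "Plantas (Spanish)": (7_000, 1_000),
    "Esposalles (Spanish, books)": (291, 2),
}


def gt_coverage(total: int, ground_truth: int) -> float:
    """Percentage of the collection that has ground-truth transcription."""
    return 100.0 * ground_truth / total


coverage = {name: gt_coverage(total, gt) for name, (total, gt) in datasets.items()}

for name, pct in coverage.items():
    print(f"{name}: {pct:.2f}% with ground truth")
```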