Language tools bne-5-10-2011

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Computer Lexica in OCR and Retrieval Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)

description

Presentation on language tools, given by Jesse de Does and Katrien Depuydt during the demo session held at the BNE on 5 October 2011.

Transcript of Language tools bne-5-10-2011

Page 1: Language tools bne-5-10-2011


Computer Lexica in OCR and Retrieval

Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)

Page 2: Language tools bne-5-10-2011


4 March 2009 presentation The Hague 2

Can we handle ‘de wereld’ (‘the world’)?

werreid

Page 3: Language tools bne-5-10-2011


IMPACT <Demo Day BL, 12 July 2011> 3

OCR: Abbyy FineReader SDK with built-in standard Dutch dictionary

OCR: Abbyy FineReader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch:

werreld

Page 4: Language tools bne-5-10-2011


werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

RETRIEVAL: key in the modern form WERELD and find all of the variants above
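The retrieval idea on this slide can be sketched as query expansion: the modern word typed by the user is replaced by the set of historical forms the lexicon links to it. This is only a minimal illustration. The `LEXICON` dictionary, the function names and the toy documents are assumptions made for the example and are not the IMPACT implementation.

```python
# Hypothetical mini-lexicon: modern lemma -> historical word forms
# (a handful of the variants listed on the slide).
LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werreld", "vvereld"},
}

def expand_query(modern_word, lexicon=LEXICON):
    """Return the modern word plus every variant the lexicon links to it."""
    key = modern_word.lower()
    return {key} | lexicon.get(key, set())

def search(modern_word, documents, lexicon=LEXICON):
    """Return the indices of documents containing the word or one of its variants."""
    wanted = expand_query(modern_word, lexicon)
    hits = []
    for i, text in enumerate(documents):
        tokens = {t.strip(".,;:!?'\"").lower() for t in text.split()}
        if tokens & wanted:
            hits.append(i)
    return hits

# Invented toy documents, for illustration only.
DOCS = [
    "De gantsche weerelt is vol",
    "Een boek over planten",
    "In de Waereld van heden",
]
```

Calling `search("WERELD", DOCS)` returns the documents that contain any listed variant, even though none of them contains the modern spelling.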

Page 5: Language tools bne-5-10-2011


IMPACT workshop, Bratislava, May 7, 2010


The long s problem: An example ….

OCR at start of project

A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.


Page 6: Language tools bne-5-10-2011


The long s problem: An example ….

OCR at start of project:

A. De eerde was de gevaarlykflti om de verlei¬ding aan 't Hof; de tweede de ftillie en veiligde;de derde de zwaarde, daar hy byna drie millioenenharde en onbefchaafde Menfchen beftieren moest.

Results April 2010:

A. De eerste was de gevaarlykste om de verlei-ding aan 't Hof; de tweede de stilste en veiligste;de derde de zwaarste, daar hy byna drie millioenenharde en onbeschaafde Menschen bestieren moest.

Page 7: Language tools bne-5-10-2011


The long s problem: An example ….


Workaround (“integrated postcorrection”): tell the engine that “eerfte” is OK, and postcorrect it to “eerste” afterwards with the lexicon.

In this way we keep it from turning into “eerde” (earth) instead of “eerste” (first).
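The workaround can be sketched as follows: for every lexicon word, generate the spelling the engine tends to produce when it reads long s as f, add that spelling to the OCR dictionary, and map it back to the real word afterwards. This is a minimal sketch. The rule that every non-final s was printed as long s, the function names and the tiny word list are assumptions for illustration, not the actual IMPACT procedure.

```python
def long_s_form(word):
    """Spell a word as OCR tends to read it when long s is taken for f.

    Assumption for this sketch: every non-final 's' was printed as long s.
    """
    if len(word) < 2:
        return word
    return word[:-1].replace("s", "f") + word[-1]

def build_postcorrection_table(lexicon_words):
    """Map each 'f for long s' spelling back to the real lexicon word."""
    table = {}
    for word in lexicon_words:
        garbled = long_s_form(word)
        # Never override a spelling that is itself a real word.
        if garbled != word and garbled not in lexicon_words:
            table[garbled] = word
    return table

def postcorrect(tokens, table):
    """Replace accepted 'f' spellings by the intended lexicon word."""
    return [table.get(t, t) for t in tokens]

LEXICON = {"eerste", "gevaarlykste", "stilste", "eerde"}
TABLE = build_postcorrection_table(LEXICON)
# Words the OCR engine is told to accept: real words plus their long-s readings.
OCR_DICTIONARY = LEXICON | set(TABLE)
```

Because “eerfte” is in the dictionary handed to the engine, the engine has no reason to prefer “eerde”, and the table restores “eerste” in a second pass.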

Page 8: Language tools bne-5-10-2011


Overview

What is a computer lexicon

Lexica in IMPACT

Tools for lexicon building and applying lexica

Some results

Searching Demonstration

Page 9: Language tools bne-5-10-2011


What is a computer lexicon?

Page 10: Language tools bne-5-10-2011


Computer lexicon vs electronic dictionary (1)

An electronic dictionary is:
- Digitised full text (no pictures)
- For human use
- Ideally: searchable, with explicitly coded material (XML), such as lemma, part of speech (PoS), meaning, quotes etc.
- Examples: OED online, WNT online

Page 11: Language tools bne-5-10-2011


Dictionary XML (example)


Page 13: Language tools bne-5-10-2011


Computer Lexicon vs Electronic Dictionary (2)

A computer lexicon is:
- Always in a structured digital format (XML, relational database)
- Main purpose: computer application
- Explicitly coded information (e.g. lemma wereld, part of speech noun, morphology werelden, werelds …, syntax)

Examples of use:

- Linguistic enrichment of text material
- ‘Advanced’ searching (words with all spelling variants and inflections)
- Automatic summarization, keyword extraction…


Page 15: Language tools bne-5-10-2011


Lexica in IMPACT

Page 16: Language tools bne-5-10-2011


The OCR lexicon

An OCR lexicon is:
- A checked list of words in a language
- Based on a corpus (collection) of dated texts (selection!)
- Preferably with frequency information
- Preferably from the same time period or of the same text type as the texts you wish to digitize

Page 17: Language tools bne-5-10-2011


OCR lexicon: example (word, frequency)

1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798

> 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
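A period-specific word list with frequencies, like the two above, can be derived from a corpus of dated texts by counting word forms within a date range. The sketch below shows the counting step only. The function name and the toy corpus are assumptions, and a real OCR lexicon would still need the manual checking mentioned on the previous slide.

```python
from collections import Counter

def build_frequency_lexicon(dated_texts, start_year, end_year):
    """Count word forms in all texts dated within [start_year, end_year]."""
    counts = Counter()
    for year, text in dated_texts:
        if start_year <= year <= end_year:
            # Keep the original capitalisation, as the slide's list does.
            counts.update(tok for tok in text.split() if tok.isalpha())
    return counts

# Invented toy corpus of (year, text) pairs, for illustration only.
CORPUS = [
    (1600, "the fyrst song and the fyrst work"),
    (1650, "a song of water"),
    (1950, "jazz on television and jazz on video"),
]

early = build_frequency_lexicon(CORPUS, 1550, 1750)
modern = build_frequency_lexicon(CORPUS, 1900, 2000)
```

Words such as “fyrst” then appear only in the early list and “jazz” only in the modern one, which is why a lexicon from the right period suits the material better.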

Page 18: Language tools bne-5-10-2011


The IR lexicon

IR lexicon: most important information categories

word forms (lists of words) +

- frequency information
- quotes (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (comparable to a dictionary headword) linked to spelling variants and inflected forms of the same word

The modern lemma is used for searching in texts

Standard use in corpus linguistics and modern historical lexicography

Page 19: Language tools bne-5-10-2011


<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
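A lexicon entry in this shape can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the entry. The closing `</lexical_entry></lexicon>` tags are added as an assumption (the slide excerpt stops before them), the quote is abbreviated, and `read_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's entry; closing tags added so it parses.
SAMPLE = """<lexicon><lexical_entry>
<lemma_id>219490</lemma_id><modern_lemma>aantuilen</modern_lemma><POS>VRB</POS>
<wordform><form_representation>
<wordform_id>850026</wordform_id><written_form>tuyld</written_form>
<attestation><id>92141</id>
<quote>Sy acht het boertery, en tuyld daer weer op an</quote>
<document_id>204</document_id>
</attestation>
</form_representation></wordform>
</lexical_entry></lexicon>"""

def read_entries(xml_text):
    """Return one dict per lexical entry: lemma, PoS, word forms and quotes."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("lexical_entry"):
        entries.append({
            "lemma": entry.findtext("modern_lemma"),
            "pos": entry.findtext("POS"),
            "forms": [wf.text for wf in entry.iter("written_form")],
            # itertext() also collects text inside inline markup such as <I>.
            "quotes": ["".join(q.itertext()).strip() for q in entry.iter("quote")],
        })
    return entries

ENTRIES = read_entries(SAMPLE)
```

The result links the modern lemma “aantuilen” to the attested historical form “tuyld” together with the quotation in which it occurs.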

Page 20: Language tools bne-5-10-2011


Tools for lexicon building and application of lexica

Page 21: Language tools bne-5-10-2011


Types of variation (spelling, inflection…)

uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

I

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

II

(patterns to predict variation)

(a number of variants are predictable with patterns; others need to be taken from a lexicon)

Page 22: Language tools bne-5-10-2011


Neil Fitzgerald, 7th July 2011 22

Page 23: Language tools bne-5-10-2011


Computer lexica

- For OCR and OCR post-correction
- Improving searchability of historical text material
  - by building a lexicon with variants
  - by using a modern lemma as a search entry
- Tools for lexicon building
- Tools for application of the lexicon in search engines
- Lexicon cookbook

Page 24: Language tools bne-5-10-2011


Tools (more specific)

- Lexicon building from corpus material and dictionaries

- Use of lexica in search engines

- Tool to extract spelling variation patterns from historical material

- Tool to relate previously unrecognised spelling variants to their standard form

- Tool to reduce previously unrecognised inflected forms to their base form

Page 25: Language tools bne-5-10-2011


Spelling variation tools (pattern-based)

Language-independent approach:

Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, …

Pattern weights are computed from example material

Additional approaches are possible, e.g. use of aligned data (parallel historical text and modern version)
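Pattern induction from word pairs can be sketched by aligning each modern word with its historical counterpart and counting the substrings that differ. The version below aligns with `difflib.SequenceMatcher` from the Python standard library and uses relative frequency as the weight. Both choices, the function names and the toy pairs are assumptions for illustration and are not necessarily how the IMPACT tool works.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_patterns(modern, historical):
    """Return the (modern substring, historical substring) pairs that differ."""
    patterns = []
    matcher = SequenceMatcher(None, modern, historical, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            patterns.append((modern[i1:i2], historical[j1:j2]))
    return patterns

def induce_weighted_patterns(pairs):
    """Count patterns over all training pairs; weight = relative frequency."""
    counts = Counter()
    for modern, historical in pairs:
        counts.update(extract_patterns(modern, historical))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}

# Toy training pairs (modern word, historical word), for illustration only.
PAIRS = [("wereld", "werelt"), ("held", "helt"), ("zwaar", "swaar")]
WEIGHTS = induce_weighted_patterns(PAIRS)
```

With these three pairs the pattern d/t is seen twice and z/s once, so d/t receives the higher weight.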

Page 26: Language tools bne-5-10-2011


Lemmatization

Reduction of historical word forms to a modern lemma

Historical word -> (1) standard (“modern”) spelling (pattern matching) -> (2) lemma form (lemmatizer)

Dystels -> (1) distels -> (2) distel

When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup.

But:
1) We will not have full form information for many lemmata (especially the historical ones)
2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
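The two steps can be sketched as pattern rewriting followed by lexicon lookup. In the sketch below the spelling patterns, the tiny full form lexicon and the function names are all assumptions made for the example. A real system would weight the patterns rather than apply them exhaustively.

```python
# Hypothetical spelling patterns (historical -> modern).
PATTERNS = [("y", "i"), ("ae", "aa"), ("ck", "k")]

# Hypothetical modern full form lexicon: word form -> lemma.
FULL_FORM_LEXICON = {
    "distels": "distel",
    "distel": "distel",
    "jaar": "jaar",
    "jaren": "jaar",
}

def modern_candidates(historical_word, patterns=PATTERNS):
    """Step 1: generate candidate modern spellings by applying the patterns."""
    candidates = {historical_word.lower()}
    for old, new in patterns:
        # Keep both the rewritten and the unrewritten spelling.
        candidates |= {c.replace(old, new) for c in candidates}
    return candidates

def lemmatize(historical_word, lexicon=FULL_FORM_LEXICON, patterns=PATTERNS):
    """Step 2: look the candidates up and return (modern form, lemma) pairs."""
    candidates = modern_candidates(historical_word, patterns)
    return sorted((c, lexicon[c]) for c in candidates if c in lexicon)
```

For the slide's example, `lemmatize("Dystels")` yields the modern form “distels” and the lemma “distel”. A word with no candidate in the lexicon returns an empty list, which is exactly the gap the two caveats above describe.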

Page 27: Language tools bne-5-10-2011


Lemmatization and reverse lemmatization

We also need a lemmatization process for these situations.

A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms, usually based on patterns relating the inflected form to the standard form.

But: matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu).

We adopt the solution of actually constructing the “hypothetical modern full form lexicon”, containing the most plausible paradigmatic expansions of lemmata.

This construction is carried out by means of a statistical reverse lemmatizer.
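To make the idea of reverse lemmatization concrete, the sketch below learns suffix rules from known (lemma, inflected form) pairs and applies them to a new lemma, ranking the generated forms by how often the rule was seen. It is a deliberately simple count-based stand-in, not the statistical reverse lemmatizer used in IMPACT. The function names and training pairs are assumptions.

```python
from collections import Counter

def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def suffix_rule(lemma, form):
    """Return (lemma suffix, form suffix) after removing the shared prefix."""
    n = common_prefix_len(lemma, form)
    return lemma[n:], form[n:]

def train(pairs):
    """Count how often each suffix rule occurs in the training pairs."""
    return Counter(suffix_rule(lemma, form) for lemma, form in pairs)

def expand(lemma, rules):
    """Generate hypothetical inflected forms, most frequent rule first."""
    forms = {}
    for (old, new), count in rules.items():
        if lemma.endswith(old):
            form = lemma[:len(lemma) - len(old)] + new
            forms[form] = forms.get(form, 0) + count
    return sorted(forms.items(), key=lambda item: (-item[1], item[0]))

# Toy training pairs (lemma, inflected form), for illustration only.
TRAINING = [("distel", "distels"), ("lepel", "lepels"),
            ("vogel", "vogels"), ("wereld", "werelden")]
RULES = train(TRAINING)
```

`expand("tafel", RULES)` proposes “tafels” ahead of “tafelen”. The generated forms are hypothetical until they are attested in real text.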

Page 28: Language tools bne-5-10-2011


Attestation

- From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
- Automatic selection of candidate attestations
- Manual work: verification and correction

Two approaches

- Dictionary based (INL): Woordenboek der Nederlandsche Taal
- Corpus based (LMU, INL): Dutch DBNL corpus

Page 29: Language tools bne-5-10-2011


IMPACT Dictionary Attestation Tool

Headword: work

Quotations:

• We are working on what works.

• Depart from me, ye that worke iniquity.

• She worcketh knittinge of stockings.

Task: Find the variants of a headword as they occur in the quotations

Lexicon building at work: Verifying attestations in historical dictionaries

Page 30: Language tools bne-5-10-2011


IMPACT Dictionary Attestation Tool

Automatically (preprocessing)

• match literally, e.g.: work -> work, Work

• match using existing lexica and lists, e.g.: work -> works, worked, wrought

• approximate matching, e.g.: work -> worke

By hand (using the tool)

• correct automatic mismatches, e.g.: works -> words, worms

• find missed matches, e.g.: work -> worketh, wrowght

Task Find the variants of a headword as they occur in the quotations

Electronic historical dictionary

Database with lemmata and quotations
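The automatic preprocessing described on this slide can be sketched as three checks applied to each token of a quotation: literal match, match against known forms, and approximate match. The similarity measure (`difflib.SequenceMatcher.ratio`), the 0.8 threshold, the `KNOWN_FORMS` table and the function names are assumptions for this example, not the Attestation Tool's actual method.

```python
from difflib import SequenceMatcher

# Hypothetical list of known forms per headword (from existing lexica).
KNOWN_FORMS = {"work": {"works", "worked", "wrought"}}

def match_type(headword, token, known_forms=KNOWN_FORMS, threshold=0.8):
    """Classify a token as a literal, lexicon or approximate match, or None."""
    head, tok = headword.lower(), token.lower()
    if tok == head:
        return "literal"
    if tok in known_forms.get(head, set()):
        return "lexicon"
    if SequenceMatcher(None, head, tok).ratio() >= threshold:
        return "approximate"
    return None

def candidate_attestations(headword, quotation):
    """Return (token, match type) for every candidate variant in a quotation."""
    hits = []
    for raw in quotation.split():
        token = raw.strip(".,;:!?")
        kind = match_type(headword, token)
        if kind:
            hits.append((token, kind))
    return hits
```

With this threshold “worke” is found automatically, while “worcketh” is missed and would have to be added by hand, which mirrors the division of labour between preprocessing and manual work.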

Page 31: Language tools bne-5-10-2011


IMPACT Attestation Tool

Lemma headword

Quotations

Sorted by uncertainty

Up-to-date overview of what is done and needs to be done

Done by this user so far

Page 32: Language tools bne-5-10-2011


IMPACT Lexicon Tool

Automatically (preprocessing = apply lemmatizer)

• match literally, e.g.: work -> work, Work

• match using existing lexica and lists, e.g.: work -> works, worked, wrought

• matching using spelling variation module, e.g.: uiterlijk -> uyterlick

By hand (using the tool)

• assign correct lemma, e.g.: was (N) -> zijn (V)

• group tokens belonging together, e.g.: konings zoon -> koningszoon

• select attestations

Task Find and verify attestations in a historical corpus

Page 33: Language tools bne-5-10-2011


Corpus-based lexicon building: Impact Lexicon Tool

Page 34: Language tools bne-5-10-2011


General vocabulary vs. Named entities

Tools for lexicon building described so far: applicable to general lexicon

Tools for NE recognition, classification and variant matching

- library requirement
- distinguish general vocabulary from NEs
- avoid unpleasant mix-ups like Abimelech -> apemelk! (b/p; i/e; e/0; k/ch)
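The Abimelech example can be reproduced with a small sketch: applying the four patterns from the slide one after another turns the name into an ordinary word, unless named entities are excluded from variant matching first. The breadth-first rule application, the step limit and the function names are assumptions for illustration.

```python
# The four patterns from the slide, read as rewrite rules (e/0 = delete 'e').
RULES = [("b", "p"), ("i", "e"), ("e", ""), ("ch", "k")]

def one_step(word, rules=RULES):
    """All words obtained by applying one rule at one position."""
    results = set()
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            results.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return results

def reachable(word, max_steps, rules=RULES):
    """All words reachable from `word` in at most `max_steps` rule applications."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        frontier = {w for f in frontier for w in one_step(f, rules)} - seen
        seen |= frontier
    return seen

def match_variant(token, vocabulary, named_entities, max_steps=4):
    """Match a token to general vocabulary, but never touch a known NE."""
    if token in named_entities:
        return None
    return sorted(reachable(token.lower(), max_steps) & vocabulary)
```

Without the named entity check, `match_variant("Abimelech", {"apemelk"}, set())` happily returns “apemelk”. With the name in the NE list, no variant matching is attempted.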

Page 35: Language tools bne-5-10-2011


Improvement of state of the art / innovation

We use existing computational linguistic approaches, but figure out how to apply them to historical language

We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
- Data selection and acquisition
- Manual work
- Computational linguistics tools

Page 36: Language tools bne-5-10-2011


Languages in IMPACT

Dutch, German, English, Spanish, French

Polish, Czech, Slovene and Bulgarian

- Cross-language perspective paper

- Parallel OCR and IR experiments

- GT datasets

- Language tools: language independent

- Apart from the 3 core languages: proof-of-concept lexica


Page 37: Language tools bne-5-10-2011


OCR evaluation results (preliminary!)

Page 38: Language tools bne-5-10-2011


1. Czech

Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848

Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848

Homerowa Iliada, 1802

Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805

Plody sborů učenců řeči českoslowanské prešporského, 1836

Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830

Sokol, 1872

Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840


Page 40: Language tools bne-5-10-2011


2. Dutch

18th and 19th century books, newspapers, parliamentary papers

Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852

Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796

Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784

Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795

Page 41: Language tools bne-5-10-2011


Precision: 0.8432889410216431, Recall: 0.843331934927516


Page 43: Language tools bne-5-10-2011


English

16th-19th century material

Sources for lexicon building: OED, ECCO


Page 45: Language tools bne-5-10-2011


French

17th century books

Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653

Dissertation de la philosophie en général, 1668

La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673

Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677

Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693


Page 47: Language tools bne-5-10-2011


German

Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501

Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884

Echo Deß Hochzeitlichen Te Deum Laudamus, 1722

Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887

Quedlinburgisches Kreis-Tags-Memorial, 1673

Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779

Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609


Page 49: Language tools bne-5-10-2011


Polish

Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621

Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621

Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610

Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632

Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746

Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613

Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601

Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650

Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634

Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634

Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589


Page 51: Language tools bne-5-10-2011


Slovene

Genovefa, 1841

Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za mlade ljud..., 1850

Kmetijske in rokodelske novice, 1844

Kratkozhasne uganke, 1788

Kuharske Bukve, 1799

Marianske Kempensar, ali Dvoje bukuvze, 1769

Novice kmetijskih, rokodelnih in narodskih reči, 1851

Sgodbe svetiga pisma za mlade ljudi, 1830

Ta male katechismus, 1768

Vezhna pratika od gospodarstva, 1789

Zerkviza na skali, 1855


Page 53: Language tools bne-5-10-2011


Retrieval demonstrator

Indexing and retrieval library (Java) built on the Lucene search engine

Lexicon in MySQL database

OCR of about 2000 images of the Dutch Ground Truth selection, using the FineReader SDK and its external dictionary interface

Page XML output [in framework]

NE tagging

Indexing and retrieval while using lexicon and NE tagging
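Indexing with a lexicon can be illustrated by normalising each token to its modern lemma at index time, so that a query for the modern word reaches every page containing a variant. The demonstrator itself is described as a Java library on Lucene with the lexicon in MySQL. The Python sketch below is only a stand-in with an in-memory inverted index, and `FORM_TO_LEMMA`, the function names and the toy pages are assumptions.

```python
from collections import defaultdict

# Hypothetical lookup table: historical or inflected form -> modern lemma.
FORM_TO_LEMMA = {
    "werelt": "wereld",
    "waereld": "wereld",
    "werelden": "wereld",
    "wereld": "wereld",
}

def tokenize(text):
    """Lower-case the tokens and strip surrounding punctuation."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in tokens if t]

def build_index(documents, form_to_lemma=FORM_TO_LEMMA):
    """Inverted index keyed by modern lemma (or by the token if unknown)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[form_to_lemma.get(token, token)].add(doc_id)
    return index

def lookup(index, modern_lemma):
    """Return the sorted ids of all documents indexed under the lemma."""
    return sorted(index.get(modern_lemma.lower(), set()))

# Invented toy pages, for illustration only.
PAGES = {"p1": "De werelt is groot.", "p2": "Alle werelden", "p3": "Een boek"}
INDEX = build_index(PAGES)
```

`lookup(INDEX, "wereld")` returns both pages with a historical or inflected form. Normalising at index time keeps queries simple, at the cost of re-indexing whenever the lexicon grows.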
