Dynamic Data in the humanities - EUDAT - Research Data ... · Dynamic Data in the humanities Marc...

Dynamic Data in the humanities Marc Kemps-Snijders [email protected] EUDAT Dynamic Data Amsterdam September 25 th 2014


Page 1: Dynamic Data in the humanities - EUDAT - Research Data ... · Dynamic Data in the humanities Marc Kemps-Snijders Marc.kemps.snijders@meertens.knaw.nl EUDAT Dynamic Data Amsterdam

Dynamic Data in the humanities

Marc Kemps-Snijders

Marc.kemps.snijders@meertens.knaw.nl

EUDAT Dynamic Data

Amsterdam

September 25th 2014

Page 2

Dynamic data approach

[Figure: Archive Ingest time plotted against Observation time, with a 45° diagonal]

Ideally, data is stored the moment it is observed

Usually, data arrives late

... or never at all

From: EUDAT meeting, September 2013, Barcelona

Page 3

Over 2 M pages in 10,000 books from the period 1781 to 1800

Over 84 M unique articles from 1618 to 1996

20 B words

Digitization started around 2000

- Scientists and general public

Provide accurately dated title, author and geographical information

85,957 titles

92,276 authors

157,432 dependent titles

Creating uniformity and standardization across heterogeneous collections

Page 4

Data behaviour in a humanities Virtual Research Environment

[Figure: Archive Ingest time plotted against Time. Labels: Book, 1623; Author, 1587-1679; Sept 2013; SAME record]

Phenomena are recorded in a single record (metadata)

Data arrives VERY late

Records are often related, e.g. books and authors

Data needs to be curated ...

Page 5

Metadata curation example 1.

Sometimes authors appear twice in our system, e.g. due to spelling variants or name variants.

In the 16th century authors sometimes published under their motto rather than their own name.

Example:

• "Liefde verwinnet al" (Love conquers all)

• "door Eén is 't nu voldaen" (by One it is all done now)

Joost van den Vondel (17 November 1587 - 5 February 1679, aged 91)
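The duplicate-author problem can be sketched as an alias table that maps every known spelling variant or motto to one canonical author record. The Python below is a minimal illustration and not the curation tooling used in the project: the identifier `vondel`, the `normalize` helper and the table contents are assumptions, and it assumes, as the slide suggests, that both mottos belong to Joost van den Vondel.

```python
import unicodedata
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Lower-case, strip accents and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


# Hypothetical alias table: every variant or motto points at one canonical id.
ALIASES: Dict[str, str] = {
    normalize("Joost van den Vondel"): "vondel",
    normalize("J. van den Vondel"): "vondel",
    normalize("Liefde verwinnet al"): "vondel",
    normalize("door Eén is 't nu voldaen"): "vondel",
}


def resolve_author(name: str) -> Optional[str]:
    """Return the canonical author id for a name, spelling variant or motto."""
    return ALIASES.get(normalize(name))


def merge_duplicates(records: List[dict]) -> Dict[str, List[str]]:
    """Group titles under canonical author ids; unknown names keep their own key."""
    merged: Dict[str, List[str]] = {}
    for record in records:
        key = resolve_author(record["author"]) or normalize(record["author"])
        merged.setdefault(key, []).append(record["title"])
    return merged
```

A table like this only covers variants that a curator has already identified; finding new candidates (for instance by fuzzy matching) would still need manual review.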

Page 6

Versioning and reproducibility

[Figure: Archive Ingest time plotted against Time, with a 45° diagonal. Labels: Het lof der zeevaert, Poem, 1623; Lucifer, Drama, 1654; Joost van den Vondel, 1587-1679; ingest times Oct 2013 and Jan 2014]

Reproducibility prevents objects from being thrown away

Query 1: How many titles are available for Vondel?

Answer: one

Query 1, repeated after a later ingest: How many titles are available for Vondel?

Answer: two

Query 1 is not reproducible

Add Archive Ingest Time (AIT) stamp

Add expiration (Exp) time stamp

Select title where ArchiveIngestTime(title) < ArchiveIngestTime(query) and ExpirationTime(title) > ArchiveIngestTime(query)

(a record without an expiration time stamp counts as still valid)

AIT: Oct 2013, Exp: Jan 2014

AIT: Jan 2013, Exp: -

AIT: Nov 2013, Exp: -
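The time-stamp rule can be sketched in Python. This is a minimal illustration, not the Nederlab implementation: the `Version` class, the `titles_as_of` function and the exact dates are assumptions made for the example, and a missing expiration ("Exp: -") is modelled as `None`, meaning the version is still current.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    title: str
    author: str
    ingested: date            # Archive Ingest Time (AIT)
    expires: Optional[date]   # expiration time stamp; None means "still current"


def titles_as_of(versions: List[Version], author: str, query_time: date) -> List[str]:
    """Titles for `author` that were valid in the archive at `query_time`."""
    return sorted({
        v.title
        for v in versions
        if v.author == author
        and v.ingested < query_time
        and (v.expires is None or v.expires > query_time)
    })


# Illustrative records; the day-of-month values are invented for the example.
ARCHIVE = [
    Version("Het lof der zeevaert", "Vondel", date(2013, 10, 1), None),
    Version("Lucifer", "Vondel", date(2014, 1, 1), None),
]

print(len(titles_as_of(ARCHIVE, "Vondel", date(2013, 11, 15))))  # prints 1
print(len(titles_as_of(ARCHIVE, "Vondel", date(2014, 2, 1))))    # prints 2
```

Because the query carries its own time stamp, re-running the first query later with the same `query_time` still returns one title, which is the reproducibility the slide asks for. The price is that expired versions have to be kept rather than deleted.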

Page 7

Data curation example 2.

Editions provide an additional challenge

• Recently published

• Consist of fragments of modern and old Dutch

Editions are to be split up into source texts and editorial paratexts

[Figure: the edition "J. van den Vondel, Twee zeevaart gedichten", Marijke Spies (published 1987), containing the source texts "Joost van den Vondel, Lof der zee-vaert" (published 1623) and "Hymnus….." (published 1613)]

Page 8

Data curation example 3.

OCR-digitized newspaper articles sometimes prove to be of poor quality, e.g.

• Older articles

• WW II articles

Published June 14th 1618

VVt Venetien den 1.Iunij, Anno 1618.

'sx En 25 Mssaro is 3^ adviseert wozdel,/van H^ .,et gcyor uerracc aihi. r / twclcn

l> / zynde vele d« r srlver gtlustlreert duer onder eeulghe Franc0i>scn/die stch,net

deSpaellschcn cndc eenlghen d.ftr «lödellupden verdondcn dcse Stadt aen 50

ende meer in bzam lc stchen/ ende re plunderen ghelncllmendanaense» her plaetsen

de met vicrwerr heest glMonden/het w l̂ctle ccnc hunner mede gesellen el n deser

mlldccllr heeft / den welc- Kcn sp 2f.duuscnt ducaten hebben vereen: Als sulckr

hebben vernomen/znnderbp 70l>.wechghtloopcn Doch vanglvanzihcn / ende dcse

40. uan V^vua al hier ghcdzacht^oock noch dagnelhcnr van daer ende Verona/

Bcrgamo / en andere plaetsen ghevanckellicn gcvzacht werden: dese ol>ser Salien

dledacr toe gheholpen/zijn des nachts van wegen harcr grooter vrienden ver»

wo)5en/cut>c Komen daghclc)cllt noch Wouderlycne sanen aen den dach / sonderltjcnen

dat deSpaensche dele Stadt alsomncme wilde

Crowdsourcing projects are underway to provide accurate transcriptions

Collaboration with Royal Library

VVt Venetien den I.Iunij, Anno 1618.

DEn 25. Passato is geadviseert worden, van het groot verraet alhier, 't welck

ontdeckt is, zijnde vele der selver gerusticeert daer onder eenighe Francoysen

die sich met de Spaenschen ende eenighen deser Edelluyden verdonden dese

Stadt aen 50 plaetsen ende meer in brant te steken, ende te plonderen,

ghelijck men dan aen seker plaetsen by de 50. potten met vierwerc heeft

ghevonden, het welcke eene hunner mede gesellen aen deser Seign. ontdeckt

heeft, den welcken sy 25. duysent ducaten hebben vereert: Als sulckx die

andere hebben vernomen, zijnder by 700; Wech gheloopen. Doch 20. daer van

gevanghen, ende dese daghen 40. van Padua al-hier ghebracht, oock noch

daghelijckx van daer ende Verona, Vicenza, Bergamo, ende andere plaetsen

ghevanckelijck gebracht werden: dese onser Natien alhier die daer toe

gheholpen, zijn des nachts van wegen harer grooter vrienden verdroncken

worden, ende komen daghelijckx noch wonderlijcke saken aen den dach,
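One way to quantify how poor an OCR result is, assuming a corrected transcription is available as a reference, is the character error rate: the edit distance between the two texts divided by the length of the reference. The sketch below is an illustration and not a method described in the talk; `levenshtein` and `character_error_rate` are names chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr: str, reference: str) -> float:
    """Edit distance normalised by the reference length (0.0 means identical)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ocr, reference) / len(reference)


# The dateline is the best-preserved line: OCR read "1" where the print has "I".
ocr_line = "VVt Venetien den 1.Iunij, Anno 1618."
reference_line = "VVt Venetien den I.Iunij, Anno 1618."
print(levenshtein(ocr_line, reference_line))  # prints 1
```

Applied to the body text of the article, the same measure would come out far higher, which is what makes a manual or crowdsourced transcription necessary.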

Page 9

Annotations

Linguistic annotations are at the heart of scientific data processing, e.g. Part of Speech tagging, Named Entity Recognition, Syntactic analysis, Coreference, Semantic Role Labeling.

Ga er nog eens op uit in Amsterdam!

1  Ga         gaan       [ga]          WW(pv,tgw,ev)                        0.993151  0  ROOT
2  er         er         [er]          VNW(aanw,adv-pron,stan,red,3,getal)  0.972222  1  mod
3  nog_eens   nog_eens   [nog]_[eens]  BW()_BW()                            0.980727  1  mod
4  op         op         [op]          VZ(fin)                              0.920000  1  pc
5  uit        uit        [uit]         VZ(fin)                              0.936170  4  hdf
6  in         in         [in]          VZ(init)                             0.998321  1  mod
7  Amsterdam  Amsterdam  [Amsterdam]   SPEC(deeleigen)                      1.000000  0  ROOT
8  !          !          [!]           LET()                                0.995005

Most tools need to be trained or are designed to deal with specific language periods (commonly modern language).

The result often needs to be manually corrected.

Interoperability across tools is often an issue (tagsets and processing methods).

Lemma="Amsterdam": Postag=SPEC(deeleigen) (Frog), Postag=N(eigen,ev,basis,onz,stan) (Alpino)

Word="Ga": Lemma="ga" (Frog), Lemma="uit_gaan" (Alpino)
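The tagset mismatch can be made concrete by parsing the column output of the example sentence and comparing only the coarse tag heads. This is a minimal sketch, assuming whitespace-separated columns in the order shown in the example (index, token, lemma, morphology, PoS tag, confidence, optional head and relation); the field names and the `coarse_tag` helper are assumptions for illustration and are not part of Frog or Alpino.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    index: int
    word: str
    lemma: str
    morphology: str
    pos: str
    confidence: float
    head: Optional[int] = None
    relation: Optional[str] = None


def parse_row(line: str) -> Token:
    """Parse one whitespace-separated row; head and relation may be absent."""
    fields = line.split()
    if len(fields) not in (6, 8):
        raise ValueError(f"unexpected number of columns: {len(fields)}")
    token = Token(int(fields[0]), fields[1], fields[2], fields[3],
                  fields[4], float(fields[5]))
    if len(fields) == 8:
        token.head = int(fields[6])
        token.relation = fields[7]
    return token


def coarse_tag(pos: str) -> str:
    """Return the part of a tag before the first parenthesis, e.g. 'SPEC'."""
    return pos.split("(", 1)[0]


row = parse_row("7 Amsterdam Amsterdam [Amsterdam] SPEC(deeleigen) 1.000000 0 ROOT")
frog_tag = coarse_tag(row.pos)
alpino_tag = coarse_tag("N(eigen,ev,basis,onz,stan)")
print(frog_tag, alpino_tag, frog_tag == alpino_tag)  # prints: SPEC N False
```

Even at this coarse level the two tools disagree on "Amsterdam", so combining their output would need an explicit mapping between tagsets rather than a string comparison.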

Page 10

Annotations

Ideally produce

• Training corpora (manually corrected)

• Preprocessed annotated data (sometimes using different tools)

• (Manually) corrected annotated data

[Diagram: nodes Book, Training corpus (manually corrected, e.g. from the same time period), Annotation and Manually corrected Annotation; relation labels "Used training corpus", "Processed resource" and "Based on"]
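The relations in the diagram can be read as provenance links attached to each annotation layer. The sketch below is one possible reading, not the project's data model: the class and field names mirror the diagram labels ("Used training corpus", "Processed resource", "Based on"), and identifiers such as `book-1623` are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Annotation:
    identifier: str
    processed_resource: str               # the book (or other text) that was annotated
    used_training_corpus: Optional[str]   # corpus the tool was trained on, if any
    based_on: Optional[str] = None        # earlier annotation that this one corrects
    manually_corrected: bool = False


def lineage(annotation_id: str, annotations: Dict[str, Annotation]) -> List[str]:
    """Follow 'based on' links back to the first automatic annotation."""
    chain: List[str] = []
    current = annotations.get(annotation_id)
    while current is not None and current.identifier not in chain:
        chain.append(current.identifier)
        current = annotations.get(current.based_on) if current.based_on else None
    return chain


automatic = Annotation("ann-auto", "book-1623", "corpus-17th-century")
corrected = Annotation("ann-corrected", "book-1623", "corpus-17th-century",
                       based_on="ann-auto", manually_corrected=True)
store = {a.identifier: a for a in (automatic, corrected)}
print(lineage("ann-corrected", store))  # prints ['ann-corrected', 'ann-auto']
```

Keeping these links explicit means a corrected annotation can always be traced back to the tool run and training corpus it started from.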

Page 11

Nederlab Virtual Research Environment

With over 37.5 M documents and 1,277,188,758 words currently available in the environment, this becomes quite a difficult process to manage.

We have ongoing discussions on acceptable methods for maintaining this environment over prolonged periods of time.

• How to handle dynamic behaviour of data?

• Under which conditions can data be phased out?

• Should ALL data be integrated into the environment?

At least for metadata management, a separate editorial environment has been set up to limit the number of potential updates (and versions) in the system.

Page 12

Nederlab Virtual Research Environment

[Diagram labels: source collection (over 2 M pages in 10,000 books from the period 1781 to 1800); Harmonization tool; Metadata editor; VRE]

Page 13

Concluding remarks

Efficient versioning appears to be the key to dynamic data management

• Maintain version history

• Assign appropriate time stamps

• When dealing with large quantities of data, decide upon criteria for phasing out data

• When dealing with heterogeneous collections from different sources, including automated enrichment processes, great care must be taken to maintain overall data integrity

– Both data and metadata may be affected

– Must be evaluated on a case-by-case basis

– In our domain, data dynamics is not limited to a single project or organization! Data may originate from different overlapping sources and different approaches may have been applied (e.g. data enrichment processes)

Page 14

Thank you for your attention

Marc Kemps-Snijders

Marc.kemps.snijders@meertens.knaw.nl

EUDAT Dynamic Data

Amsterdam

September 25th 2014