Automatic Extraction of Individual and Family Information from Primary Genealogical Records
By Charla Woodbury
October 17, 2006
2
Digital Images – Human Index
• Large number of competing family history websites
• Digital images
• Human indexes – Double entry
• Researchers hunting through records and indexes to put families together
3
Problem
Large amounts of primary genealogical data
Big projects to index and extract records
Two independent indexers and adjudication
Millions of human hours used to index or match records for names and families
4
Automated Extraction Solution
Create a specialized extraction ontology to interpret and label genealogical data
Develop expert logic and rules that
• Match and merge individuals
• Group them into families
5
Methods
Prepare for the records extraction
Run a 1st PASS to extract the information
Run a 2nd PASS to match individuals and link families
Evaluate and optimize the results
6
Prepare for Records Extraction
Build an Ontology
BYU ontology software Ontos to interpret and correctly label genealogical data using
• Dataframes
• Regular expressions
• Lexicons
• Conversion functions
“encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” Dr. David Embley
Collect machine-readable records
7
Ontology – Entity Level
8
Danish GIVEN NAME LEXICON
MALE
Anders – And.
Andreas
Christen – Kristen
Christian – Kristian
Erik – Eric
Gregers
Hans
Ib – Jep – Jeppe
Jacob
Jens
Johan – Johannes – Joh.
Jorgen – Jørgen
Knud
Lars – Laurs – Laurids – Lauritz
Mads – Mats
FEMALE
Ane – Anna – Anne
Birthe – Birte
Bodil
Caroline
Dorthe – Dorte
Ellen – Helene – Elene
Elisabeth – Elsbeth – Lisbeth
Else – Ilse
Ingeborg
Inger
Karen
Kirsten – Christen – Kirstine – Christine – Chirstine
Malene
Maren
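A lexicon like this can be read as a lookup from any variant spelling to one canonical form. The following is a minimal sketch of that idea, not the Ontos implementation: the dictionary layout and the names `GIVEN_NAME_LEXICON`, `build_variant_index`, and `normalize_given_name` are assumptions made for illustration, and only a few entries from the slide are included.

```python
# Hypothetical layout: canonical given name -> variant spellings.
# Only a handful of entries from the lexicon slide are reproduced here.
GIVEN_NAME_LEXICON = {
    "Anders": ["And."],
    "Christen": ["Kristen"],
    "Ib": ["Jep", "Jeppe"],
    "Johan": ["Johannes", "Joh."],
    "Lars": ["Laurs", "Laurids", "Lauritz"],
    "Ane": ["Anna", "Anne"],
    "Elisabeth": ["Elsbeth", "Lisbeth"],
}


def build_variant_index(lexicon):
    """Map every spelling (case-insensitive) to its canonical name."""
    index = {}
    for canonical, variants in lexicon.items():
        for spelling in [canonical, *variants]:
            index[spelling.casefold()] = canonical
    return index


VARIANT_INDEX = build_variant_index(GIVEN_NAME_LEXICON)


def normalize_given_name(token):
    """Return the canonical name, or None if the token is not in the lexicon."""
    return VARIANT_INDEX.get(token.strip().casefold())
```

For example, `normalize_given_name("Jeppe")` returns `"Ib"`, while a place name such as `"Truust"` returns `None` and would be left for another part of the ontology to label.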
9
DATE Lexicon Adds Thesaurus of Synonyms
MONTHS
January – Jan – Januar – 11br
February – Feb – Februar – 12br
March – Mar – Marts
April – Apr – Apl
May – Mai
June – Jun – Juni
July – Jul – Juli – 5br
August – Aug – Augst – 6br
September – Sep – Sept – 7br – Septembre
October – Oct – 8br – Octobre
November – Nov – 9br – Novembre
December – Dec – 10br
TIME
Year – yr – aar – år
Month – mo – maaned – m.
Week – uge – ug.
Day – dag – d.
Hour – h. – hr.
FEAST DATES
Easter – Paaske – Påske – Paasche – Påsche – P.
Pentecost – Pent – Pinse – Pin
Trinity – Tr – Trin – Trinitatis
DAYS OF WEEK
Sunday – Sun – Dominico – Dom.
Monday – Mon – Mondag – Mond.
Tuesday – Tue – Tirsdag – Tirsd.
Wednesday – Wed – Onsdag – Onsd.
Thursday – Thur – Tørsdag – Tørsd.
Friday – Fri – Fredag – Fred.
Saturday – Sat – Lørsdag – Lørs
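The month synonyms above could be turned into a lookup table roughly as follows. This is a sketch only: the names `MONTH_SYNONYMS`, `MONTH_INDEX`, and `month_number` are hypothetical, and stripping a trailing period before lookup is an assumption about how abbreviations appear in the records.

```python
# Month number -> synonyms, copied from the MONTHS list on the slide.
MONTH_SYNONYMS = {
    1: ["January", "Jan", "Januar", "11br"],
    2: ["February", "Feb", "Februar", "12br"],
    3: ["March", "Mar", "Marts"],
    4: ["April", "Apr", "Apl"],
    5: ["May", "Mai"],
    6: ["June", "Jun", "Juni"],
    7: ["July", "Jul", "Juli", "5br"],
    8: ["August", "Aug", "Augst", "6br"],
    9: ["September", "Sep", "Sept", "7br", "Septembre"],
    10: ["October", "Oct", "8br", "Octobre"],
    11: ["November", "Nov", "9br", "Novembre"],
    12: ["December", "Dec", "10br"],
}

# Reverse index: case-folded synonym -> month number.
MONTH_INDEX = {
    synonym.casefold(): number
    for number, synonyms in MONTH_SYNONYMS.items()
    for synonym in synonyms
}


def month_number(token):
    """Return 1-12 for a recognized month token, else None."""
    # Assumption: a trailing period marks an abbreviation and can be dropped.
    return MONTH_INDEX.get(token.strip().rstrip(".").casefold())
```

With this table, `month_number("7br")` gives `9` and `month_number("Marts")` gives `3`; tokens outside the MONTHS list, such as `"Pinse"`, give `None`.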
10
CONVERSION FUNCTIONS inside the ontology
Compute birth date from age at death
Death date – 22 Mar 1743
Age – 23 yr 2 m
-> BIRTH Jan 1720
Compute dates from feast dates
Sunday 23rd after Trinity 1751
-> 14 Nov 1751
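Both conversions can be sketched in a few lines. The functions below are illustrative and not the ontology's own code: `birth_month_from_age`, `gregorian_easter`, and `sunday_after_trinity` are hypothetical names, and the feast-date function assumes that Easter follows the standard Gregorian computus and that Trinity Sunday falls 56 days after Easter. Parish registers that followed a different Easter reckoning would need a different function.

```python
from datetime import date, timedelta


def birth_month_from_age(death, years, months=0):
    """Estimate (year, month) of birth from a death date and an age."""
    # Count months since year 0, subtract the age, convert back.
    total = death.year * 12 + (death.month - 1) - (years * 12 + months)
    year, month_index = divmod(total, 12)
    return year, month_index + 1


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm (an assumption here)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(year, n):
    """Date of the n-th Sunday after Trinity in the given year."""
    trinity = gregorian_easter(year) + timedelta(days=56)
    return trinity + timedelta(weeks=n)
```

Applied to the slide's examples, `birth_month_from_age(date(1743, 3, 22), 23, 2)` gives `(1720, 1)`, that is January 1720, and `sunday_after_trinity(1751, 23)` gives 14 November 1751.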
11
Collect Machine-Readable Records
12
English Parish – Wirksworth, Derby
1608-1813
13
Danish Parish – Maglebye
1646-1813
14
Sample Danish marriages
15
New England – Beverly, Mass.
1668-1849
16
2 Run a 1st pass to extract the information
Annotate the genealogical record with the ontology
Populate RDF data file
17
Annotated Town Record
SOURCE – Beverly town records
[PAGE HEADER] Births page 391
[BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668.
Annotation labels and tags:
NAME <NAME>
DATE <DATE>
PLACE <PLACE>
RELATIONSHIP <RELATION>
OCCUPATION <OCCUPATION>
RECORD_TYPE <RTYPE>
SOURCE <SOURCE>
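To make the annotation step concrete, here is a small regular-expression sketch that labels the Beverly entry shown on this slide. It is not the Ontos ontology: the pattern `ENTRY_PATTERN` and the function `annotate` are hypothetical, the pattern covers only this one entry layout, and treating "d." as a second relationship abbreviation, "b." as a second event abbreviation, and "bp." as the record type are assumptions.

```python
import re

# Hypothetical pattern for entries shaped like:
#   SURNAME, Given, s. Father and Mother, bp. DD : M m : YYYY.
ENTRY_PATTERN = re.compile(
    r"(?P<surname>[A-Z]+),\s+"
    r"(?P<given>[A-Z][a-z]+),\s+"
    r"(?P<relation>s|d)\.\s+"          # "d." is an assumed abbreviation
    r"(?P<father>[A-Z][a-z]+)\s+and\s+(?P<mother>[A-Z][a-z]+),\s+"
    r"(?P<event>bp|b)\.\s+"            # "b." is an assumed abbreviation
    r"(?P<day>\d{1,2})\s*:\s*(?P<month>\d{1,2})\s*m\s*:\s*(?P<year>\d{4})\."
)


def annotate(entry):
    """Return (tag, text) pairs for a recognized entry, else an empty list."""
    m = ENTRY_PATTERN.fullmatch(entry.strip())
    if m is None:
        return []
    return [
        ("NAME", f"{m['surname']}, {m['given']}"),
        ("RELATION", m["relation"] + "."),
        ("NAME", m["father"]),
        ("NAME", m["mother"]),
        ("RTYPE", m["event"] + "."),
        ("DATE", f"{m['day']} : {m['month']} m : {m['year']}"),
    ]
```

Calling `annotate` on the [BODY] line above yields six labeled spans, starting with `("NAME", "WOODBURY, Benjamin")` and ending with `("DATE", "26 : 2 m : 1668")`. The date is kept in its original form; converting it is a job for the conversion functions.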
18
Annotated Danish Parish
SOURCE – Tvilum Parish Register
[PAGE HEADER] Fødde 1751 page 3
[BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust
19
Populate RDF-data file
Hilton Campbell’s design
PERSON
EVENT
LINKS – PERSON(S) to EVENT
20
EVENT – birth of Rachel
PERSONS – Sarah and Rachel
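The PERSON / EVENT / LINK structure can be pictured as subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library, and it does not reproduce the actual schema: the function `build_triples`, the identifiers, and the predicate names (`type`, `eventType`, `name`, `principalOf`, `participantIn`) are all invented for illustration. The slide does not state Sarah's role in the event, so she is linked with a generic predicate.

```python
def build_triples(event_id, event_type, participants):
    """Build (subject, predicate, object) triples for one event.

    participants: list of (person_id, name, link_predicate) tuples.
    All predicate names here are hypothetical.
    """
    triples = [
        (event_id, "type", "EVENT"),
        (event_id, "eventType", event_type),
    ]
    for person_id, name, link_predicate in participants:
        triples.append((person_id, "type", "PERSON"))
        triples.append((person_id, "name", name))
        # LINK: connects the PERSON to the EVENT.
        triples.append((person_id, link_predicate, event_id))
    return triples


rachel_birth = build_triples(
    "event1",
    "birth",
    [
        ("person1", "Rachel", "principalOf"),
        ("person2", "Sarah", "participantIn"),  # role not stated on the slide
    ],
)
```

Here `rachel_birth` holds eight triples: two describing the event and three for each person, the last of which is the link to the event.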
21
3 Run a SECOND PASS to match individuals and to link families
FORMULATE RULES in Rule Engine language for RDF-data file
• Match individuals
• Check family data
• Link families up
APPLY RULES through the Java Rules API
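As a rough picture of what a "match individuals" rule might look like, here is a sketch in Python rather than in a rule-engine language. The rule itself (same name, birth years within one year) is an assumption and is not taken from the presentation; `PersonMention`, `same_individual`, and `merge_mentions` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonMention:
    """One extracted mention of a person (hypothetical structure)."""
    mention_id: str
    given: str
    surname: str
    birth_year: Optional[int] = None


def same_individual(a, b, max_year_gap=1):
    """Assumed rule: equal names and birth years within max_year_gap."""
    if a.given.casefold() != b.given.casefold():
        return False
    if a.surname.casefold() != b.surname.casefold():
        return False
    if a.birth_year is None or b.birth_year is None:
        return False  # not enough evidence for this rule to fire
    return abs(a.birth_year - b.birth_year) <= max_year_gap


def merge_mentions(mentions):
    """Greedily group mentions that the rule judges to be the same person."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if any(same_individual(mention, other) for other in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters
```

Two mentions of an Isaac Woodbury born in 1680 would merge into one individual, while an Isaac Woodbury born in 1697 would stay separate. A real rule set would need more evidence than name and year, such as parents, place, and spouse.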
22
4 Evaluate and Optimize Results
Evaluate the preliminary results
Optimize the rules
Improve the whole process
23
VALIDATION I
Classification by Record Type:
RECALL = .769
240 entries CORRECTLY LABELED ‘BIRTH’
________________________________
312 entries ACTUAL BIRTHS
PRECISION = .976
240 entries CORRECTLY LABELED ‘BIRTH’
________________________________
246 entries TOTAL LABELED ‘BIRTH’
The higher the number, the better
24
VALIDATION II
Correctness of the Extraction:
RECALL = .95
950 entries CORRECTLY LABELED ‘NAME’
________________________________
1000 entries ACTUAL NAMES
PRECISION = .969
950 entries CORRECTLY LABELED ‘NAME’
________________________________
980 entries TOTAL LABELED ‘NAME’
The higher the number, the better
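The recall and precision figures in Validation I and II follow directly from the counts on the slides. A short sketch of the arithmetic, with the function names `recall` and `precision` chosen here for illustration:

```python
def recall(correctly_labeled, actual_total):
    """Share of the true items that the extraction labeled correctly."""
    return correctly_labeled / actual_total


def precision(correctly_labeled, labeled_total):
    """Share of the labeled items that are actually correct."""
    return correctly_labeled / labeled_total


# Validation I: record type 'BIRTH'
birth_recall = round(recall(240, 312), 3)        # 0.769
birth_precision = round(precision(240, 246), 3)  # 0.976

# Validation II: 'NAME' extraction
name_recall = round(recall(950, 1000), 3)        # 0.95
name_precision = round(precision(950, 980), 3)   # 0.969
```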
25
Isaac WOODBURY Children
1. Robert 4 Jul 1672
2. Mary 6 Oct 1674
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr 1680
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688
8. Nickolas 12 Aug 1688
9. Ann 29 Jun 1689
10. Lidia 1 Feb 1691/2
11. Elisabeth about 1694
12. Isaac 20 Jul 1697
13. Benjamin 20 Aug 1699
26
Isaac WOODBURY SON of HUMPHREY
Mary WILKES MARRIAGE 9 Oct 1671
1. Robert 4 Jul 1672
2. Mary 6 Oct 1674
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr 1680
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688
Isaac WOODBURY SON of NICHOLAS
Elizabeth MARRIAGE ________
1. Nickolas 12 Aug 1688
2. Ann 29 Jun 1689
3. Lidia 1 Feb 1691/2
4. Elisabeth about 1694
5. Isaac 20 Jul 1697
6. Benjamin 20 Aug 1699
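The slides do not say which rule separates the two Isaac Woodbury families. One plausible "check family data" rule is birth spacing: two children of the same mother are unlikely to be born fewer than about nine months apart. The sketch below applies that assumed rule to the thirteen children; `MIN_SPACING_DAYS` and `split_by_birth_spacing` are hypothetical names, dual-dated years such as 1677/8 are read as the later year, and twins are not handled.

```python
from datetime import date

MIN_SPACING_DAYS = 270  # assumed threshold, roughly nine months


def split_by_birth_spacing(children):
    """Split a chronological list of (name, birth date or None) into families.

    A new family starts whenever two consecutive dated births are closer
    together than MIN_SPACING_DAYS. Undated children stay in the current group.
    """
    families = [[]]
    last_dated = None
    for name, born in children:
        if born is not None and last_dated is not None:
            if (born - last_dated).days < MIN_SPACING_DAYS:
                families.append([])
        families[-1].append(name)
        if born is not None:
            last_dated = born
    return families


isaac_children = [
    ("Robert", date(1672, 7, 4)),
    ("Mary", date(1674, 10, 6)),
    ("Christian", date(1678, 3, 3)),    # recorded as 1677/8
    ("Isaac", date(1680, 4, 6)),
    ("Deliverance", date(1683, 2, 1)),  # recorded as 1682/3
    ("Joshua", date(1685, 1, 1)),       # recorded as 1684/5
    ("Elizabeth", date(1688, 1, 17)),
    ("Nickolas", date(1688, 8, 12)),
    ("Ann", date(1689, 6, 29)),
    ("Lidia", date(1692, 2, 1)),        # recorded as 1691/2
    ("Elisabeth", None),                # "about 1694", no exact date
    ("Isaac", date(1697, 7, 20)),
    ("Benjamin", date(1699, 8, 20)),
]

families = split_by_birth_spacing(isaac_children)
```

Elizabeth (17 Jan 1688) and Nickolas (12 Aug 1688) are the only consecutive pair born fewer than 270 days apart, so the list splits into a family of seven and a family of six, matching the grouping shown on the slide. Repeated given names (two Isaacs, Elizabeth and Elisabeth) are a second signal that a real rule set could use.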
27
VALIDATION III
Grouping by FAMILY:
total # merges + splits to correct families after 2nd PASS
________________________________
total # merges + splits to correct families after 1st PASS
The lower the number, the better
28
Optimize the Rules
• Add
• Remove
• Fine-tune
• Change the order
Improve the whole process
Until the metrics no longer improve
29
AUTOMATIC EXTRACTION
Unstructured genealogical data
Searchable annotated genealogical data
Families in RDF-data file
Questions?