BABY ELEPHÃT – BUILDING AN ANALYTICAL BIBLIOGRAPHY FOR A PROSOPOGRAPHY IN EARLY ENGLISH IMPRINT DATA
NUSHRAT KHAN
Oxford-Illinois Digital Libraries Placement Programme
ABOUT EEBO-TCP
- Collaboration between the universities of Oxford and Michigan from 1999-2015
- Early English Texts between 1473-1700
- 25000 texts made available online
- Full text searching available through EEBO-TCP Database
WHY HISTORIC TEXTS ARE INTERESTING
Historic Datasets
Accessibility
Reveal Historical Information
Semantic Web
Technical Interoperability
Future Research
WORKSET CONSTRUCTOR
Enables workset creation from Person, Place, Subject, Genre and Dates parameters (http://eeboo.oerc.ox.ac.uk/)
HOW DOES IT WORK?
Workflow of Publishing Structured Metadata
Metadata extracted from TEI → Data Clean Up → Link the Data
Available Metadata Fields
• Title
• Author Name
• Date (Precise Birth, Precise Death, precise-floruit-from, precise-floruit-to)
• Raw Publication Place
• Raw Publication Date
• Publisher
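As a rough illustration of the "metadata extracted from TEI" step, the sketch below reads a few of these fields from a TEI header using only Python's standard library. The sample header is a hypothetical, heavily simplified stand-in for a real EEBO-TCP header, and the helper name `imprint_fields` and its output keys are assumptions made for this example, not the project's code.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; assumes the source files are namespaced TEI P5.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def imprint_fields(tei_xml):
    """Return the first title, author, publisher, place and date found (hypothetical helper)."""
    root = ET.fromstring(tei_xml)

    def first_text(tag):
        # Take the first matching element anywhere in the document.
        node = root.find(f".//tei:{tag}", TEI_NS)
        return "".join(node.itertext()).strip() if node is not None else None

    return {
        "title": first_text("title"),
        "author": first_text("author"),
        "publisher": first_text("publisher"),
        "pub_place": first_text("pubPlace"),
        "pub_date": first_text("date"),
    }


# Hypothetical, simplified header for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>
<titleStmt><title>A sample title</title><author>Sample, Author</author></titleStmt>
<sourceDesc><biblFull><publicationStmt>
<pubPlace>London :</pubPlace><publisher>Printed by A. B. for C. D.,</publisher><date>1650.</date>
</publicationStmt></biblFull></sourceDesc>
</fileDesc></teiHeader></TEI>"""

print(imprint_fields(SAMPLE))
```

Taking the first match per tag is a shortcut: a real header can contain several titles and dates, so a production version would need more specific paths.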
SAMPLE PUBLISHER DATA
Publisher
By Rycharde Iugge, printer to the Quenes Maiestie,
Printed by I[ohn] C[harlewood] for Iohn Hinde, dwelling in Paules Church-yarde, at the signe of the golden Hinde,
Printed by Benjamin Took and John Crook, and are to be sold by Mary Crook & Andrew Crook ...,
Printed by Peter Smith, and at Saint-Omer at the English College Press],
s.n.],
[By J. Charlewood] for Edward White, dwelling at the little North doore of Paules Church, at the signe of the Gunne,
Imprinted by Richard Field, and are to be sold by Richard Garbrand [, Oxford],
[By I. Jaggard?] for M. S[parke.,
Imprinted by E: G[riffin]: for Iohn Budge, and Ralph Mab,
By [J. King for?] Iohn waley dwellyng in Foster lane,
INSIDE THE DATA
Work
Printed By
Sold At
Printed For
Sold By
Printed At
Punctuation in the data: ? : . [ ] , … “ ”
WORKFLOW
Data Cleaning
Named Entity Extraction (Person – Printed by, Printed for and Sold by)
Store triples and generate RDF
Happy querying!
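The data cleaning step could look something like the sketch below, which strips the editorial brackets, question marks and ellipses seen in the sample publisher strings. The function name `clean_imprint` and the exact set of characters removed are assumptions for illustration; the slides do not say which rules were actually applied.

```python
import re


def clean_imprint(raw):
    """Normalise a raw publisher string (hypothetical cleaning rules)."""
    text = re.sub(r"[\[\]?]", "", raw)                  # drop editorial brackets and query marks
    text = text.replace("...", " ").replace("…", " ")   # drop ellipses
    text = re.sub(r"\s+", " ", text)                    # collapse runs of whitespace
    text = re.sub(r"\s+([,.:;])", r"\1", text)          # no space before punctuation
    return text.strip(" ,;:")                           # trim stray leading/trailing separators


print(clean_imprint("By [J. King for?] Iohn waley dwellyng in Foster lane,"))
```

Removing the brackets discards the cataloguer's signal that a name was supplied or uncertain, so it may be worth keeping the raw string alongside the cleaned one.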
ENTITY RECOGNITION APPROACHES
NLTK Entity Extractor
Regular Expression
REVERB
For automatically identifying and extracting binary relationships from English sentences
Input: Raw text
Output: Argument1, Relation Phrase, Argument2
Example input: Bananas are an excellent source of potassium
Example output: (bananas, be source of, potassium)
OPEN CALAIS
Not as effective on short texts, e.g. Printed by A. Bells
Input text too short
Example sentence: Printed by Melchisedech Bradwood for William Aspley
Cannot detect as a person
NLTK ENTITY RECOGNIZER
Step 1: Extracted all the entities labeled as PERSON for each sentence
work_000001|Rycharde Iugge
work_000003|Pauls
work_000004|Iohn Charlewood
work_000004|Iohn Hinde
work_000005|Ioan Danter
work_000006|Francis Grove
work_000007|Henry Goddus
work_000008|Arthur Iohnson
work_000012|Leonard Lichfield
work_000013|Langly Curtis
work_000014|Benjamin Took
work_000014|John Crook
work_000014|Mary Crook
work_000014|Andrew Crook
work_000015|William Keblewhite
All the entities NLTK can extract for each record (with some limitations)
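A minimal sketch of how such `work_id|name` rows might be produced with NLTK's stock tokenizer, tagger and named-entity chunker. The function names are hypothetical, and `extract_persons` assumes NLTK is installed and its tokenizer, tagger and chunker data have been downloaded; the slides do not show the actual script.

```python
def person_names(chunked):
    """Collect PERSON chunks from the output of nltk.ne_chunk."""
    names = []
    for node in chunked:
        # Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples.
        if hasattr(node, "label") and node.label() == "PERSON":
            names.append(" ".join(word for word, _tag in node.leaves()))
    return names


def extract_persons(work_id, imprint):
    """Return 'work_id|name' rows for one publisher string (hypothetical helper)."""
    import nltk  # imported here so the helper above is usable without NLTK

    tokens = nltk.word_tokenize(imprint)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    return [f"{work_id}|{name}" for name in person_names(chunked)]
```

Calling `extract_persons("work_000014", "Printed by Benjamin Took and John Crook")` would return one row per detected person; which names are found depends on the NLTK models installed.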
LIMITATIONS OF NLTK
• NLTK does not identify initials as names, e.g. A. B.
• Extracts only the surname in expressions like A. Bells, Edw: Allde
• Picks up the word “Printer” as an entity when it is capitalized in a sentence after ‘by’, e.g. Printed by John Bill, Printer to the King's most Excellent Majesty
• In complex sentences containing multiple names, it cannot reliably detect and extract all the names
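The first two limitations are the kind of case a regular expression can cover. The sketch below is one possible pattern for initial-style and abbreviated names that directly follow "by" or "for"; the pattern and the name `initial_style_names` are assumptions for illustration, not the expressions used in the project.

```python
import re

# One or more abbreviated forenames ("A.", "Edw:") followed by an optional surname,
# directly after "by" or "for". Hypothetical pattern, tuned only to the slide examples.
INITIAL_NAME = re.compile(
    r"\b(?:[Bb]y|[Ff]or)\s+((?:[A-Z][a-z]{0,3}[.:]\s*)+(?:[A-Z][a-z]+)?)"
)


def initial_style_names(imprint):
    """Return names written with initials or abbreviations after 'by' or 'for'."""
    return [m.group(1).strip() for m in INITIAL_NAME.finditer(imprint)]


for text in ("Printed by A. B.", "Printed by A. Bells", "Printed by Edw: Allde"):
    print(text, "->", initial_style_names(text))
```

Fully spelled-out names such as John Bill are deliberately left to the entity recognizer, so this only complements it.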
FINDING RELATIONSHIPS WITHIN SENTENCES
("'Printed", 'JJ')
('by', 'IN')
(PERSON Benjamin/NNP Took/NNP)
('and', 'CC')
(PERSON John/NNP Crook/NNP)
('and', 'CC')
('are', 'VBP')
('to', 'TO')
('be', 'VB')
('sold', 'VBN')
('by', 'IN')
(PERSON Mary/NNP Crook/NNP)
('&', 'CC')
(PERSON Andrew/NNP Crook/NNP)
Look for preceding preposition
Separate the entities based on ‘by’ or ‘for’
“You're having a hard time because it's hard. This is really not an easy task to approach. – jonrsharpe Jul 31 '14"
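The "look for the preceding preposition" idea can be sketched as a single pass over the tagged sentence. To stay self-contained, the example uses a simplified input, plain word strings plus `("PERSON", name)` tuples, rather than a real NLTK tree. The role keys, the `assign_roles` name and the rule that a bare "by" means "printed by" are all assumptions.

```python
ROLE_VERBS = {"printed": "printed", "imprinted": "printed", "sold": "sold"}
PREPOSITIONS = {"by", "for"}


def assign_roles(items):
    """Group person names under 'printed_by', 'printed_for', 'sold_by', ..."""
    roles = {}
    verb, prep = None, None
    for item in items:
        if isinstance(item, tuple) and item[0] == "PERSON":
            if prep is not None:
                # Assumption: with no verb seen yet, "By X" is read as "printed by X".
                key = f"{verb or 'printed'}_{prep}"
                roles.setdefault(key, []).append(item[1])
        else:
            word = item.lower().strip("'\"")
            if word in ROLE_VERBS:
                verb, prep = ROLE_VERBS[word], None  # a new verb resets the preposition
            elif word in PREPOSITIONS:
                prep = word
    return roles


sentence = [
    "Printed", "by", ("PERSON", "Benjamin Took"), "and", ("PERSON", "John Crook"),
    "and", "are", "to", "be", "sold", "by",
    ("PERSON", "Mary Crook"), "&", ("PERSON", "Andrew Crook"),
]
print(assign_roles(sentence))
```

A person stays attached to the most recent verb and preposition, which is why both names joined by "and" or "&" end up under the same role.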
DATA REFINING
Categories: Printed & Sold by, Sold By, Printed By, Printed For
• De-duplicate the ‘Sold by’
• Put back the ones in ‘Printed and Sold by’
• Extracted separately using ‘Regex’
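One possible reading of this refinement step, sketched with Python's `re` module: pull the names out of a "Printed and Sold by" phrase with a dedicated pattern, de-duplicate the "Sold by" names, then add the "Printed and Sold by" names back. The pattern and both function names are hypothetical.

```python
import re

# Matches "Printed and sold by ..." or "Printed & Sold by ..." up to the next comma or semicolon.
PRINTED_AND_SOLD = re.compile(
    r"printed\s*(?:&|and)\s*sold\s+by\s+(?P<names>[^,;]+)", re.IGNORECASE
)


def printed_and_sold_by(imprint):
    """Return the names that follow a 'Printed and Sold by' phrase."""
    match = PRINTED_AND_SOLD.search(imprint)
    if not match:
        return []
    parts = re.split(r"\s*(?:&|\band\b)\s*", match.group("names").strip())
    return [part for part in parts if part]


def merge_sellers(sold_by, printed_and_sold):
    """De-duplicate 'sold by' names, then add back the 'printed and sold by' names."""
    return list(dict.fromkeys([*sold_by, *printed_and_sold]))
```

`dict.fromkeys` keeps the first occurrence of each name and preserves order, so the merged list stays stable between runs.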
GENERATING UNIQUE URI
Ideal case: assign the same unique URI to every mention of the same person
Exception in this case:
• Few authoritative sources to refer to
• Time consuming validation
• Very limited information about each person available
Assigned a unique URI to every instance
Python uuid module – uuid4() function
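A minimal sketch of per-instance URI minting with `uuid.uuid4()`. The base namespace below is a placeholder, not the project's real URI scheme, and `mint_instance_uri` is a hypothetical helper name.

```python
import uuid

BASE_URI = "http://example.org/eeboo/person/"  # hypothetical namespace, not the project's


def mint_instance_uri(base=BASE_URI):
    """Return a fresh URI built from a random (version 4) UUID."""
    return base + str(uuid.uuid4())


# Two mentions of the same name still get two different URIs.
mentions = ["Mary Crook", "Mary Crook"]
uris = [mint_instance_uri() for _ in mentions]
print(uris)
```

Because every mention gets its own identifier, linking the mentions that refer to the same person remains a separate, later reconciliation step.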
WORKING WITH ONTOLOGY
Checked existing ontologies for ‘Printed by’ and ‘Printed for’ relationships: MODS, MADS, BibFrame, etc.
EEBOO Ontology
Modify the existing ontology to define the new relationships
Work
Author
Printed By
Printed For
Sold By
STORING TRIPLES AND GENERATING RDF
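The content of this slide did not survive in the transcript. As a rough idea of what the output could look like, the sketch below writes a work-to-person relationship as N-Triples lines using plain string formatting. The namespace, the `soldBy` property name and the use of `rdfs:label` for the name are assumptions, not the EEBOO ontology's actual terms.

```python
EX = "http://example.org/eeboo/"  # hypothetical namespace
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"


def uri(local):
    """Wrap a local name in the example namespace as an N-Triples IRI."""
    return f"<{EX}{local}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def triples_for(work_id, role, person_id, name):
    """Return two N-Triples lines: work --role--> person, person --label--> name."""
    person = uri(f"person/{person_id}")
    return [
        f"{uri('work/' + work_id)} {uri(role)} {person} .",
        f"{person} {RDFS_LABEL} {literal(name)} .",
    ]


print("\n".join(triples_for("work_000014", "soldBy", "p1", "Mary Crook")))
```

In practice an RDF library would handle serialization and escaping; the hand-written version only makes the triple structure visible.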
QUERYING ON THE DATA 1
Top 20 Publishers
Top 20 Printed for
Top 20 Sold By
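A "top 20" ranking of this kind boils down to counting names per role. The sketch below does that over in-memory `(work, role, name)` tuples as a simplified stand-in for a query against the triple store; the tuple layout and role strings are assumptions, and the role assignments in the sample rows are illustrative.

```python
from collections import Counter


def top_names(records, role, n=20):
    """Return the n most frequent names for one role as (name, count) pairs."""
    counts = Counter(name for _work, rec_role, name in records if rec_role == role)
    return counts.most_common(n)


# Illustrative rows; work IDs and names are taken from the extraction sample.
records = [
    ("work_000014", "printed_by", "Benjamin Took"),
    ("work_000014", "printed_by", "John Crook"),
    ("work_000014", "sold_by", "Mary Crook"),
    ("work_000014", "sold_by", "Andrew Crook"),
    ("work_000004", "printed_by", "Iohn Charlewood"),
]
print(top_names(records, "printed_by"))
```

Since each instance carries its own URI, counting has to rely on the name string, so spelling variants of one person are counted separately.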
QUERYING ON THE DATA 2
The sellers for the works published by Henri Hills
Both Printed and Sold by Henri Hills
Sellers who worked with Henri Hills: Will Larner, Jane Underhill, Francis Smith
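The logic of such a query can be sketched in two steps: find the works printed by a given person, then collect who sold those works. The example runs over in-memory `(work, role, name)` tuples instead of the real triple store, and the function name and role strings are assumptions.

```python
def sellers_for_printer(records, printer):
    """Return the sorted names of other people who sold works printed by `printer`."""
    printed_works = {
        work for work, role, name in records
        if role == "printed_by" and name == printer
    }
    return sorted({
        name for work, role, name in records
        if role == "sold_by" and work in printed_works and name != printer
    })
```

Works where the printer is also listed under "sold by" correspond to the "both printed and sold by" case and are excluded here by the `name != printer` check.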
FUTURE DIRECTION
• Train NLTK to capture the names properly
• Extract specific place names from the publisher field, e.g. sold at Golden Hinde
• For initials, work out how to identify the full names, e.g. whether R. Charles is Robert Charles or Ruth Charles, possibly with help from a domain expert
• Analyze how name expressions have changed over time
• Identify the authors using authoritative sources and domain-specific knowledge, e.g. London Book Trades Index, British Book Trade Index
• Analyze and visualize the data by mapping
GRATITUDE
• Terhi Nurmikko-Fuller
• David M. Weigl
• Professor David De Roure
• Kevin Page
• Pip Willcox
And everybody else at OeRC!