15 July 2007 (c) M.Greengrass
Data Extraction Across Multiple Text Datasets for Arts and Humanities Research
Mark Greengrass, University of Sheffield
Armadillo
Response to the RePAH questionnaire (2005-6), aggregate of all Arts and Humanities respondents (RePAH: A User Requirements Analysis Report (2006), p. 102).
RePAH: A User Requirements Analysis… (2006), p. 109
Some Distinctive Features of Historians’ Approach to their Evidence
• Promiscuous range of sources consulted
• Firm distinction between primary and secondary sources
• Complex dialogue between existing historiography and constitutive source materials
• Reiterative process of open interrogation of source materials
• A ‘coherent’ narrative is (generally) one composed from more than one source
Historians’ Database Challenge
• Growing number of (mainly text-based) historical datasets in electronic media, furnished from a wide variety of providers
• These datasets utilise a variety of different historical sources
• They contain varying amounts of encoded information (dependent on the historical question being asked by the PI, and on the constraints of the particular source being used)
• The information is encoded in different ways
• The delivery formats used also vary widely
Sources
• Metropolitan London in the 1690s (IHR)
• House of Lords Journals (BOPCRIS)
• St. Martin’s Settlement Exams Index (WESTCAT)
• The Marine Society Registers
• Collage image database (Guildhall Library)
• Eighteenth Century Fire Insurance Policies
• Selected Criminal Records (TNA)
• John Strype’s “Survey…”
• Prerogative Court of Canterbury Wills
• The Westminster Historical Database
• Harben’s Dictionary of London
• The Proceedings of the Old Bailey
• AHDS Deposits
• http://www.motco.com
The Old Bailey Proceedings: XML
<trial><p>
<person> <defend gender="m"><given>William</given><surname>Mawn</surname></defend> </person> was Tryed for <off> <theft type="animals">stealing a Bay Gelding price 20 l.</theft> </off> from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim> out of Berkshire on the <cd>25th of April</cd>. The Witness swore that the Horse was found in the Prisoner's custody in Smithfield, which the Prosecutor owned to be his. The Prisoner could not produce any Evidence to prove that he came honestly by the Horse only produc'd a Felonious person, that was no stranger to Newgate, who went under the Notion of his Man, he declared that the Prisoner bought the Horse upon the Road beyond Uxbridge. The Prisoners being found in several faultering stories, he was found <verdict> <guilty>Guilty</guilty> </verdict>.</p> <p> <punish><death><note type="editorial">[Death. See summary.]</note></death></punish> </p> </trial>
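Because the Old Bailey text is tagged, the encoded facts can be read straight off the markup. The following is a minimal sketch using Python's standard-library XML parser on an abridged copy of the record above; the function names and the choice of output fields are assumptions made for illustration, not part of the Old Bailey project's own tooling.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the trial record shown above (same element names).
TRIAL_XML = """<trial><p>
<person> <defend gender="m"><given>William</given><surname>Mawn</surname></defend> </person>
was Tryed for <off> <theft type="animals">stealing a Bay Gelding price 20 l.</theft> </off>
from one <victim gender="m"><given>Thomas</given><surname>Lane</surname></victim>
on the <cd>25th of April</cd>. He was found <verdict> <guilty>Guilty</guilty> </verdict>.</p>
<p> <punish><death><note type="editorial">[Death. See summary.]</note></death></punish> </p>
</trial>"""


def full_name(element):
    """Join the <given> and <surname> children of a person-like element."""
    return f"{element.findtext('given')} {element.findtext('surname')}"


def summarise_trial(xml_text):
    """Return the encoded facts of one <trial> as a plain dictionary."""
    trial = ET.fromstring(xml_text)
    offence = trial.find(".//off/*")        # e.g. <theft type="animals">
    verdict = trial.find(".//verdict/*")    # e.g. <guilty>
    punishment = trial.find(".//punish/*")  # e.g. <death>
    return {
        "defendant": full_name(trial.find(".//defend")),
        "victim": full_name(trial.find(".//victim")),
        "offence": offence.tag,
        "offence_type": offence.get("type"),
        "date": trial.findtext(".//cd"),
        "verdict": verdict.tag,
        "punishment": punishment.tag,
    }


record = summarise_trial(TRIAL_XML)
print(record)
```

Here the element names themselves (theft, guilty, death) carry the classification, so the sketch reports tag names rather than the running text.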
Canterbury Wills: Delimited Text
2530553 W Agnes Kervill or Kervytt
2530553 W Andrew Bridham London
2530553 W Andrew Pykeman London
2530553 W Austin Hawkyns
2530553 W Cecilia Foster
2530553 W Christian Chepman
2530553 W Christian Cust
2530553 W David Syadine Bristol,
2530553 W Edmund Bybbesworth
2530553 W Edward Wellys Hadley,
2530553 W Ellen Lacy Widow Saint Pe
2530553 W Gerard Heshull
2530553 W Guy Shuldham
2530553 W Helmingus Leget
2530553 W Henry Porter
2530553 W Henry Warlegh Keynesha
2530553 W Henry Wellis
2530553 W Hugh Caundyssh
2530553 W Hugh Geynesburgh Rector
2530553 W Isabelle Woodhill
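By contrast, the wills index carries very little explicit encoding. A rough sketch of reading such lines is given below. It assumes (this is not confirmed by the slide) that each record is a numeric reference, a one-letter code, and then free text holding the name and sometimes a place or status. The original column boundaries are lost in this listing, so the sketch does not try to separate name from place.

```python
import re

# Three records copied from the listing above.
SAMPLE = """2530553 W Agnes Kervill or Kervytt
2530553 W Andrew Bridham London
2530553 W Henry Porter"""

# Assumed layout: digits, whitespace, one capital letter, whitespace, free text.
RECORD = re.compile(r"^(?P<ref>\d+)\s+(?P<code>[A-Z])\s+(?P<text>.+)$")


def parse_line(line):
    """Split one index line into its three assumed fields, or return None."""
    match = RECORD.match(line.strip())
    return match.groupdict() if match else None


rows = [parse_line(line) for line in SAMPLE.splitlines()]
for row in rows:
    print(row)
```

Everything a researcher actually wants (surname variants, parish, status) remains buried in the free-text field, which is the contrast with the tagged trial record.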
The Issues
Can the technologies developed for the ‘semantic web’ help us:
• To structure the (different) encoded information across varying sources in a way that the user community will find (research) fruitful?
• To understand the way in which these different sources relate to one another, such that they can be used in an intelligent fashion?
• To ‘bootstrap’ relevant historical/semantic information from one source, by using another?
15 July 2007 (c) O. Corcho (with acknowledgement)
Data ‘Sharing’ and Data ‘Re-use’
Reuse means to build new applications, assembling components already built
Sharing is when different applications use the same resources
15 July 2007 (c) O. Corcho (with acknowledgement)
Interaction Problem
Representing Knowledge for the purpose of solving some problem
is strongly affected by the nature of the problem
and the inference strategy to be applied to the problem
Bylander, T. and Chandrasekaran, B. Generic tasks in knowledge-based reasoning: the right level of abstraction for knowledge acquisition. In B. R. Gaines and J. H. Boose (eds), Knowledge Acquisition for Knowledge-Based Systems, 65-77. London: Academic Press, 1988.
Problem Solving Methods: describe the reasoning process of a dataset (‘Knowledge-Based System’) in a domain-independent manner
Ontologies: describe domain knowledge in a generic way and provide agreed understanding of a domain
15 July 2007 (c) O. Corcho (with acknowledgement)
Definitions of an Ontology
1. “An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary”
Neches R, Fikes RE, Finin T, Gruber TR, Senator T, Swartout WR (1991) Enabling technology for knowledge sharing. AI Magazine 12(3):36–56
2. “An ontology is an explicit specification of a conceptualization”
Gruber TR (1993a) A translation approach to portable ontology specifications. Knowledge Acquisition 5(2):199–220
3. “An ontology is a formal, explicit specification of a shared conceptualization”
Studer R, Benjamins VR, Fensel D (1998) Knowledge Engineering: Principles and Methods. Data & Knowledge Engineering 25(1–2):161–197
4. “A logical theory which gives an explicit, partial account of a conceptualization”
Guarino N, Giaretta P (1995) Ontologies and Knowledge Bases: Towards a Terminological Clarification. In: Mars N (ed) Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS’95). University of Twente, Enschede, The Netherlands. IOS Press, Amsterdam, The Netherlands, pp 25–32
5. “A set of logical axioms designed to account for the intended meaning of a vocabulary”
Guarino N (1998) Formal Ontology in Information Systems. In: Guarino N (ed) 1st International Conference on Formal Ontology in Information Systems (FOIS’98). Trento, Italy. IOS Press, Amsterdam, pp 3–15
Key Components of an Ontology
• Concepts are organized in taxonomies
• Relations: R: C1 x C2 x ... x Cn-1 x Cn
  e.g. Subclass-of: Concept1 x Concept2
  e.g. Connected to: Component1 x Component2
• Functions: F: C1 x C2 x ... x Cn-1 --> Cn
  e.g. Mother-of: Person --> Woman
  e.g. Price of a used car: Model x Year x Kilometers --> Price
• Axioms: sentences which are always true
• Instances: elements
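The five components can be made concrete with a toy data structure. The sketch below is an illustration only: the class, its field names and the element `person_001` are hypothetical, and a real ontology would normally be written in a dedicated ontology language, not in Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy container for the components listed above (illustrative only)."""
    subclass_of: dict = field(default_factory=dict)  # concept -> parent concept
    relations: dict = field(default_factory=dict)    # name -> (C1, ..., Cn)
    functions: dict = field(default_factory=dict)    # name -> ((C1, ..., Cn-1), Cn)
    instances: dict = field(default_factory=dict)    # element -> concept

    def is_a(self, concept, ancestor):
        """Walk up the taxonomy from `concept` looking for `ancestor`."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.subclass_of.get(concept)
        return False

    def instance_of(self, element, concept):
        """True if `element` is recorded under `concept` or one of its subclasses."""
        return self.is_a(self.instances.get(element), concept)


onto = Ontology()
onto.subclass_of["Woman"] = "Person"                    # concepts in a taxonomy
onto.relations["Subclass-of"] = ("Concept", "Concept")  # R: C1 x C2
onto.functions["Mother-of"] = (("Person",), "Woman")    # F: C1 --> C2
onto.instances["person_001"] = "Woman"                  # hypothetical element


def mother_of_axiom(o):
    """Axiom: whatever Mother-of returns is also a Person."""
    _, result = o.functions["Mother-of"]
    return o.is_a(result, "Person")


print(onto.instance_of("person_001", "Person"))
print(mother_of_axiom(onto))
```

Both checks print `True`: the instance inherits membership of Person through the taxonomy, and the axiom holds because Woman is a subclass of Person.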
15 July 2007 (c) M.Greengrass, after Corcho
Semantic Continuum and Formality
• Shared human consensus: implicit (e.g. language)
• Text descriptions: informal [explicit] (e.g. dictionaries)
• Semantics hardwired; used at runtime: formal [for humans] (e.g. library catalogues)
• Semantics processed and used at runtime: formal [for machines] (e.g. see below)
http://www.vicodi.org
Primary sources (historical documents; images; artefacts) in electronic media
Web-based ‘secondary’ historical writing
‘top-down ontologies’ (generated from discipline-accepted taxonomies)
‘bottom-up ontologies’ (generated from a representative sample of canonical data)
‘middle-out ontologies’ (generated by intelligent iteration)
John Wilkins, An Essay towards a Real Character and a Philosophical Language (1668)
Armadillo – a Semantic Agent
Retrieves information according to pre-agreed ontologies
Takes account of deviations in spelling, typographic formatting and contextual information
Uses delimited fields and tagged data as ‘oracles’: these provide firm instantiations of elements in an ontology, which can then be applied to electronic materials that have no such structure
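The ‘oracle’ idea can be sketched in a few lines. What follows is not Armadillo's implementation: it is a stand-in built on approximate string matching from Python's standard library, in which surnames already tagged in one source (here, the two from the Old Bailey record) are used to spot variant spellings in text that has no markup. The untagged sentence and the similarity cutoff are invented for illustration.

```python
import difflib
import re

# Names taken from a tagged source act as the 'oracle'.
ORACLE_SURNAMES = ["Mawn", "Lane"]

# Hypothetical untagged text with a variant spelling.
UNTAGGED = "Wm. Mawne of Smithfield, and Tho. Lane of Berkshire"


def find_candidates(text, known_names, cutoff=0.8):
    """Pair each word in `text` with the oracle name it closely resembles."""
    lookup = {name.lower(): name for name in known_names}
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        close = difflib.get_close_matches(
            token.lower(), list(lookup), n=1, cutoff=cutoff
        )
        if close:
            hits.append((token, lookup[close[0]]))
    return hits


matches = find_candidates(UNTAGGED, ORACLE_SURNAMES)
print(matches)
```

With these inputs the variant ‘Mawne’ is paired with the oracle name ‘Mawn’ and ‘Lane’ with itself. A real agent would also have to weigh context (dates, places, occupations) before treating two mentions as the same person.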
<p>CENTRAL CRIMINAL COURT,</p><p>Held on Monday, December 17th, 1866, and following days,</p><p><sc>BEFORE THE RIGHT HON.</sc> <lc><name role="judiciary" given="THOMAS" surname="GABRIEL" sex="m" age="na">THOMAS GABRIEL</name>, LORD MAYOR</lc> of the City of London; Sir <sc><name role="judiciary" given="JOHN" surname="MELLOR" sex="m" age="na">JOHN MELLOR</name></sc>, Knt., one of the Justices of Her Majesty's Court of Queen's Bench; <sc><name role="judiciary" given="WILLIAM TAYLOR" surname="COPELAND" sex="m" age="na">WILLIAM TAYLOR COPELAND</name></sc>, Esq., <sc><name role="judiciary" given="THOMAS" surname="CHALLIS" sex="m" age="na">THOMAS CHALLIS</name></sc>, Esq., <sc>THOMAS QUESTED FINNIS</sc>, Esq., Sir <sc><name role="judiciary" given="ROBERT WALTER" surname="CARDEN" sex="m" age="na">ROBERT WALTER CARDEN</name></sc>, Knt., and <sc><name role="judiciary" given="WILLIAM" surname="LAWRENCE" sex="m" age="na">WILLIAM LAWRENCE</name></sc>, Esq., Aldermen of the said City;
Automated Text-Mining, used for tagging purposes in Central Criminal Court records
<p>CENTRAL CRIMINAL COURT,</p><p>Held on Monday, July 22nd, 1912, and following days.</p><p>Before the Right Hon. Sir <lc>THOMAS BOOR CROSBY, M.D., LORD MAYOR</lc> of the said City of London; the Right Hon. Lord <sc>COLERIDGE</sc>, one of the Justices of His Majesty's High Court; Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" age="na">HENRY KNIGHT</name></sc>, Knight; Sir <sc><name role="judiciary" given="HORATIO" surname="DAVIES" sex="m" age="na">HORATIO DAVIES</name></sc>, K.C.M.G.; Sir <sc><name role="judiciary" given="JOHN" surname="POUND" sex="m" age="na">JOHN POUND</name></sc>, Bart.; Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; Sir <sc><name role="judiciary" given="CHARLES" surname="JOHNSTON" sex="m" age="na">CHARLES JOHNSTON</name></sc>, Knight; and Sir <sc>HORACE B. MARSHALL</sc>, Knight, LL.D., Aldermen of the said City; Sir <sc>FORREST FULTON</sc>, Knight, K.C., Recorder of the said City; Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C., Common Serjeant of the said City;
Automated Text-Mining, used for tagging purposes in Central Criminal Court records – with less success!
Not identified
Not identified
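One simple way to locate the misses is to list the small-caps spans that contain no name tag. The sketch below does this for an abridged copy of the 1912 passage; the regular expression and function name are assumptions for illustration, and the check is deliberately coarse.

```python
import re

# Abridged from the 1912 passage above.
SNIPPET = (
    'Sir <sc><name role="judiciary" given="HENRY" surname="KNIGHT" sex="m" '
    'age="na">HENRY KNIGHT</name></sc>, Knight; '
    'Sir <sc>GEORGE W. TRUSCOTT</sc>, Bart.; '
    'Sir <sc>HORACE B. MARSHALL</sc>, Knight, LL.D.; '
    'Sir <sc>FK. ALBERT BOSANQUET</sc>, K.C.'
)

# A <sc>...</sc> or <lc>...</lc> span; \1 forces the closing tag to match.
SPAN = re.compile(r"<(sc|lc)>(.*?)</\1>", re.DOTALL)


def untagged_spans(markup):
    """Return the text of sc/lc spans that contain no <name> element."""
    return [body for _tag, body in SPAN.findall(markup) if "<name" not in body]


missed = untagged_spans(SNIPPET)
print(missed)
```

On this excerpt the three names with an initial or abbreviation are reported as untagged. Not every such span is a missed name (the 1866 passage, for example, sets ‘BEFORE THE RIGHT HON.’ in small caps), so the output is a list of candidates for checking, not a count of errors.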