Enabling Complex Analysis of Large-Scale Digital Collections: Humanities Research, High Performance...
Melissa Terras, James Baker, James Hetherington, David Beavan, Martin Zaltz Austwick, Anne Welsh, Helen O'Neill, Will Finley, Oliver Duke-Williams, and Adam Farquhar
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exceptions: quotations, embeds from external sources, logos, and marked images.
Enabling Complex Analysis of Large-Scale Digital Collections
Humanities Research, High Performance Computing, and transforming access to British Library Digital Collections
Data, code, viz: github.com/UCL-dataspring
Overview
Barriers to computational approaches:
● fragmentation of communities, resources, and tools;
● lack of interoperability;
● lack of technical skills
Method
60k books from the British Library:
● 17th-19th century
● 224GB compressed ALTO XML
● UCL High Performance Computing
● 4 humanities researchers
● Research questions to computational queries
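As a rough illustration of what turning ALTO XML into queryable text can involve, the sketch below pulls the words out of a single ALTO page. It relies on the ALTO convention that each recognised word is a String element carrying a CONTENT attribute. The function name alto_words and the SAMPLE page are invented for this example, and this is not necessarily how the project's own code reads the corpus.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Return the words of one ALTO page, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        # Drop any namespace prefix so the sketch works across ALTO versions.
        if elem.tag.rsplit("}", 1)[-1] == "String":
            content = elem.get("CONTENT")
            if content:
                words.append(content)
    return words


# Invented sample page (assumption): real pages are far larger.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The"/><SP/><String CONTENT="cholera"/>
    </TextLine>
    <TextLine>
      <String CONTENT="spread"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_words(SAMPLE))
```

On a cluster, a function of this kind would typically be applied to each page file independently, which is what makes the workload easy to spread across many nodes.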
UCL’s Legion Cluster supercomputing facility. Photo: Tony Slade, © UCL Creative Media Services (all rights reserved)
Results
It worked!
● Case Study 1: History of Medicine
● Case Study 2: History of Images
● Technical barriers
● Search ‘recipes’
Case Study 1: History of Medicine
Oliver Duke-Williams, UCL
Technical
Major sticking point:
● Using humanities data on HPCs
Best practice recommendations:
● Derived datasets
● Normalisations
● Documenting decisions
● Fixed/defined dataset
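One way to read the recommendations on normalisations, documented decisions, and derived datasets together is to store the list of normalisation steps alongside the derived data itself. The sketch below does that. The specific steps (long s replacement, lowercasing, punctuation stripping), the names normalise and derive, and the source identifier are all assumptions made for illustration, not the project's documented choices.

```python
import json
import string

# Every normalisation decision is written down once and shipped with the data.
NORMALISATION_STEPS = [
    "replace long s (ſ) with s",
    "lowercase",
    "strip leading and trailing ASCII punctuation",
]


def normalise(token):
    """Apply the steps listed in NORMALISATION_STEPS, in order."""
    token = token.replace("ſ", "s")
    token = token.lower()
    return token.strip(string.punctuation)


def derive(tokens, source_id):
    """Build a derived dataset record that documents how it was made."""
    cleaned = [normalise(tok) for tok in tokens]
    return {
        "source": source_id,
        "normalisation": NORMALISATION_STEPS,
        "tokens": [tok for tok in cleaned if tok],  # drop tokens emptied by stripping
    }


# "hypothetical-book-0001" is a made-up identifier.
record = derive(["The", "Phyſician,", "--", "Cholera."], "hypothetical-book-0001")
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Keeping the decisions inside the record means a later reader of the derived dataset can see what was done to the text without needing the original code.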
Generic searches:
● for all variants of a word
● that return keywords in context traced over time
● for a word or phrase that ignore another word or phrase
● for a word when in close proximity to a second word
● based on image metadata
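The text-based searches listed above can be sketched in a few lines each. The example below is a minimal illustration over an invented two-book corpus keyed by year. All function names are hypothetical, and "ignore another word or phrase" is interpreted here as "the text contains one word and not the other", which is an assumption. Image metadata searches are not covered.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def variants(tokens, pattern):
    """Distinct tokens fully matching a regular expression (spelling variants)."""
    rx = re.compile(pattern)
    return sorted({t for t in tokens if rx.fullmatch(t)})


def kwic(tokens, keyword, width=2):
    """Keyword in context: (left context, keyword, right context) per hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def near(tokens, first, second, window=5):
    """True if `first` occurs within `window` tokens of `second`."""
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        tok == first and any(abs(i - j) <= window for j in second_positions)
        for i, tok in enumerate(tokens)
    )


def has_without(tokens, wanted, unwanted):
    """True if the text contains `wanted` but not `unwanted`."""
    return wanted in tokens and unwanted not in tokens


# Invented sample corpus (assumption): year -> text.
BOOKS = {
    1832: "The cholera spread through the town",
    1849: "Cholera and fever in the parish; the colera returned",
}

# Keywords in context, traced over time.
for year in sorted(BOOKS):
    print(year, kwic(tokenize(BOOKS[year]), "cholera"))
```

Each function takes a plain token list, so the same "recipe" can be run unchanged over one book or mapped over the whole collection.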
Conclusions
Recommendations for enabling complex analysis of large-scale digital collections in the humanities:
● 1. Invest in research software engineer capacity to deploy and maintain openly licensed large-scale digital collections from across the GLAM sector in order to facilitate research in the arts, humanities and social and historical sciences.
● 2. Invest in training library staff to run these initial queries in collaboration with humanities faculty, to support work with subsets of data that are produced, and to document and manage resulting code and derived data.
Special thanks to UCL Research Computing and British Library Digital Research for their hard work and support!