Large-Scale Computational Research in Arts & Humanities
Transcript of Large-Scale Computational Research in Arts & Humanities
Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media
John Coleman, Faculty of Linguistics, Philology and Phonetics, University of Oxford
The future of research? 19/10/10
• What am I talking about? (Show and tell)
• How is it affecting the A/H research landscape?
• Implications (cost, strategy etc)
Once upon a time ...
There she weaves by night and day
A magic web with colours gay
1967: writing
Important/jj as/cs was/bedz Mr./np O'Donnell's/np$ essay/nn ,/, his/pp$ thesis/nn is/bez so/ql restricting/jj as/cs to/to deny/vb Faulkner/np the/at stature/nn which/wdt he/pps obviously/rb has/hvz ./. He/pps and/cc also/rb Mr./np Cowley/np and/cc Mr./np Warren/np have/hv fallen/vbn to/in the/at temptation/nn which/wdt besets/vbz many/ap of/in us/ppo to/to read/vb into/in our/pp$ authors/nns --/-- Nathaniel/np Hawthorne/np ,/, for/in example/nn ,/, and/cc Herman/np Melville/np --/-- protests/vbz against/in modernism/nn ,/, material/jj progress/nn ,/, and/cc science/nn which/wdt are/ber genuine/jj protests/nns of/in our/pp$ own/jj but/cc may/md not/* have/hv been/ben theirs/pp$$ ./. Faulkner's/np$ total/nn works/nns today/nr ,/, and/cc in/in fact/nn those/dts of/in his/pp$ works/nns which/wdt existed/vbd in/in 1946/cd when/wrb Mr./np Cowley/np made/vbd his/pp$ comment/nn ,/, or/cc in/in 1939/cd ,/, when/wrb Mr./np O'Donnell/np wrote/vbd his/pp$ essay/nn ,/, reveal/vb no/at such/jj simple/jj attitude/nn toward/in the/at South/nr-tl ./. If/cs he/pps is/bez a/at traditionalist/nn ,/, he/pps is/bez an/at eclectic/jj traditionalist/nn ./. If/cs he/pps condemns/vbz the/at recent/jj or/cc the/at present/nn ,/, he/pps condemns/vbz the/at past/nn with/in no/ql less/ap force/nn ./. If/cs he/pps sees/vbz the/at heroic/jj in/in a/at Sartoris/np or/cc a/at Sutpen/np ,/, he/pps sees/vbz also/rb --/-- and/cc he/pps shows/vbz --/-- the/at blind/jj and/cc the/at mean/jj ,/, and/cc he/pps sees/vbz the/at Compson/np family/nn disintegrating/vbg from/in within/rb ./. He/pps is/bez not/* one/cd to/to remain/vb more/ql comfortably/rb and/cc
XML TEI: still writing
<inscript id="halu0001"><sourceDesc><physObj type="ashlar" engrave="engraved" color=""><desc>Negev. Elusa (Haluza). 100-299 CE. Limestone ashlar dressed as a tabula ansata. </desc><letterHgt min="1.7" max="3.5">1.7-3.0 cm Aramaic, 2.5-3.5 cm Greek</letterHgt><dateRange calendar="Gregorian" from="100" to="299">100 CE to 299 CE<note>Based on the Greek and Palmyrene script.</note></dateRange><discovery><place region="Negev" city="Elusa (Haluza)" site="Foundation of an abandoned Beduin structure" locus=""><note>The inscription consists of two lines of Greek followed by one of
VRE for the Study of Documents and Manuscripts: writing
A trial of Kathryn Sutherland’s Jane Austen manuscript project; supported by CCH, King's College London but here ported to our VRE-SDM demonstrator
From writing to video
2005
2008
YouTube surpasses Yahoo as world’s #2 search engine
Researchers of the future
• Just as comfortable creating multimodal online content (video, games, websites etc.) as writing essays
• Online video is interesting; TV is boring (passive)
Texts
Здесь Будет ]ружен Памятник ]божденный т[
Texts
Здесь Будет ]ружен Памятник ]божденный т[руд]
Here will be erected a monument to liberated labour
Vocal tract movements in speech
Resonance tuning in soprano singing and vocal tract shaping
Erik Bresch, Speech Production and Articulation kNowledge Group, University of Southern California
Mining a Year of Speech: a “Digging into Data” project
http://www.phon.ox.ac.uk/mining/
John Coleman, Greg Kochanski, Ladan Ravary, Sergio Grau
Oxford University Phonetics Laboratory
Lou Burnard
Jonnie Robinson, The British Library
with support from
Mark Liberman, Jiahong Yuan, Chris Cieri
Phonetics Laboratory and Linguistic Data Consortium, UPenn
with support from NSF
The “Digging into Data” challenge
“The creation of vast quantities of Internet accessible digital data and the development of techniques for large-scale data analysis and visualization have led to remarkable new discoveries in genetics, astronomy and other fields ...
With books, newspapers, journals, films, artworks, and sound recordings being digitized on a massive scale, it is possible to apply data analysis techniques to large collections of diverse cultural heritage resources as well as scientific data.”
The “Year of Speech”
• A grove of corpora, held at various sites with a common indexing scheme and search tools
• US English material: 2,240 hrs of telephone conversations
• 1,255 hrs of broadcast news
• As-yet unpublished talk show conversations (1000 hrs), Supreme Court oral arguments (5000 hrs), political speeches and debates
• British English: Spoken part of the British National Corpus, 10 million words of transcribed speech
• Recordings recently digitized in collaboration with the British Library
C-SPAN
• US cable TV channel covering Senate/House proceedings, committees, current affairs discussion shows
• 20-year archive of publicly open video
• Large parts of the proceedings officially transcribed and published
Digging for audio: kinds of questions someone might ask
1. When did X say Y? For example, "find the video clip where George Bush said 'read my lips'."
2. How do arguments work? For example, how do different people handle interruptions?
3. How frequent are linguistic features such as phrase-final rising intonation ("uptalk") across different age groups, genders, social classes, and places?
4. Who says “ask” and who says “aks”?
British National Corpus
• Collected in early 1990s by consortium of dictionary makers (Collins, Longman, OUP) and academics (Oxford, Lancaster, Oslo-Bergen)
• 100m word text (XML) corpus, of which 10m is transcribed speech
• c. 4.2 m words is demographically-sampled recordings of unplanned conversations
• British Market Research Bureau loaned Sony Walkmans to recruits
• c. 5 m words is “context-governed” speech (educational, business, public speeches/meetings, 'leisure' – sports, clubs, broadcast, phone-ins etc)
• Transcribed by audio typists and structured in XML database with rich metadata annotations
A few speech samples from the BNC
• A domestic drama
• Political commentary/current affairs
• Are dogs people too?
Practicalities
• In order to be of much practical use, such very large corpora must be indexed at word and phoneme level
• All included speech corpora must therefore have associated text transcriptions
• We use the Penn Phonetics Laboratory Forced Aligner to associate each word and segment with the corresponding start and end points in the sound files
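A word-level index of the kind these points call for might be sketched as follows. This is only an illustration, not the project's data model: the `AlignedWord` record, the file names and the times are all hypothetical, and the aligner's real output format is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    """One word with its start and end time (in seconds) in a sound file."""
    word: str
    sound_file: str  # hypothetical file name
    start: float
    end: float


def build_index(aligned_words):
    """Map each lower-cased word to every place it occurs."""
    index = defaultdict(list)
    for item in aligned_words:
        index[item.word.lower()].append(item)
    return index


def find_phrase(aligned_words, phrase):
    """Return (sound_file, start, end) for each occurrence of a phrase.

    Assumes aligned_words is in time order within each sound file.
    """
    target = [w.lower() for w in phrase.split()]
    n = len(target)
    hits = []
    for i in range(len(aligned_words) - n + 1):
        window = aligned_words[i:i + n]
        same_file = len({w.sound_file for w in window}) == 1
        if same_file and [w.word.lower() for w in window] == target:
            hits.append((window[0].sound_file, window[0].start, window[-1].end))
    return hits


# Invented example data, for illustration only.
words = [
    AlignedWord("read", "clip_a.wav", 12.40, 12.71),
    AlignedWord("my", "clip_a.wav", 12.71, 12.90),
    AlignedWord("lips", "clip_a.wav", 12.90, 13.45),
    AlignedWord("read", "clip_b.wav", 3.10, 3.38),
]

index = build_index(words)
print(find_phrase(words, "read my lips"))
```

With time stamps stored per word, a question such as "when did X say Y?" reduces to a text search followed by a jump to the matching stretch of audio.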
'Speech in the wild'
Rethinking language
• Dogs
• Parrot talk (to/about, not by)
• Talk to inanimate objects
• We can look forward to ...
Listen they were all going [belch] that ain't a burp he said
Like I'd be talking like this and suddenly it'll go [mimics microphone noises]
He simply went [sound effect] through his nose
A future of research?
• Survey of audio-visual tools and resources in the Humanities (AHRC ICT Strategy project)
http://www.phon.ox.ac.uk/ict
Key findings
– Growing but relatively poorly-supported use of audio & video in many subjects (Music, Modern Languages, Modern History, Archaeology, Classics, Art, Linguistics)
– Annotation, search and browse tools are essential
– Digital data storage and processing power required vastly outstrips text and photos, and is commensurate with e-Science grid computing
How big is “big science”?
• Human genome: 3 GB
• DASS audio sampler: 350 GB
• Hubble space telescope: 0.5 TB/year
• 'Year of Speech': 1 - 2 TB
• Sloan digital sky survey: 16 TB
• Beazley Archive & partners: 20 TB
• Ruskin School of Art student projects: 30 TB
• 10m Google Books: ~150 TB
• Survivors of the Shoah Visual History Foundation: 180 TB
• Large Hadron Collider: 15 PB/year = 100 x Google Books
Photographic collections, film libraries, museum catalogues etc are pretty large nowadays
-------------- humanities
Why does big matter?
• What kind of questions you can study depends on the material you've got. (Obviously.)
• Humanities deals with rare and unique works and interpretations, not repeatable events.
• To study rare events/things and connections, it can be important to just have a lot of data – as much as possible – in order to have enough examples.
Rare(ish) events in English
• I’[n] trying: 160 instances in BNC
• See[n] to: 310
• Alar[ŋ] clock: 18
• Swimmi[m] pool: 44
• Getti[m] paid: 19
• Weddi[m] present: 15
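To see why these counts argue for very large corpora, they can be turned into rough rates per million words. The sketch below assumes the counts are drawn from the full 10 million words of transcribed speech in the BNC; the slides do not state the exact denominator, so the rates are indicative only.

```python
# Assumption: counts come from the whole spoken BNC (about 10 million words).
SPOKEN_BNC_WORDS = 10_000_000

# Instance counts as given on the slide.
counts = {
    "I'[n] trying": 160,
    "See[n] to": 310,
    "Alar[ŋ] clock": 18,
    "Swimmi[m] pool": 44,
    "Getti[m] paid": 19,
    "Weddi[m] present": 15,
}


def per_million(count, corpus_words):
    """Occurrences per million words of running speech."""
    return count * 1_000_000 / corpus_words


rates = {form: per_million(n, SPOKEN_BNC_WORDS) for form, n in counts.items()}

for form, rate in rates.items():
    print(f"{form}: {rate:.1f} per million words")
```

On that assumption, a one-million-word corpus would be expected to contain only one or two tokens of several of these forms, which is too few to study them.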
Challenges: technology
• Amount of material
• Storage
– CD quality: 635 MB/hour
– Uncompressed .wav files: 115 MB/hour
– 16 acoustic analysis parameters: 1.44 MB/hour
– 2.8 GB/day
– 85 GB/month
– 1.02 TB/year
• Computing
– distance measures, etc.
– alignment of labels
– searching and browsing
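The storage figures can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes) and round-the-clock audio. The slides do not state the format behind the 115 MB/hour figure; 16 kHz, 16-bit mono is an assumption that happens to reproduce it. The 1.44 MB/hour for acoustic parameters is taken from the slide as given.

```python
SECONDS_PER_HOUR = 3600


def pcm_mb_per_hour(sample_rate_hz, bits_per_sample, channels):
    """Size of one hour of uncompressed PCM audio, in MB (10**6 bytes)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * SECONDS_PER_HOUR / 1e6


# CD audio: 44.1 kHz, 16-bit, stereo.
cd_quality = pcm_mb_per_hour(44_100, 16, 2)

# Assumed format for the .wav figure: 16 kHz, 16-bit, mono.
wav_16k_mono = pcm_mb_per_hour(16_000, 16, 1)

# Acoustic analysis parameters: figure taken from the slide, not derived.
params_mb_per_hour = 1.44

per_hour = wav_16k_mono + params_mb_per_hour
gb_per_day = per_hour * 24 / 1000
gb_per_month = gb_per_day * 365 / 12
tb_per_year = gb_per_day * 365 / 1000

print(f"CD quality: {cd_quality:.0f} MB/hour")
print(f".wav:       {wav_16k_mono:.0f} MB/hour")
print(f"Per day:    {gb_per_day:.1f} GB")
print(f"Per month:  {gb_per_month:.0f} GB")
print(f"Per year:   {tb_per_year:.2f} TB")
```

On these assumptions the daily, monthly and yearly totals correspond to one continuous year of .wav audio plus its acoustic parameters.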
Challenges: technology
• Storing 1.02 TB/year: not really a problem in the 21st century
• 1 TB (1000 GB) hard drive costs c. £65
• Computing (distance measures, alignments, labels etc): multiprocessor cluster
Collaboration, not collection
Search interface 1 (e.g. Oxford)
Search interface 2 (e.g. BL)
Search interface 3 (e.g. Penn)
Search interface 4 (e.g. Lancaster ?)
BNC-XML database - retrieve time stamps
Spoken BNC recordings - BL sound server(s)
LDC database - retrieve time stamps
Spoken LDC recordings - various locations
Database of time stamps produced using consistent indexing standards
Your recordings - whatever location
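The arrangement in this diagram, where any search interface queries several time-stamp databases while the recordings stay on their own servers, could be sketched roughly as below. The class and function names, the URLs and the entries are all hypothetical; the slides do not specify a protocol or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    corpus: str
    recording: str  # where the audio lives (hypothetical URL)
    start: float    # seconds
    end: float


class TimeStampDatabase:
    """Stand-in for one site's time-stamp database."""

    def __init__(self, corpus, entries):
        self.corpus = corpus
        self._entries = entries  # word -> list of (recording, start, end)

    def lookup(self, word):
        rows = self._entries.get(word.lower(), [])
        return [Hit(self.corpus, rec, start, end) for rec, start, end in rows]


def federated_search(databases, word):
    """Ask every database for time stamps; the audio itself stays where it is."""
    hits = []
    for db in databases:
        hits.extend(db.lookup(word))
    return hits


# Invented entries, for illustration only.
bnc = TimeStampDatabase("BNC", {
    "aks": [("https://example.org/bl/tape_001.wav", 61.2, 61.5)],
})
ldc = TimeStampDatabase("LDC", {
    "aks": [("https://example.org/ldc/call_042.wav", 5.0, 5.3),
            ("https://example.org/ldc/call_107.wav", 230.8, 231.1)],
})

results = federated_search([bnc, ldc], "aks")
for hit in results:
    print(hit.corpus, hit.recording, hit.start, hit.end)
```

Because each hit carries only a location and a pair of time stamps, another collection indexed to the same standard could be added as one more database without moving any audio.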
Challenges: dispersal/aggregation
• Dispersed resources; grid computing
• Need for international standards (for authorisation etc.)
• Humanities research may require new support structures (cf. 'big science' comparisons)
• 'Federated library' or 'national research laboratory' models?
Challenges: dispersal/technology
• Finding stuff
• Doing something with it
• Transformation, new interpretations
Challenges: human
• Human aspect more important than hardware
• Who is qualified to carry out such work?
• Employment prospects
• What training provision is required?
• Should training in computer programming become normal for arts/humanities students?
Possible impacts
• Will open up Year of Speech data and tools to linguistics, phonetics, speech communication, oral history, education
• Automatic and reliable indexing of spoken on-line materials would be a “killer app”
• Caveat: it is practically impossible to predict the impact of developments in the market (cf. Microsoft, Google, YouTube)
• or of technologies that come to market (transistors, lasers, holograms).
• So it’s even harder to reliably predict the impact of cutting-edge research
Thank you for your time and attention
http://www.phon.ox.ac.uk/mining/
http://bvreh.humanities.ox.ac.uk/
http://www.phon.ox.ac.uk/ict
Spoken Babylonian
Martin Worthington: Babylonian and Assyrian Poetry and Literature: An Archive of Recordings
http://people.pwf.cam.ac.uk/mjw65/BAPLAR/Archive
The Righteous Sufferer (Ludlul bēl nēmeqi), part of Tablet II, read by Margaret Jaques Cavigneaux
Babylonian Karaoke
1 šattamma ana balāṭ adanna īteq
1 One whole year to the next! The appointed time passed.
2 asaḫḫurma lemun lemunma
3 zapurtī ūtaṣṣapa išartī ul uttu
2 As I turned around, it was more and more terrible;
3 My ill luck was on the increase, I could find no good fortune.
4 ila alsīma ul iddina pānīšu
5 usalli ištarī ul ušaqqâ rēšīša
Visualisation using 3-D/4-D models
Screen renderings of the Odeion at Pompeii, 3D visualisation and research by Martin Blazeby, King's Visualisation Lab