Large-Scale Computational Research in Arts & Humanities

Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media. John Coleman, Faculty of Linguistics, Philology and Phonetics, University of Oxford. The future of research? 19/10/10


Presented by John Coleman at the JISC Future of Research conference, 19th October 2010

Transcript of Large-Scale Computational Research in Arts & Humanities

Page 1: Large-Scale Computational Research in Arts & Humanities

Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media

John Coleman

Faculty of Linguistics, Philology and Phonetics

University of Oxford

The future of research? 19/10/10

Page 2: Large-Scale Computational Research in Arts & Humanities

• What am I talking about? (Show and tell)

• How is it affecting the A/H research landscape?

• Implications (cost, strategy etc)

Page 3: Large-Scale Computational Research in Arts & Humanities

Once upon a time ...

There she weaves by night and day

A magic web with colours gay

Page 4: Large-Scale Computational Research in Arts & Humanities

1967: writing

Important/jj as/cs was/bedz Mr./np O'Donnell's/np$ essay/nn ,/, his/pp$ thesis/nn is/bez so/ql restricting/jj as/cs to/to deny/vb Faulkner/np the/at stature/nn which/wdt he/pps obviously/rb has/hvz ./. He/pps and/cc also/rb Mr./np Cowley/np and/cc Mr./np Warren/np have/hv fallen/vbn to/in the/at temptation/nn which/wdt besets/vbz many/ap of/in us/ppo to/to read/vb into/in our/pp$ authors/nns --/-- Nathaniel/np Hawthorne/np ,/, for/in example/nn ,/, and/cc Herman/np Melville/np --/-- protests/vbz against/in modernism/nn ,/, material/jj progress/nn ,/, and/cc science/nn which/wdt are/ber genuine/jj protests/nns of/in our/pp$ own/jj but/cc may/md not/* have/hv been/ben theirs/pp$$ ./. Faulkner's/np$ total/nn works/nns today/nr ,/, and/cc in/in fact/nn those/dts of/in his/pp$ works/nns which/wdt existed/vbd in/in 1946/cd when/wrb Mr./np Cowley/np made/vbd his/pp$ comment/nn ,/, or/cc in/in 1939/cd ,/, when/wrb Mr./np O'Donnell/np wrote/vbd his/pp$ essay/nn ,/, reveal/vb no/at such/jj simple/jj attitude/nn toward/in the/at South/nr-tl ./. If/cs he/pps is/bez a/at traditionalist/nn ,/, he/pps is/bez an/at eclectic/jj traditionalist/nn ./. If/cs he/pps condemns/vbz the/at recent/jj or/cc the/at present/nn ,/, he/pps condemns/vbz the/at past/nn with/in no/ql less/ap force/nn ./. If/cs he/pps sees/vbz the/at heroic/jj in/in a/at Sartoris/np or/cc a/at Sutpen/np ,/, he/pps sees/vbz also/rb --/-- and/cc he/pps shows/vbz --/-- the/at blind/jj and/cc the/at mean/jj ,/, and/cc he/pps sees/vbz the/at Compson/np family/nn disintegrating/vbg from/in within/rb ./. He/pps is/bez not/* one/cd to/to remain/vb more/ql comfortably/rb and/cc
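The tagged text above stores each token as `word/tag`, separated by spaces. As a minimal sketch of how such a file can be read by program, the snippet below splits tokens into (word, tag) pairs and counts the tags. The function name `parse_tagged` and the short sample are illustrative assumptions, not part of the talk.

```python
from collections import Counter


def parse_tagged(text):
    """Split whitespace-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that tokens such as ",/," parse correctly.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: treat the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


# First ten tokens of the passage shown on the slide.
sample = ("Important/jj as/cs was/bedz Mr./np O'Donnell's/np$ "
          "essay/nn ,/, his/pp$ thesis/nn is/bez")

pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts.most_common(3))
```

Splitting on the last slash rather than the first is a design choice that keeps punctuation tokens and any word containing a slash intact.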

Page 5: Large-Scale Computational Research in Arts & Humanities

XML TEI: still writing

<inscript id="halu0001"><sourceDesc><physObj type="ashlar" engrave="engraved" color=""><desc>Negev. Elusa (Haluza). 100-299 CE. Limestone ashlar dressed as a tabula ansata. </desc><letterHgt min="1.7" max="3.5">1.7-3.0 cm Aramaic, 2.5-3.5 cm Greek</letterHgt><dateRange calendar="Gregorian" from="100" to="299">100 CE to 299 CE<note>Based on the Greek and Palmyrene script.</note></dateRange><discovery><place region="Negev" city="Elusa (Haluza)" site="Foundation of an abandoned Beduin structure" locus=""><note>The inscription consists of two lines of Greek followed by one of

Page 6: Large-Scale Computational Research in Arts & Humanities
Page 7: Large-Scale Computational Research in Arts & Humanities

VRE for the Study of Documents and Manuscripts: writing

A trial of Kathryn Sutherland’s Jane Austen manuscript project; supported by CCH, King's College London but here ported to our VRE-SDM demonstrator

Page 8: Large-Scale Computational Research in Arts & Humanities

From writing to video

Page 9: Large-Scale Computational Research in Arts & Humanities

2005

Page 10: Large-Scale Computational Research in Arts & Humanities

2008

YouTube surpasses Yahoo as world’s #2 search engine

Page 11: Large-Scale Computational Research in Arts & Humanities

Researchers of the future

• Just as comfortable creating multimodal online content (video, games, websites etc.) as writing essays

• Online video is interesting; TV is boring (passive)

Page 12: Large-Scale Computational Research in Arts & Humanities

Texts

Page 13: Large-Scale Computational Research in Arts & Humanities

Texts

Page 14: Large-Scale Computational Research in Arts & Humanities

Texts

Page 15: Large-Scale Computational Research in Arts & Humanities

Texts

Здесь Будет ]ружен Памятник ]божденный т[

Page 16: Large-Scale Computational Research in Arts & Humanities

Texts

Здесь Будет ]ружен Памятник ]божденный т[руд]

Page 17: Large-Scale Computational Research in Arts & Humanities

Texts

Здесь Будет ]ружен Памятник ]божденный т[руд]

Here will be erected a monument to liberated labour

Page 18: Large-Scale Computational Research in Arts & Humanities

Vocal tract movements in speech

Page 19: Large-Scale Computational Research in Arts & Humanities

Resonance tuning in soprano singing and vocal tract shaping

Erik Bresch, Speech Production and Articulation kNowledge Group, University of Southern California

Page 20: Large-Scale Computational Research in Arts & Humanities

Mining a Year of Speech: a “Digging into Data” project

http://www.phon.ox.ac.uk/mining/

Page 21: Large-Scale Computational Research in Arts & Humanities

John Coleman, Greg Kochanski, Ladan Ravary, Sergio Grau

Oxford University Phonetics Laboratory

Lou Burnard

Jonnie Robinson, The British Library

with support from

Page 22: Large-Scale Computational Research in Arts & Humanities

Mark Liberman, Jiahong Yuan, Chris Cieri

Phonetics Laboratory and Linguistic Data Consortium, UPenn

with support from NSF

Page 23: Large-Scale Computational Research in Arts & Humanities

The “Digging into Data” challenge

“The creation of vast quantities of Internet accessible digital data and the development of techniques for large-scale data analysis and visualization have led to remarkable new discoveries in genetics, astronomy and other fields ...

With books, newspapers, journals, films, artworks, and sound recordings being digitized on a massive scale, it is possible to apply data analysis techniques to large collections of diverse cultural heritage resources as well as scientific data.”

Page 24: Large-Scale Computational Research in Arts & Humanities

The “Year of Speech”

• A grove of corpora, held at various sites with a common indexing scheme and search tools

• US English material: 2,240 hrs of telephone conversations

• 1,255 hrs of broadcast news

• As-yet unpublished talk show conversations (1000 hrs), Supreme Court oral arguments (5000 hrs), political speeches and debates

• British English: Spoken part of the British National Corpus, 10 million words of transcribed speech

• Recently digitized by collaboration with British Library

Page 25: Large-Scale Computational Research in Arts & Humanities

C-SPAN

• US cable TV channel covering Senate/House proceedings, committees, current affairs discussion shows

• 20-year archive of publicly open video

• Large parts of the proceedings officially transcribed and published

Page 26: Large-Scale Computational Research in Arts & Humanities

Digging for audio: kinds of questions someone might ask

1. When did X say Y?

For example, "find the video clip where George Bush said 'read my lips'."

2. How do arguments work?

For example, how do different people handle interruptions?

3. How frequent are linguistic features such as phrase-final rising intonation ("uptalk") across different age groups, genders, social classes, and places?

4. Who says “ask” and who says “aks”?

Page 27: Large-Scale Computational Research in Arts & Humanities

British National Corpus

• Collected in early 1990s by consortium of dictionary makers (Collins, Longman, OUP) and academics (Oxford, Lancaster, Oslo-Bergen)

• 100m word text (XML) corpus, of which 10m is transcribed speech

• c. 4.2 m words is demographically-sampled recordings of unplanned conversations

• British Market Research Bureau loaned Sony Walkmans to recruits

• c. 5 m words is “context-governed” speech (educational, business, public speeches/meetings, 'leisure' – sports, clubs, broadcast, phone-ins etc)

• Transcribed by audio typists and structured in XML database with rich metadata annotations

Page 28: Large-Scale Computational Research in Arts & Humanities

A few speech samples from the BNC

• A domestic drama

• Political commentary/current affairs

• Are dogs people too?

Page 29: Large-Scale Computational Research in Arts & Humanities

Practicalities

• In order to be of much practical use, such very large corpora must be indexed at word and phoneme level

• All included speech corpora must therefore have associated text transcriptions

• We use the Penn Phonetics Laboratory Forced Aligner to associate each word and segment with the corresponding start and end points in the sound files
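To illustrate why word-level time stamps make a large speech corpus searchable, here is a minimal sketch of an index built from aligned words. The row layout `(recording, word, start, end)`, the recording ids, and the times are hypothetical; the aligner's actual output format is not reproduced here.

```python
from collections import defaultdict

# Hypothetical aligned words: (recording id, word, start seconds, end seconds).
aligned = [
    ("rec001", "read", 12.31, 12.52),
    ("rec001", "my", 12.52, 12.66),
    ("rec001", "lips", 12.66, 13.05),
    ("rec002", "read", 4.10, 4.35),
]


def build_word_index(rows):
    """Map each lower-cased word to the places where it was spoken."""
    index = defaultdict(list)
    for recording, word, start, end in rows:
        index[word.lower()].append((recording, start, end))
    return index


def find_phrase(rows, phrase):
    """Return (recording, start, end) spans where the words occur consecutively."""
    words = phrase.lower().split()
    hits = []
    for i in range(len(rows) - len(words) + 1):
        window = rows[i:i + len(words)]
        same_recording = len({row[0] for row in window}) == 1
        if same_recording and [row[1].lower() for row in window] == words:
            # Span runs from the first word's start to the last word's end.
            hits.append((window[0][0], window[0][2], window[-1][3]))
    return hits


index = build_word_index(aligned)
print(index["read"])
print(find_phrase(aligned, "read my lips"))
```

With an index like this, a query returns a recording and a time span, so only the matching clip needs to be fetched from the sound file.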

Page 30: Large-Scale Computational Research in Arts & Humanities

'Speech in the wild'

Page 31: Large-Scale Computational Research in Arts & Humanities

'Speech in the wild'

Page 32: Large-Scale Computational Research in Arts & Humanities

Rethinking language

• Dogs

• Parrot talk (to/about, not by)

• Talk to inanimate objects

• We can look forward to ...

Listen they were all going [belch] that ain't a burp he said

Like I'd be talking like this and suddenly it'll go [mimics microphone noises]

He simply went [sound effect] through his nose

Page 33: Large-Scale Computational Research in Arts & Humanities

A future of research?

• Survey of audio-visual tools and resources in the Humanities (AHRC ICT Strategy project)

http://www.phon.ox.ac.uk/ict

Key findings

– Growing but relatively poorly-supported use of audio & video in many subjects (Music, Modern Languages, Modern History, Archaeology, Classics, Art, Linguistics)

– Annotation, search and browse tools are essential

– Digital data storage and processing power required vastly outstrips text and photos, and is commensurate with e-Science grid computing

Page 34: Large-Scale Computational Research in Arts & Humanities

How big is “big science”?

• Human genome: 3 GB

• DASS audio sampler: 350 GB

• Hubble space telescope: 0.5 TB/year

• 'Year of Speech': 1-2 TB

• Sloan digital sky survey: 16 TB

• Beazley Archive & partners: 20 TB

• Ruskin School of Art student projects: 30 TB

• 10m Google Books: ~150 TB

• Survivors of the Shoah Visual History Foundation: 180 TB

• Large Hadron Collider: 15 PB/year = 100 x Google Books

Photographic collections, film libraries, museum catalogues etc are pretty large nowadays

Page 35: Large-Scale Computational Research in Arts & Humanities

How big is “big science”?

• Human genome: 3 GB

• DASS audio sampler: 350 GB

• Hubble space telescope: 0.5 TB/year

• 'Year of Speech': 1-2 TB

• Sloan digital sky survey: 16 TB

• Beazley Archive & partners: 20 TB

• Ruskin School of Art student projects: 30 TB

• 10m Google Books: ~150 TB

• Survivors of the Shoah Visual History Foundation: 180 TB

• Large Hadron Collider: 15 PB/year = 100 x Google Books

Photographic collections, film libraries, museum catalogues etc are pretty large nowadays

-------------- humanities

Page 36: Large-Scale Computational Research in Arts & Humanities

Why does big matter?

• What kind of questions you can study depends on the material you've got. (Obviously.)

• Humanities deals with rare and unique works and interpretations, not repeatable events.

• To study rare events/things and connections, it can be important to just have a lot of data – as much as possible – in order to have enough examples.

Page 37: Large-Scale Computational Research in Arts & Humanities

Rare(ish) events in English

• I’[n] trying: 160 instances in BNC

• See[n] to: 310

• Alar[ŋ] clock: 18

• Swimmi[m] pool: 44

• Getti[m] paid: 19

• Weddi[m] present: 15
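Counts like these depend on finding every place where a given pair of words occurs side by side. The slide does not say how the figures were obtained; the sketch below only shows one simple way to count adjacent word pairs in a tokenised transcript. The token list and target pairs are made-up illustrations.

```python
from collections import Counter


def count_word_pairs(tokens, targets):
    """Count how often each target (first, second) pair occurs adjacently."""
    wanted = {tuple(pair) for pair in targets}
    counts = Counter()
    # zip(tokens, tokens[1:]) walks over every adjacent pair of tokens.
    for pair in zip(tokens, tokens[1:]):
        if pair in wanted:
            counts[pair] += 1
    return counts


# Made-up transcript fragment, already lower-cased and split into words.
tokens = ("i set the alarm clock by the swimming pool "
          "then reset the alarm clock").split()
targets = [("alarm", "clock"), ("swimming", "pool"), ("wedding", "present")]

counts = count_word_pairs(tokens, targets)
for pair in targets:
    print(" ".join(pair), counts[pair])
```

In a small sample most target pairs never appear at all, which is the practical reason rare events call for as much data as possible.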

Page 38: Large-Scale Computational Research in Arts & Humanities

Challenges: technology

• Amount of material

• Storage

– CD quality: 635 MB/hour

– Uncompressed .wav files: 115 MB/hour

– 16 acoustic analysis parameters: 1.44 MB/hour

– 2.8 GB/day

– 85 GB/month

– 1.02 TB/year

• Computing

– distance measures, etc.

– alignment of labels

– searching and browsing
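The storage figures can be checked with a few lines of arithmetic. The slide does not state the audio formats, so the sketch below assumes 44.1 kHz 16-bit stereo for "CD quality" and 16 kHz 16-bit mono for the uncompressed .wav files; both are assumptions chosen because they match the quoted MB/hour values. The 1.44 MB/hour for acoustic parameters is taken from the slide as given.

```python
SECONDS_PER_HOUR = 3600


def mb_per_hour(sample_rate_hz, bytes_per_sample, channels):
    """Uncompressed PCM audio size in MB (10**6 bytes) per hour."""
    return sample_rate_hz * bytes_per_sample * channels * SECONDS_PER_HOUR / 1e6


# Assumed formats (not stated on the slide).
cd_quality = mb_per_hour(44_100, 2, 2)  # 44.1 kHz, 16-bit, stereo
wav_16k = mb_per_hour(16_000, 2, 1)     # 16 kHz, 16-bit, mono
params = 1.44                           # MB/hour, figure quoted on the slide

# One year of continuous speech: audio plus acoustic parameters.
per_hour_mb = wav_16k + params
per_day_gb = per_hour_mb * 24 / 1000
per_month_gb = per_day_gb * 365 / 12
per_year_tb = per_day_gb * 365 / 1000

print(f"CD quality: {cd_quality:.0f} MB/hour")
print(f".wav:       {wav_16k:.0f} MB/hour")
print(f"Per day:    {per_day_gb:.1f} GB")
print(f"Per month:  {per_month_gb:.0f} GB")
print(f"Per year:   {per_year_tb:.2f} TB")
```

Under these assumptions the per-day, per-month and per-year totals round to the 2.8 GB, 85 GB and 1.02 TB quoted above.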

Page 39: Large-Scale Computational Research in Arts & Humanities

Challenges: technology

• Storing 1.02 TB/year: not really a problem in 21st century

• 1 TB (1000 GB) hard drive costs c. £65

• Computing (distance measures, alignments, labels etc): multiprocessor cluster

Page 40: Large-Scale Computational Research in Arts & Humanities

Collaboration, not collection

Search interface 2 (e.g. BL)

Search interface 1 (e.g. Oxford)

Search interface 3 (e.g. Penn)

Search interface 4 (e.g. Lancaster ?)

BNC-XML database - retrieve time stamps

Spoken BNC recordings - BL sound server(s)

LDC database - retrieve time stamps

Spoken LDC recordings - various locations

Page 41: Large-Scale Computational Research in Arts & Humanities

Collaboration, not collection

Search interface 2 (e.g. BL)

Search interface 1 (e.g. Oxford)

Search interface 3 (e.g. Penn)

Search interface 4 (e.g. Lancaster ?)

BNC-XML database - retrieve time stamps

Spoken BNC recordings - BL sound server(s)

LDC database - retrieve time stamps

Spoken LDC recordings - various locations

Database of time stamps produced using consistent indexing standards

Your recordings - whatever location
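The diagram separates two steps: any search interface first retrieves time stamps from a corpus database, then fetches the matching audio from wherever the recordings are held. A rough sketch of that two-step lookup follows. The in-memory databases, recording ids, server addresses and URL scheme are all hypothetical placeholders, not the project's actual services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    corpus: str
    recording: str
    start: float
    end: float


# Hypothetical time-stamp databases: word -> list of (recording, start, end).
TIMESTAMP_DBS = {
    "BNC": {"uptalk": [("bnc_0001", 3.2, 3.9)]},
    "LDC": {"uptalk": [("ldc_0042", 10.0, 10.6)], "aks": [("ldc_0007", 1.5, 1.8)]},
}

# Hypothetical locations of the recordings, held separately from the indexes.
AUDIO_SERVERS = {
    "BNC": "https://sounds.example.org/bnc",
    "LDC": "https://audio.example.org/ldc",
}


def search(word):
    """Step 1: collect time stamps for a word from every corpus database."""
    hits = []
    for corpus, db in TIMESTAMP_DBS.items():
        for recording, start, end in db.get(word, []):
            hits.append(Hit(corpus, recording, start, end))
    return hits


def clip_url(hit):
    """Step 2: build the address of the audio clip on the holding server."""
    base = AUDIO_SERVERS[hit.corpus]
    return f"{base}/{hit.recording}?start={hit.start}&end={hit.end}"


for hit in search("uptalk"):
    print(hit.corpus, clip_url(hit))
```

Because the time stamps follow one indexing scheme, adding another collection only means registering one more database and one more audio location; the recordings themselves stay where they are.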

Page 42: Large-Scale Computational Research in Arts & Humanities

Challenges: dispersal/aggregation

• Dispersed resources; grid computing

• Need for international standards (for authorisation etc.)

• Humanities research may require new support structures (cf. 'big science' comparisons)

• 'Federated library' or 'national research laboratory' models?

Page 43: Large-Scale Computational Research in Arts & Humanities

Challenges: dispersal/technology

• Finding stuff

• Doing something with it

• Transformation, new interpretations

Page 44: Large-Scale Computational Research in Arts & Humanities

Challenges: human

• Human aspect more important than hardware

• Who is qualified to carry out such work?

• Employment prospects

• What training provision is required?

• Should training in computer programming become normal for arts/humanities students?

Page 45: Large-Scale Computational Research in Arts & Humanities

Possible impacts

• Will open up Year of Speech data and tools to linguistics, phonetics, speech communication, oral history, education

• Automatic and reliable indexing of spoken on-line materials would be a “killer app”

• Caveat: it is practically impossible to predict the impact of developments in the market (cf. Microsoft, Google, YouTube) or of inventions that come to market (transistors, lasers, holograms).

• So it’s even harder to reliably predict impacts of cutting-edge research

Page 46: Large-Scale Computational Research in Arts & Humanities

Thank you for your time and attention

http://www.phon.ox.ac.uk/mining/

http://bvreh.humanities.ox.ac.uk/

http://www.phon.ox.ac.uk/ict

Page 47: Large-Scale Computational Research in Arts & Humanities
Page 48: Large-Scale Computational Research in Arts & Humanities

Spoken Babylonian

Martin Worthington: Babylonian and Assyrian Poetry and Literature: An Archive of Recordings

http://people.pwf.cam.ac.uk/mjw65/BAPLAR/Archive

The Righteous Sufferer (Ludlul bēl nēmeqi), part of Tablet II, read by Margaret Jaques Cavigneaux

Page 49: Large-Scale Computational Research in Arts & Humanities

Babylonian Karaoke

1 šattamma ana balāṭ adanna īteq

1 One whole year to the next! The appointed time passed.

2 asaḫḫurma lemun lemunma

3 zapurtī ūtaṣṣapa išartī ul uttu

2 As I turned around, it was more and more terrible;

3 My ill luck was on the increase, I could find no good fortune.

4 ila alsīma ul iddina pānīšu

5 usalli ištarī ul ušaqqâ rēšīša

Page 50: Large-Scale Computational Research in Arts & Humanities
Page 51: Large-Scale Computational Research in Arts & Humanities

Visualisation using 3-D/4-D models

Screen renderings of the Odeion at Pompeii, 3D visualisation and research by Martin Blazeby, King's Visualisation Lab
