Mapping Implicit Processes: Extracting Social Networks from Digital Corpora
M. H. Beals
Sheffield Hallam University
@mhbeals
Overview
Understanding Scissors-and-Paste Journalism in Georgian Britain
Computer-Aided Identification of Reprints and Memes
Understanding Dissemination Pathways
Manual Construction of Social Networks
Computer-Aided Ordering of Dissemination Pathways
Future Plans
Scissors-and-Paste Journalism in Georgian Britain
Proliferation of Colonial and Provincial Presses
Spread of Journeyman Printers
Reduction of Stamp Duty
New Profit Models
Entertaining and Literary Content
Adverts to Attract Readers to Sell to Advertisers
Manual Dissemination of News
Limited Number of “Specials”
Postal Exchange, Subscriptions, Correspondence
No Telegraph until 1840s and Not Used for Miscellany
Computer-Aided Identification of Reprints & Memes
Promise
Large-Scale Digitisation Efforts
Keyword Searching
nGram Matching (WCopyFind)
Edition Tracking (Juxta)
Viral Texts Project (Cordell, Dillon, and Smith)
Large-Scale Corpus of Nineteenth-Century Newspapers
Extensive, Automatic Repair of OCR Errors
Identification of Highly Reprinted Materials (Memes)
Discussion and Exploration of Meme Traits and Patterns
Perils
Discrete Digital Corpora
(Paywalls)
Offline Penumbra (Curation)
Lost Nodes (Incomplete Data)
OCR Variability (50-80%)
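The OCR variability listed above means that exact phrase matching will miss reprints whose text was recognised imperfectly. The following is a minimal sketch of tolerant comparison using Python's standard `difflib` module. The function names, the sample strings, and the 0.8 threshold are illustrative assumptions, not part of the original slides.

```python
from difflib import SequenceMatcher


def ocr_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two text snippets."""
    # Lower-case and collapse whitespace so layout differences are ignored.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def is_probable_reprint(a, b, threshold=0.8):
    """Treat two snippets as the same text if they are similar enough.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return ocr_similarity(a, b) >= threshold


# Invented samples: a clean reading, a simulated noisy OCR reading of the
# same sentence, and an unrelated sentence.
clean = "The Indians have killed and scalped a great number of persons"
noisy = "The lndians have kiIled and fcalped a great numher of perfons"
unrelated = "Reduction of stamp duty encouraged new provincial presses"

print(is_probable_reprint(clean, noisy))
print(is_probable_reprint(clean, unrelated))
```

A character-level ratio of this kind tolerates single-letter substitutions (such as the long s read as "f"), at the cost of being slower than exact n-gram lookup on a large corpus.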
Computer-Aided Identification of Reprints & Memes
# concordanceset.py
import re


def replace_words(text, word_dic):
    # Replace every key of word_dic found in text with its mapped value.
    rc = re.compile('|'.join(map(re.escape, word_dic)))

    def translate(match):
        return word_dic[match.group(0)]

    return rc.sub(translate, text)


def getNGrams(wordlist, n):
    # Return every run of n consecutive words in wordlist.
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]


basenumber = input('What is the first id number? ')
number = str(basenumber)
numberint = int(basenumber)
basenumberend = input('What is the last id number? ')
endnumber = int(basenumberend)

ngram = input('How many words should be in a phrase? ')
ngrams = int(ngram)

# Split the combined text into words and build the list of n-word phrases.
combifile = 'combine.txt'
with open(combifile, "r") as listopen:
    wordlist = listopen.read()
splitlist = wordlist.split()
ngramslist = getNGrams(splitlist, ngrams)

# Sort the phrases and remove duplicates.
if ngramslist:
    ngramslist.sort()
    last = ngramslist[-1]
    for i in range(len(ngramslist)-2, -1, -1):
        if last == ngramslist[i]:
            del ngramslist[i]
        else:
            last = ngramslist[i]

tidystring = ''

# For each phrase, list the id of every numbered text file that contains it.
for item in ngramslist:
    number = str(basenumber)
    numberint = int(basenumber)
    lineitem = " ".join(item)
    print(lineitem)
    tidystring += '\n' + lineitem + ','

    while numberint <= endnumber:
        file = number + ".txt"
        with open(file, 'r') as fin:
            text = fin.read()
        if lineitem in text:
            tidystring += number + ','
        numberint += 1
        number = str(numberint)

# create an excelfile for this example
excel_file = "ngramcompiled.csv"
with open(excel_file, "w") as fout:
    fout.write(tidystring)
Understanding Dissemination Pathways
Meme Identification
Courtesy of Viral Texts Project, http://www.viraltexts.org/
Understanding Dissemination Pathways
Chronological Spread
Courtesy of Viral Texts Project, https://www.youtube.com/watch?v=YwDlyt7jhMs
Understanding Dissemination Pathways
Genealogical Model
Manual Construction of Social Networks
The Glasgow Advertiser, 7 October 1793, p. 5
Knoxville, May 11.
IT is shocking to describe the bloody scenes that have lately taken place in this district. The Indians have killed and scalped a great number of persons, among whom is Colonel Isaac Bledose, who was massacred within 150 yards of his own house.
On the 27th instant a body of Indians attacked Greenfield station: they killed John Jervis, and a negro fellow, belonging to Mrs. Tarker. By the bravery of three young men, viz. William Neely, William Wilson, and William Hall, the station was preserved; they killed two Indians, wounded several others, and put them to flight. It is to be remembered, that Neely and Hall had each lost a father and two brothers, and Wilson a brother, by the savages. Men are now in pursuit of the Indians.
Full Discussion of Dissemination Pathway Available at: http://prezi.com/in4_bqvgmanr/
Manual Construction of Social Networks
Derived from Glasgow News Archive, British Library 19th Century Newspapers,
NewspaperArchive.com, Readex Early American Newspapers, Newspapers.com, and the University of Kentucky
Computer-Aided Ordering of Dissemination Pathways
Binary Computer Model
Arbitrary Tolerance Levels
Reference to Additional Tables
Bypassing Missing Nodes
Flexibility
Difficult to Recreate Human Instinct…
…But is That a Bad Thing?
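One way to picture the binary model above is as a yes-or-no decision for each reprint: is any earlier surviving text similar enough to count as its source? The sketch below is an assumption about how such a model might look, not the author's implementation. The paper names, dates, texts, the bigram-overlap measure, and the 0.5 tolerance are all invented for illustration.

```python
from datetime import date


def shingles(text, n=2):
    """Return the set of n-word phrases in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of phrases that two texts have in common, from 0 to 1."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def likely_sources(witnesses, tolerance=0.5):
    """Map each title to its most similar strictly earlier witness.

    A title maps to None when no earlier text reaches the tolerance level,
    which here is an arbitrary, hypothetical value.
    """
    ordered = sorted(witnesses, key=lambda w: w[1])
    links = {}
    for title, printed, text in ordered:
        best_title, best_score = None, 0.0
        for earlier_title, earlier_date, earlier_text in ordered:
            if earlier_date >= printed:
                continue  # a source must have been printed earlier
            score = jaccard(text, earlier_text)
            if score > best_score:
                best_title, best_score = earlier_title, score
        links[title] = best_title if best_score >= tolerance else None
    return links


# Invented witnesses: (title, date printed, text).
witnesses = [
    ("Paper A", date(1793, 5, 1),
     "the indians have killed and scalped a great number of persons in this district"),
    ("Paper B", date(1793, 6, 1),
     "the indians have killed and scalped a great number of persons"),
    ("Paper C", date(1793, 7, 1),
     "indians have killed and scalped a great number of persons near knoxville"),
]

print(likely_sources(witnesses))
# Dropping Paper B simulates a lost node: Paper C falls back to Paper A.
print(likely_sources([w for w in witnesses if w[0] != "Paper B"]))
```

Because every earlier witness is a candidate, a missing intermediate copy does not break the chain. The reprint simply attaches to the nearest surviving ancestor that still clears the tolerance level.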
Computer-Aided Ordering of Dissemination Pathways
Phylogenetic Model
Image Courtesy of Fred Hsu (Wikipedia:User:Fredhsu on en.wikipedia) CC-BY-SA-3.0 via Wikimedia Commons
Future Plans
Computer Program
OCR Clean-up Processes
Division into Likely Meme Groupings
Variety of Relatedness Scores
Textual Integrity
Prefixes and Suffixes
Chronological Separation
Chronological-Geographical Feasibility
Well-Worn Path Modifier
Modeling of Relatedness Factors
Directional Social Network Database
Raw Data to Inform Additional Research
Manual Corrections
Direct Attributions
Parsing Compilations
Initial Discovery of Well-Worn Paths
Inclusion of Offline Materials
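The relatedness factors named in the plans above could, for example, be combined into a single score per pair of texts. The sketch below uses a plain weighted mean. The slides name the factors but give no weights or formula, so the weights, the 0-to-1 scaling, and the function name are hypothetical.

```python
# Hypothetical weights: the slides list these factors but assign no values.
WEIGHTS = {
    "textual_integrity": 0.4,
    "prefixes_and_suffixes": 0.1,
    "chronological_separation": 0.2,
    "chronological_geographical_feasibility": 0.2,
    "well_worn_path": 0.1,
}


def relatedness(factors):
    """Weighted mean of factor scores, each expected between 0 and 1."""
    missing = set(WEIGHTS) - set(factors)
    if missing:
        raise ValueError("missing factor scores: " + ", ".join(sorted(missing)))
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return weighted / total_weight


# Invented scores for one candidate source-and-reprint pair.
example = {
    "textual_integrity": 0.9,
    "prefixes_and_suffixes": 0.5,
    "chronological_separation": 0.7,
    "chronological_geographical_feasibility": 1.0,
    "well_worn_path": 0.0,
}
print(relatedness(example))
```

Keeping the weights in one table makes it straightforward to test how sensitive the resulting network is to each factor, which is one way the modeling of relatedness factors could be approached.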
www.mhbeals.com/cnd
About Me: www.mhbeals.com