Digitizing Serialized Fiction
Kirk Hess
Library Research Showcase
November 19, 2013
kirkhess@illinois.edu

Finding Serialized Fiction

“Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser known writers and even some long-time readers. The value of this publishing model enabled literature to be disseminated to rural communities and expand the bounds of American literary culture across geographic and socioeconomic lines.”

How can we identify serialized fiction in an article-segmented newspaper archive?

Methodology

• Manual extraction/indexing of one title
  - The Farmer’s Wife
  - Workflow: http://bit.ly/1aCYZSa
  - TEI/Scripto (OCR correction)

• Automated techniques
  - Common n-grams, e.g. ‘Chapter (number/roman numeral)’, ‘To Be Continued’, ‘the end’, etc.
  - Topic/genre/theme, e.g. romance, children’s stories, holidays, etc.
  - Named entity recognition
  - Predictive solutions (Bayes, Google API)

THE MYSTERIOUS MCCORKLES by F. Roney Weir
http://uller.grainger.uiuc.edu/omeka/items/show/20
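The common n-gram idea from this slide can be sketched with a few regular expressions. This is a minimal illustration, not the workflow used in the project: the pattern set, the names `SERIAL_MARKERS` and `find_serial_markers`, and the choice of Python are all assumptions, and real OCR text would need looser matching.

```python
import re

# Hypothetical marker patterns based on the examples on the slide.
# Roman numerals are matched in upper case only, so ordinary words made of
# the same letters (e.g. "did", "mild") do not count as chapter numbers.
SERIAL_MARKERS = {
    "chapter_heading": re.compile(r"\b(?i:chapter)\s+(?:\d+|[IVXLCDM]+)\b"),
    "to_be_continued": re.compile(r"\bto\s+be\s+continued\b", re.IGNORECASE),
    # "The End" only counts when it closes the article text.
    "the_end": re.compile(r"\bthe\s+end\W*\Z", re.IGNORECASE),
}


def find_serial_markers(article_text):
    """Return the sorted names of the marker patterns found in one article."""
    return sorted(
        name
        for name, pattern in SERIAL_MARKERS.items()
        if pattern.search(article_text)
    )


sample = "CHAPTER IV. The storm came up at dusk. (To be continued.)"
print(find_serial_markers(sample))
```

An article that matches one or more markers is only a candidate for serialized fiction; a phrase such as "in the end" closing a farm report would also match, so the hits still need review or a second signal.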

Analysis/Results

• Manual indexing of The Farmer’s Wife w/Omeka
  - Sample set completed Fall 2012
  - http://uller.grainger.illinois.edu/omeka/

• Topic Analysis (Latent Dirichlet Allocation, David Blei et al.) w/Mallet (http://mallet.cs.umass.edu/)
  - Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread

• Network Analysis w/Gephi
  - Topics and documents are nodes; an edge links a document to each topic it appears in.

• Named Entity Recognition (NER) w/Stanford NLP Named Entity Recognizer
  - Proper names interfere with LSA; find names programmatically.
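The topic/document network described above can be sketched as an edge list that a graph tool could load. This is only an illustration under stated assumptions: the slides do not say how edges were chosen, so the `min_weight` cutoff, the function names, the `topic_N` labels, and the input shape (per-document topic proportions, such as a topic model can output) are all hypothetical. The `Source,Target,Weight` header follows the column names Gephi's spreadsheet import expects for edge tables.

```python
import csv
import io


def topic_document_edges(doc_topic_weights, min_weight=0.1):
    """Build (document, topic, weight) edges from per-document topic proportions.

    doc_topic_weights maps a document id to a list of topic proportions.
    min_weight is an assumed cutoff: weaker document/topic links are dropped.
    """
    edges = []
    for doc_id, weights in doc_topic_weights.items():
        for topic_index, weight in enumerate(weights):
            if weight >= min_weight:
                edges.append((doc_id, f"topic_{topic_index}", weight))
    return edges


def gephi_edge_csv(edges):
    """Serialize edges as CSV text with a Source,Target,Weight header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


# Made-up proportions for two hypothetical articles and three topics.
proportions = {"article_1": [0.7, 0.05, 0.25], "article_2": [0.0, 1.0, 0.0]}
print(gephi_edge_csv(topic_document_edges(proportions)))
```

Documents and topics both appear as node ids in the edge list, so the result is a two-mode graph: documents that share a topic node end up close together in a force-directed layout.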

Analysis/Results (cont.)

• Naïve Bayes Classifier using NLTK
  - Similar to the Movie Review sample, using a small subset of articles and the top 2000 words:

    >>> classifier.show_most_informative_features(5)
    contains(having) = True          fictio : nonfic =      1.9 : 1.0
    contains(plan) = True            fictio : nonfic =      1.9 : 1.0
    contains(growing) = True         fictio : nonfic =      1.9 : 1.0
    contains(entertaining) = True    fictio : nonfic =      1.9 : 1.0
    contains(home) = True            fictio : nonfic =      1.9 : 1.0

  - High accuracy (> 0.95) but weak ratios: the most informative features are only 1.9 : 1.0.
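To show what the ratios in that output measure, here is a small from-scratch Naïve Bayes sketch over binary contains(word) features. It is not the NLTK implementation or the code behind the slide: the class `TinyNaiveBayes`, the smoothing constant of 0.5, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def document_features(words, vocabulary):
    """Binary contains(word) features over a fixed vocabulary."""
    present = set(words)
    return {f"contains({word})": word in present for word in vocabulary}


class TinyNaiveBayes:
    """Hypothetical minimal classifier; not the NLTK implementation."""

    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing  # assumed additive smoothing constant
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)

    def train(self, labeled_features):
        """Count labels and (feature, value) pairs per label."""
        for features, label in labeled_features:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[label][(name, value)] += 1
        return self

    def prob(self, name, value, label):
        """Smoothed P(feature = value | label) for a binary feature."""
        count = self.feature_counts[label][(name, value)]
        total = self.label_counts[label]
        return (count + self.smoothing) / (total + 2 * self.smoothing)

    def classify(self, features):
        """Pick the label with the highest log prior plus log likelihoods."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            score = math.log(n_docs / total_docs)
            for name, value in features.items():
                score += math.log(self.prob(name, value, label))
            scores[label] = score
        return max(scores, key=scores.get)

    def likelihood_ratio(self, name, label_a, label_b, value=True):
        """How much more likely the feature value is under label_a than label_b."""
        return self.prob(name, value, label_a) / self.prob(name, value, label_b)


# Toy training data, invented for this example.
docs = [
    (["she", "said", "chapter"], "fiction"),
    (["he", "said", "home"], "fiction"),
    (["plant", "corn", "soil"], "nonfiction"),
    (["corn", "milk", "butter"], "nonfiction"),
]
vocabulary = sorted({word for words, _ in docs for word in words})
model = TinyNaiveBayes().train(
    [(document_features(words, vocabulary), label) for words, label in docs]
)
print(model.classify(document_features(["she", "said"], vocabulary)))
```

Read this way, a ratio of 1.9 : 1.0 says a word is not even twice as likely in one class as in the other, so no single word separates fiction from nonfiction strongly even when overall accuracy on the sample is high.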

Next Steps

• Implement Veridian
• Crowdsource OCR correction
• Automated tagging of articles
• Direct access to index (Solr)
• Continue NLP research using NLTK with additional classifiers, further NER research, and a full training set.
• Expand probabilistic statistical methods across the archive (~1 million pages, 5 million articles).