Temple University Digital Scholarship Center: Model of the Month Club: September 2015

Model of the Month Club

Meeting 1:

What is a model in DH? Example: Underwood et al., Understanding Genre

Essentially, all models are wrong, but some are useful.

--George Box, statistician

1919-2013

What’s a model? (broadest definition)

A model is a simplified representation of something, and in principle models can be built out of words, balsa wood, or anything you like. In practice... statistical models are often equations that describe the probability of an association between variables.

Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/

What kinds of models are we looking at in this workshop?


What kinds of model are we looking at today?


Understanding Genre in a Collection of a Million Volumes (Underwood et al.)

Problem: Classification

Why is this a problem?

1) HathiTrust has poor genre metadata.

2) Volumes are generically heterogeneous.

Desired result: provide a way of sorting HathiTrust text data to make it useful for literary scholars

Classification as a form of machine learning

In general:

Data → training data → predictive classifier (model) → prediction → evaluation

This project:

Text → coded text → regularized logistic regression → prediction → 93.9% accurate

+hidden Markov smoothing

Data: Text

We began by obtaining full text of all public-domain English-language works in HathiTrust between 1700 and 1922. Organizing a group of five readers, we asked them to label individual pages in a total of 414 books; this produced our training data. We transformed the text of all the books into counts of features on each page; most of these features were words that we counted, but we also recorded other details of page structure.

Underwood: Understanding Genre Interim Report

Background: Bag of words
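As a rough illustration of what "counts of features on each page" can look like, here is a minimal bag-of-words sketch. The function name `page_features`, the tokenizing regular expression, and the two sample pages are assumptions made for this example; the project's actual tokenizer is not shown in these slides. The lowercasing and apostrophe-s handling follow the normalization the report describes.

```python
import re
from collections import Counter

def page_features(page_text):
    """Count word tokens on one page, ignoring word order (a bag of words)."""
    # Assumed tokenizer: runs of letters/digits, optionally followed by 's etc.
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", page_text.lower())
    # Truncate a final apostrophe-s so "captain's" counts as "captain".
    tokens = [t[:-2] if t.endswith("'s") else t for t in tokens]
    return Counter(tokens)

# Two invented pages: one narrative, one bibliographic.
pages = [
    "The captain's ship sailed. The ship was lost.",
    "Ibid., p. 42. See also vol. 2, 8vo.",
]
counts = [page_features(p) for p in pages]
print(counts[0]["ship"])     # 2
print(counts[0]["captain"])  # 1
print(counts[1]["8vo"])      # 1
```

Each page becomes a table of word counts; the order of the words on the page is discarded.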

Training data: Coded text

223 volumes were tagged by five people, with assigned volume lists overlapping so that almost all the pages in the volumes were read by at least two readers (and some by three). This strategy allowed us to make tentative estimates of human dissensus, which were invaluable. But it was a relatively slow process, because it required coordination. The remaining 191 volumes were simply tagged at the page level by the PI. In cases where we had three readers, we resolved human disagreements by voting. In other cases, we accepted the more general genre tag, or the tag produced by more experienced readers.

But:

Selection of volumes was probably the most questionable aspect of our methodology, and an area we will give more attention as we expand into the twentieth century.
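The voting step for pages with three readers can be sketched as follows. This is only an illustration: the function name `resolve_page_label` and the sample tags are hypothetical, and returning `None` stands in for "fall back to another rule", such as the more general tag or the more experienced reader.

```python
from collections import Counter

def resolve_page_label(tags):
    """Return the tag chosen by a strict majority of readers, else None."""
    if not tags:
        return None
    label, votes = Counter(tags).most_common(1)[0]
    # None signals that voting did not settle it and another rule is needed.
    return label if votes > len(tags) / 2 else None

print(resolve_page_label(["fiction", "fiction", "nonfiction"]))  # fiction
print(resolve_page_label(["fiction", "poetry"]))                 # None
```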

Classification: Feature engineering

We used 1062 features in our models. 1036 of them were words, or word categories; a full list is available on Github: https://github.com/tedunderwood/genre/blob/master/data/biggestvocabulary.txt. In general, we selected features by grouping pages into the categories we planned to classify. We took the top 500 words from each category, and then grouped the words from all categories into a master list that we could limit to the top N most frequent words. This ensured that our list contained words like “8vo” and “ibid” that might be uncommon in the whole corpus, but extremely dispositive as clues about a particular class of pages. We normalized everything to lowercase (after counting certain forms of capitalization as “structural features”) and truncated final apostrophe-s.
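The selection procedure described above (top words per category, merged into a master list, then cut to the top N) might look roughly like this. The function `build_vocabulary`, its parameter names, and the toy data are assumptions for illustration, as is ranking the merged list by corpus-wide frequency with alphabetical tie-breaking; the report does not spell out those details here.

```python
from collections import Counter

def build_vocabulary(pages_by_category, top_per_category=500, max_features=1000):
    """pages_by_category maps a category name to a list of token lists (one per page)."""
    candidates = set()
    overall = Counter()
    for pages in pages_by_category.values():
        category_counts = Counter()
        for tokens in pages:
            category_counts.update(tokens)
        overall.update(category_counts)
        # Keep the most frequent words within this category.
        candidates.update(w for w, _ in category_counts.most_common(top_per_category))
    # Rank the merged candidates by overall frequency (ties broken alphabetically).
    ranked = sorted(candidates, key=lambda w: (-overall[w], w))
    return ranked[:max_features]

# Invented toy corpus with two page categories.
pages_by_category = {
    "fiction": [["she", "said", "she", "door"], ["she", "said", "the"]],
    "index": [["ibid", "8vo", "the"], ["ibid", "8vo", "ibid"]],
}
vocab = build_vocabulary(pages_by_category, top_per_category=2, max_features=4)
print(vocab)  # ['ibid', 'she', '8vo', 'said']
```

Because candidates are drawn per category first, a word like "8vo" survives even though it never appears on the fiction pages.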

Classification: Regularized Logistic Regression

Once we had designed this overall workflow, it was possible to plug different classification algorithms into the page-level classification step of the process. We tried a range of algorithms here, including random forests and support vector machines. We also tried a range of different ensemble strategies, including strategies that combine multiple algorithms, before settling on an ensemble of regularized logistic models, trained by comparing each genre to all the other genres collectively.
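"Comparing each genre to all the other genres collectively" is often called a one-vs-rest setup. The sketch below shows only the two bookkeeping steps around the binary models: building yes/no labels for one genre, and choosing among the models' outputs. The function names and the rule of taking the highest probability are assumptions for illustration; the slides do not say how the project combined its models.

```python
def one_vs_rest_labels(page_genres, target):
    """Binary targets for one model: 1 for the target genre, 0 for every other genre."""
    return [1 if genre == target else 0 for genre in page_genres]

def pick_genre(prob_by_genre):
    """Assumed combination rule: take the genre whose binary model is most confident."""
    return max(prob_by_genre, key=prob_by_genre.get)

page_genres = ["fiction", "poetry", "nonfiction", "fiction"]
print(one_vs_rest_labels(page_genres, "fiction"))  # [1, 0, 0, 1]

# Hypothetical outputs of three binary models for a single page.
print(pick_genre({"fiction": 0.2, "poetry": 0.7, "nonfiction": 0.4}))  # poetry
```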

Regularized Logistic Regression

name for a kind of classification algorithm: a set of assumptions & mathematical processes designed to predict the likelihood that a given set of features occurring on a single page means that page belongs to a specific genre

in general:

calculating the odds that an instance belongs to a certain class given the presence of a particular feature or set of features, compared to the odds of its belonging to that class without those features

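does not need or assume a linear relationship between the features and the outcome itself (the linear part is the log-odds)

does not assume that the features follow a particular distribution (such as the normal distribution)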

creates a decision boundary used to produce binary outcomes (yes or no, fiction or not fiction)
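To make the points above concrete, here is a small from-scratch sketch of binary logistic regression with an L2 (ridge) penalty, trained by gradient descent. It is not the project's code: the function names, the learning rate, the penalty strength, and the two invented features (relative frequency of "said" and of "ibid" on a page) are all assumptions for illustration.

```python
import math

def sigmoid(z):
    """Turn a log-odds score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, l2=0.1, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on L2-penalized log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 term pulls weights toward zero; the bias is left unpenalized.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Probability that a page with features x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Invented pages: [frequency of "said", frequency of "ibid"]; 1 = fiction.
X = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
p_fiction = predict_proba(w, b, [0.7, 0.1])
p_other = predict_proba(w, b, [0.1, 0.7])
print(p_fiction > 0.5, p_other > 0.5)  # True False
```

The decision boundary is wherever the predicted probability equals 0.5: pages above it are labeled fiction, pages below it are not.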

Example

http://courses.washington.edu/css490/2012.Winter/lecture_slides/05b_logistic_regression.pdf

+a Hidden Markov Model

assumes that the probability of an instance is affected by the immediately prior instance in a sequence & that this is a hidden state (something external influencing instance probability)

used in this case to try to incorporate the fact that the genre of the volume has something to do with the genre of the page

From project:

There are a variety of clever approaches that might be tried to coordinate page-level predictions with knowledge of volume structure. We trained a hidden Markov model, which is a relatively simple approach. The model contains information about the probability of transition from one genre to another, so it is in a sense a model of volume structure. But in practice, its main effect is to smooth out noisy single-page errors; for instance, it was good at catching a few isolated pages misclassified as nonfiction in the middle of a novel.
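The smoothing effect described above can be illustrated with a small Viterbi decoder, a standard way to find the most likely hidden-state sequence in a hidden Markov model. This is a sketch under stated assumptions, not the project's model: the function name `viterbi_smooth`, the single `stay_prob` transition parameter, the uniform starting probabilities, and the use of page-classifier probabilities as emission scores are all simplifications chosen for this example. It expects at least two genres and nonzero probabilities.

```python
import math

def viterbi_smooth(page_probs, genres, stay_prob=0.9):
    """Most likely genre sequence given per-page probabilities and sticky transitions.

    page_probs: one dict per page mapping genre -> probability from a page classifier.
    """
    switch_prob = (1.0 - stay_prob) / (len(genres) - 1)

    def log_trans(prev, cur):
        return math.log(stay_prob if prev == cur else switch_prob)

    # best[g] = log score of the best path that ends in genre g on the current page.
    best = {g: math.log(1.0 / len(genres)) + math.log(page_probs[0][g]) for g in genres}
    backpointers = []
    for probs in page_probs[1:]:
        new_best, pointers = {}, {}
        for g in genres:
            prev = max(genres, key=lambda p: best[p] + log_trans(p, g))
            new_best[g] = best[prev] + log_trans(prev, g) + math.log(probs[g])
            pointers[g] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the final page.
    state = max(genres, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

genres = ["fiction", "nonfiction"]
# Invented classifier output: page 3 looks like nonfiction in the middle of a novel.
page_probs = [
    {"fiction": 0.9, "nonfiction": 0.1},
    {"fiction": 0.8, "nonfiction": 0.2},
    {"fiction": 0.4, "nonfiction": 0.6},
    {"fiction": 0.9, "nonfiction": 0.1},
    {"fiction": 0.85, "nonfiction": 0.15},
]
raw = [max(p, key=p.get) for p in page_probs]
smoothed = viterbi_smooth(page_probs, genres)
print(raw[2], smoothed[2])  # nonfiction fiction
```

Because switching genres is costly, one weakly nonfiction-looking page surrounded by confident fiction pages gets relabeled as fiction.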

Evaluation

Result

https://sharc.hathitrust.org/genre



Discussion
