More HTRC Loretta Auvil, Boris Capitanu University of Illinois at Urbana-Champaign [email protected], [email protected]

Outline

• HTRC Analysis

– Topic Modeling

– Spell Checking

Meandre Flow

Encapsulation and integration environment for tools and algorithms

Topic Modeling

Topic Modeling Flow

Topic Modeling in HTRC

Topics for Jane Austen Workset

• Some of the topics from Jane Austen

Topic Modeling References

• http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/

• http://dsl.richmond.edu/dispatch/pages/intro

• http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/

• http://www.ics.uci.edu/~newman/pubs/JASIST_Newman.pdf

• https://dhs.stanford.edu/visualization/topic-networks/

• Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 96–104, Portland, OR, USA, 24 June 2011. © 2011 Association for Computational Linguistics

• Matthew Jockers, Macroanalysis: Digital Methods and Literary History, University of Illinois Press, 2013

• Termite: Visualization Techniques for Assessing Textual Topic Models, Jason Chuang, Christopher D. Manning, Jeffrey Heer, Advanced Visual Interfaces, 2012

• Mallet website: http://mallet.cs.umass.edu

• David Mimno’s website: http://www.cs.princeton.edu/~mimno/

Spell Checking

Spell Check in HTRC

Spell Check Report

Spell Check Replacement Rules

Spellchecking Analysis

• Not just OCR detection but OCR correction

• Can also be used for cleaning other messy data
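The detection half of this analysis can be sketched as a per-volume report of tokens that are missing from a dictionary. This is a minimal illustration only, not the Meandre flow used in HTRC: the function name `spellcheck_report`, the tokenizer, the tiny dictionary, and the volume IDs are all assumptions made for the example.

```python
import re
from collections import Counter

def spellcheck_report(volumes, dictionary):
    """Hypothetical helper: count out-of-dictionary tokens per volume."""
    report = {}
    for volume_id, text in volumes.items():
        # Assumed tokenizer: lowercase runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        unknown = Counter(t for t in tokens if t not in dictionary)
        report[volume_id] = {
            "tokens": len(tokens),
            "misspelled": sum(unknown.values()),
            "by_token": unknown.most_common(),
        }
    return report

# Toy data (assumed): "modem", "houfe", and "liim" stand in for OCR errors.
dictionary = {"the", "modern", "house", "of", "him"}
volumes = {"vol-1": "The modem houfe of liim", "vol-2": "the house"}
report = spellcheck_report(volumes, dictionary)
for volume_id, row in report.items():
    print(volume_id, row["misspelled"], "of", row["tokens"], "tokens flagged")
```

Because the report only needs a token list and a word list, the same approach applies to other messy text, not only OCR output.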

Spell Check Flow

Demonstration

• HTRC Portal

– Topic Modeling

– Spellcheck

Learning Exercises (1)

1. Run Meandre_Topic_Modeling Algorithm

A. Click on “Algorithms”

B. Click on “Meandre_Topic_Modeling”

1. Provide Job Name (required)

2. Select a Workset (required)

3. Adjust Additional Parameters (optional)

a. Provide the number of tokens to be displayed in the tagcloud (default: 200):

b. Provide the number of topics to be created (default: 10):

4. Click “Submit” button

C. Once Job finishes, select Job Name

D. View Results by clicking on “topic_tagclouds.html”
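To make the two optional parameters concrete (number of topics, and number of tokens shown per tag cloud), here is a toy collapsed Gibbs sampler for LDA in pure Python. It is a sketch under stated assumptions, not the Meandre_Topic_Modeling algorithm itself: the function name `fit_topics`, the hyperparameters `alpha` and `beta`, the iteration count, and the sample documents are all invented for illustration.

```python
import random
from collections import Counter

def fit_topics(docs, num_topics=10, num_tokens=200, iters=200,
               alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not HTRC's code).

    Returns, for each topic, up to num_tokens (word, count) pairs.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # per-document topic counts
    topic_word = [Counter() for _ in range(num_topics)]  # per-topic word counts
    topic_total = [0] * num_topics
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to the conditional weights.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # The top tokens per topic are what a tag cloud would display.
    return [
        [(w, c) for w, c in tw.most_common(num_tokens) if c > 0]
        for tw in topic_word
    ]

# Toy workset (assumed): each document is a list of tokens.
docs = [
    "ship sea captain whale sea".split(),
    "whale ship harpoon sea".split(),
    "ball dance marriage sister".split(),
    "sister marriage letter dance ball".split(),
]
topics = fit_topics(docs, num_topics=2, num_tokens=3, iters=100)
for k, tokens in enumerate(topics):
    print("topic", k, tokens)
```

Raising `num_topics` splits the vocabulary into finer groups, while `num_tokens` only changes how many words of each topic are displayed.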

Learning Exercises (2)

2. Run Meandre_Spellcheck_Report_Per_Volume

A. Click on “Algorithms”

B. Click on “Meandre_Spellcheck_Report_Per_Volume”

1. Provide Job Name (required)

2. Select a Workset (required)

3. Adjust Additional Parameters (optional)

a. Provide the transformation rules as text, e.g. h=li; li=h; rn=m; m=rn; s=f;

b. Provide a URL that contains the dictionary

c. Provide a URL for token counts, which are used to choose the best correctly spelled word based on popularity.

4. Click “Submit” button

C. Once Job finishes, select Job Name

D. View Results by clicking on “spellcheck_report.html”, “replacement_rules.txt”, etc.
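The three optional parameters above (transformation rules, dictionary, token counts) can be combined roughly as follows. This is a hedged sketch, not the Meandre implementation: it assumes the left side of each rule is the string found in the OCR token and the right side is its replacement, that a rule is applied once per candidate, and that the helper names `parse_rules`, `candidates`, and `correct` and the toy dictionary and counts are hypothetical.

```python
def parse_rules(text):
    """Parse a rule string such as 'h=li; li=h;' into (source, target) pairs."""
    rules = []
    for part in text.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed entries
        src, dst = part.split("=", 1)
        rules.append((src.strip(), dst.strip()))
    return rules

def candidates(token, rules):
    """Apply each rule once at every position where its source string occurs."""
    found = set()
    for src, dst in rules:
        start = token.find(src)
        while start != -1:
            found.add(token[:start] + dst + token[start + len(src):])
            start = token.find(src, start + 1)
    return found

def correct(token, rules, dictionary, counts):
    """Return the most frequent in-dictionary candidate, else the token unchanged."""
    if token in dictionary:
        return token
    valid = [c for c in candidates(token, rules) if c in dictionary]
    if not valid:
        return token
    # Popularity decides between several valid candidates; ties break alphabetically.
    return max(valid, key=lambda c: (counts.get(c, 0), c))

# Rule string taken from the exercise; dictionary and counts are toy assumptions.
rules = parse_rules("h=li; li=h; rn=m; m=rn; s=f;")
dictionary = {"him", "modern", "the"}
counts = {"the": 1000, "him": 300, "modern": 40}
fixed = [correct(t, rules, dictionary, counts)
         for t in ["tlie", "liim", "modem", "xyz"]]
print(fixed)
```

The token counts only matter when more than one candidate is in the dictionary, which is why they are optional: with a single valid candidate the dictionary alone decides.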

Attendee Project Plan

• Study/Project Title

• Team Members and their Affiliation

• Procedural Outline of Study/Project

– Research Question/Purpose of Study

– Data Sources

– Analysis Tools

• Activity Timeline or Milestones

• Report or Project Outcome(s)

• Ideas on what your team needs from SEASR staff to help you achieve your goal.

Identify Research Question

Discussion Questions

• What analytical tools or applications do you want to utilize with HT data?