Plale HathiTrust El Colegio de Mexico May2014
HathiTrust and HTRC: the changing Digital Library
El Colegio de Mexico | 20.May.14
Beth Plale – @bplale Professor, School of Informatics and Computing
Director, HathiTrust Research Center Indiana University
Tweet us - @HathiTrust #HTRC
HATHI TRUST RESEARCH CENTER!
#HTRC @HathiTrust
HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
  – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc
http://www.hathitrust.org
→ Distinguished from
Take a look at details of the HathiTrust collection
Content
• Books and journals
  – Pilots around images, audio, born-digital
• Digitization sources
  – Google (96.8%, 10,162,104)
  – Internet Archive (2.9%, 301,972)
  – Local (0.3%, 31,840)
Content Sources
Content Package
Metadata
• Bibliographic
• Structural
• Rights
• Administrative (preservation)
• Holdings
HathiTrust Repository Organization
HathiTrust Repository Organization
File System
Content distribution
Content distribution
Not public domain outside available
→ The HathiTrust repository is a latent goldmine for text-mining analysis, analysis of large-scale corpora through computational tools, and time-based analysis
→ The restricted nature of HT content suggests a need for new forms of access that preserve the intimate nature of research investigation while honoring restrictions
→ Paradigm: computation moves to the data (not vice versa)
Mission of HT Research Center
• Research arm of HathiTrust
• Goal: enable researchers worldwide to carry out computational investigation of the HT repository through
  – Develop model for access: the ‘workset’
  – Develop tools that facilitate research by digital humanities and informatics communities
  – Develop secure cyberinfrastructure that allows computational investigation of the entire copyrighted and public domain HathiTrust repository
• Established: July 2011
• Collaborative effort of Indiana University and University of Illinois
HTRC system
[Figure: HTRC system diagram. Labels: request, workset builder, complexity-hiding interface, the complexity, tabular info, statistical plots, spatial plots]
HTRC Timeline
• Phase I: development, 01 Jul 2011 – 31 Mar 2013
  – HTRC software and services release v1.0 http://sourceforge.net/p/htrc/code/
• Phase II: outreach, 01 Apr 2013 - present
  – 2nd HTRC UnCamp Sep ‘13
Attendees of UnCamp’13
Access to copyrighted materials: HTRC Data Capsule
A secure computing framework that:
• Trusts that the researcher will not deliberately leak repository data, but
• Prevents malware acting on the user's behalf from leaking data.
Enforces:
• Non-consumptive use: framework provides safe handling of large volumes of protected data
• Openness: framework supports user-contributed analysis tools (that is, does not limit uses to a known set of algorithms)
• Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance
• Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers
HTRC Secure Capsule Architectural Components
[Figure: architecture diagram. Labels: Researcher, SSH, research results, VM Manager, VM instance, VM Image Manager, VM Image Store, VM Image Builder, Secure Capsule cluster]
HTRC Secure Capsule Architecture
[Figure: architecture diagram. Labels: Researcher, SSH, research results, VM Manager, VM instance, VM Image Manager, VM Image Store, VM Image Builder, Registry Services, worksets]
Upon run, Secure Capsule controls I/O behind the scenes.
1) Researcher requests new VM of type X
2) Image instance is created
3) Researcher installs tools onto VM through window on her desktop.
4) Final location of results is registry
HTRC secure data capsule: view from researcher desktop
EXAMPLES OF RESEARCH CARRIED OUT THROUGH HATHI TRUST RESEARCH CENTER
• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
Stacy Kowalczyk, Asst. Professor, Dominican University Zong Peng, HTRC, Indiana University
Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
Gender Identification of Text
• Question investigated: Can we use author names in bibliographic records to identify gender?
• 2.6 million bibliographic records
  – Extracted personal author data
  – MARC 100 abcd and 700 abcd
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
  – Last name, first name middle name, titles/honorifics, dates
Why interesting to HTRC? Introduces a new source of metadata, drawn from sources with varying authority
Raises questions:
1) How should community-contributed metadata be distinguished from more authoritative sources?
2) How should variability of quality, even within a single contribution, be conveyed to the community?
Authors vs Names
• Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856-
• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
• Methuen, Algernon, 1856-1924
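The cataloging convention the study relies on (last name, first and middle names, titles/honorifics, dates) can be sketched as a small parser. This is only an illustration, not the HTRC code: the function name `parse_author`, the abbreviated honorific list, and the date pattern are all assumptions made for the example.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical, abbreviated honorific list; the study used larger web-sourced lists.
HONORIFICS = {"sir", "bart", "lady", "mrs", "miss", "baron", "count"}
# Matches a life-date part such as "1856-1924" or an open-ended "1856-".
DATES = re.compile(r"^\d{4}\s*-\s*(\d{4})?$")

class Author(NamedTuple):
    last: str
    given: list
    titles: list
    dates: Optional[str]

def parse_author(raw: str) -> Author:
    """Split 'Last, First Middle, Title, dates' on commas, then classify each part."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    last, given, titles, dates = parts[0], [], [], None
    for part in parts[1:]:
        if DATES.match(part):
            dates = part
            continue
        for token in part.split():
            # Tokens found in the honorific list become titles; the rest are given names.
            if token.rstrip(".").lower() in HONORIFICS:
                titles.append(token.rstrip("."))
            else:
                given.append(token)
    return Author(last, given, titles, dates)

author = parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924")
```

A comma-less variant such as "Methuen Algernon" ends up entirely in `last`, which shows why the variant strings above cannot be reconciled by parsing alone.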
Sources of Data
• The Virtual International Authority File
  – Hosted by OCLC
• Harvested names from multiple data sources
  – Census bureau
  – Baby name sites
• EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
  – Developed an extensive list of European names
• Titles and honorifics
  – Multiple web resources
  – Sir, Baron, Count, Duke, Father, Cardinal, etc.
  – Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
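How name lists and honorifics might be combined into a gender guess can be sketched as below. The tiny lists, the function name `guess_gender`, and the rule that titles take precedence over given names are assumptions for illustration; the talk does not say how the sources were weighted against each other.

```python
# Tiny illustrative lists; the study drew on VIAF, census data, baby-name
# sites, and the EU patent names list. All entries here are assumptions.
FEMALE_NAMES = {"mary", "elizabeth", "leslie"}
MALE_NAMES = {"algernon", "john", "leslie"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}

def guess_gender(given_names, titles=()):
    """Return 'female', 'male', 'ambiguous', or 'unknown' for one author string."""
    votes = set()
    for title in titles:
        t = title.rstrip(".").lower()
        if t in FEMALE_TITLES:
            votes.add("female")
        elif t in MALE_TITLES:
            votes.add("male")
    if not votes:
        # Fall back to given names only when the titles say nothing (assumed rule).
        for name in given_names:
            n = name.lower()
            if n in FEMALE_NAMES:
                votes.add("female")
            if n in MALE_NAMES:
                votes.add("male")
    if not votes:
        return "unknown"
    return votes.pop() if len(votes) == 1 else "ambiguous"
```

The four return values mirror the female, male, unknown, and ambiguous categories reported in the results.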
Initial Gender Results
• Approximately 80% of name strings have an initial gender identification
  – Female: 59,365 (10%)
  – Male: 425,994 (70%)
  – Unknown: 114,204 (19%)
  – Ambiguous: 5,965 (less than 1%)
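The percentages can be recomputed from the reported counts with a few lines of Python. Note that the four counts sum to 605,528, slightly fewer than the 606,437 unique author strings quoted earlier; the slides do not explain the difference, so this sketch uses the sum of the four categories as the denominator.

```python
counts = {"female": 59_365, "male": 425_994, "unknown": 114_204, "ambiguous": 5_965}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
identified = shares["female"] + shares["male"]  # strings with a gender assigned

for label, pct in shares.items():
    print(f"{label:<10}{counts[label]:>9,}{pct:6.1f}%")
print(f"identified (female + male): {identified:.1f}%")
```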
Results by Data Source
Against the whole set of name strings
• VIAF
  – 19% hit rate
• Web Names
  – 54% hit rate
• Patent Names
  – 8% hit rate
Colin Allen, Jamie Murdock Cognitive Science, Indiana University
Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
The InPhO project is instructive because it demonstrates an interaction sequence between a researcher and his/her corpus that is nuanced, multistep, and multi-modal.
The HTRC cyberinfrastructure must be able to handle such a nuanced form of interaction between a researcher and their texts.
Digging into philosophy of science
• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
• Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
• Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply LDA on 86-volume subset
• Using IPython Notebook
LDA topic modeling
• LDA (Latent Dirichlet Allocation) uses a Bayesian updating method to generate a set of “topics” – probability distributions over the set of terms in a corpus
• Number of topics is a parameter in the modeling technique
• Method finds the set of topics that is best able to reproduce the term distributions in documents belonging to the corpus
• Documents may be whole volumes, chapters, articles, single pages, even individual sentences – modeler’s choice
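To make the idea concrete, here is a minimal pure-Python sketch of LDA fitted by collapsed Gibbs sampling. The talk does not say which inference method or library the InPhO team used, so the sampler, the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab);
    every row of doc_topic and topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * n_words for _ in range(n_topics)]
    nk = [0] * n_topics

    # Give every token a random initial topic.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[t][v] + beta) / (nk[t] + n_words * beta) for v in range(n_words)]
        for t in range(n_topics)
    ]
    return doc_topic, topic_word, vocab

# Toy corpus: two "animal mind" documents and two course-catalog documents.
docs = [
    "animal mind instinct animal behaviour".split(),
    "instinct animal mind reason".split(),
    "course credit semester course catalog".split(),
    "semester catalog credit course".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
```

Here each "document" is a token list, so the same function applies whether the modeler chooses volumes, pages, or sentences as the unit; `n_topics` is the number-of-topics parameter mentioned above.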
Volume-level topic modeling on ‘anthropomorphism’ yields a set of topics
Of the set of topics, choose ‘16’ as best
Volumes most similar to topic 16
Repeat LDA at page level
Topic model at page level for topics anthropomorphism, animal, and psychology
Words sorted by similarity
Pick top 3: topics 16, 10, 26
Show documents of topics 10, 16, 26
Drop to sentence level
• Select three books with highest aggregate of 20-40 topic-relevant pages for more precise analysis
• Manually augment argument analysis
  – Remodeling of three volumes at sentence level
  – Training other methods using human analysis plus sentence similarity
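The step of narrowing from page-level topic weights to a few volumes can be sketched as follows. The talk does not say how a page was judged "topic-relevant", so the 0.5 threshold, the function names `rich_pages` and `top_volumes`, and the toy page data are assumptions.

```python
from collections import Counter

def rich_pages(page_topics, topic, threshold=0.5):
    """Return (volume_id, page_no) pairs whose weight on `topic` meets the threshold.

    page_topics maps (volume_id, page_no) to a list of topic probabilities.
    """
    return sorted(key for key, dist in page_topics.items() if dist[topic] >= threshold)

def top_volumes(pages, n=3):
    """Rank volumes by how many topic-rich pages they contain."""
    return Counter(vol for vol, _ in pages).most_common(n)

# Hypothetical page-level topic distributions over two topics.
page_topics = {
    ("vol_a", 1): [0.9, 0.1], ("vol_a", 2): [0.7, 0.3], ("vol_a", 3): [0.2, 0.8],
    ("vol_b", 1): [0.6, 0.4], ("vol_b", 2): [0.1, 0.9],
    ("vol_c", 1): [0.3, 0.7],
}
pages = rich_pages(page_topics, topic=0)
ranking = top_volumes(pages)
```

The selected volumes would then be the candidates for remodeling at sentence level.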
Promising early results …
Scholarly Commons User Support Service
• Develop training materials
• Educational workshops
• Tool and workset creation
• Collaborate with librarians and DH centers at HT institutions
• Assist researchers in HTRC text data mining research projects
• Based at University of Illinois Library
Scholarly Commons User Support
• Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers.
• Physically located in the University of Illinois Library’s Scholarly Commons.
• Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist, who will assist with the development of training and outreach initiatives in support of researchers working with the HathiTrust Research Center and HathiTrust digital library affiliates who seek to start their own HTRC research services.
• Effort involves planning, implementation, and continuous development of training materials, educational workshops, potential tools, and outreach activities in support of the usage of HTRC tools and datasets.
Thanks to sponsors
http://www.hathitrust.org/htrc
http://www.hathitrust.org