Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study in Big Data: the Socio-Technical Issues of HathiTrust Digital Texts
Women’s Institute for Summer Enrichment, Cornell University, Jun 16, 2014
Beth Plale
Professor, School of Informatics and Computing
Director, Data To Insight Center
Indiana University
HATHI TRUST RESEARCH CENTER
• Who are the Players? HathiTrust, Google, Authors Guild
• The Object of Attention: 11 M books from university libraries
• Rulings around copyright
• HTRC, or why I care
• Is security of HTRC Data Capsule good enough?
The Players
Books Digitization Project (2007)
Libraries of U Michigan, U California, Virginia, Wisconsin, Indiana, …
[Diagram labels: digitize; digitized books]
Legal action
Google/Authors Guild
• Mar 2011: A New York federal judge rejected a $125 million legal settlement that Google had worked out with the authors and publishers over the copyright issues.
• Nov 2013: The same judge issued a ruling saying that Google's use of the works was a "fair use" under copyright law.
• June 2014: The 2nd Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) is a major victory for fair use.
Highlights 2014 ruling
• With respect to the full-text database, the court found that although a copy of the entire work is made, the purpose of a full-text searchable database is so different from that of the underlying works that the use must be considered transformative. In fact, the court wrote, "the creation of a full-text searchable database is a quintessentially transformative use".
• The Authors Guild argued that HathiTrust's use of an identical server and two tape back-ups constituted "excessive" copying.
• Thankfully, the court rejected that premise, acknowledging that when it comes to digital technology, an approach that focuses only on individual copies made is insufficient.
(Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014)
Does Authors Guild Represent All Authors?
• The Authors Guild members are overwhelmingly trade-book authors; the books scanned by the Hathi Trust are overwhelmingly scholarly books written as part of an academic tradition that takes free access and sharing as its foundation.
• The Authors Alliance: new organization representing authors who are primarily concerned with being read.
(Cory Doctorow, "Court finds full-book scanning is fair use", 3:00 pm Sat, Jun 14, 2014)
Highlight 2014 Ruling
• Given that consistent fair use record for book digitization, today's ruling might not be totally surprising. Still, the text of the opinion is encouraging, and reflects a court that respects the Constitutional purpose of copyright as a tool to promote the progress of science and the useful arts, not a blunt instrument for rightsholders to regulate all downstream uses.
(Parker Higgins, "Another Fair Use Victory for Book Scanning in HathiTrust", June 10, 2014)
HTRC, or why I care: HathiTrust digital library is “big data”, and text mining is the new library catalog
Similar model, different ends
[Diagram labels: search; $$]
HTRC goes beyond “full text searchable database”
Scholarly search
Scholarly mining
#HTRC @HathiTrust
HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
  – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc
http://www.hathitrust.org
→ Distinguished from
Content of HathiTrust
• Books and journals
  – Plus pilots around images, audio, born-digital
• Digitization sources
  – Google (96.8%, 10,162,104)
  – Internet Archive (2.9%, 301,972)
  – Local (0.3%, 31,840)
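The percentages on this slide can be checked against the volume counts. The short Python sketch below recomputes each source's share from the three counts given above; the variable names are illustrative only, and the total is simply the sum of those three counts, not a figure stated on the slide.

```python
# Volume counts per digitization source, as listed on the slide.
sources = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

# Total is derived here by summing the three counts (an assumption:
# the slide does not state a total explicitly).
total = sum(sources.values())

# Share of each source as a percentage, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in sources.items()}

for name, share in shares.items():
    print(f"{name}: {share}% of {total:,} volumes")
```

Rounded to one decimal place, the recomputed shares agree with the slide's 96.8%, 2.9%, and 0.3%.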
Content Sources
Content distribution
360,000 volumes in Spanish
Motivation for HTRC
→ The HathiTrust repository is massive in scale: a latent goldmine for text-based research
→ The restricted nature of parts of the HathiTrust content suggests a need for new forms of access that preserve the intimate nature of interaction with texts while honoring restrictions on access
→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
HathiTrust Research Center
• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
• HTRC Executive Committee
  – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
  – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
  – Robert McDonald, Indiana University Libraries
  – Beth Namachchivaya Sandore, University of Illinois Library
  – John Unsworth, CIO, Dean of Library, Brandeis University
HTRC system
[Diagram labels: Request; Complexity-hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
Text mining at scale: quick tutorial on topic modeling of texts
Topic Modeling
• Can answer more complex or nuanced questions
  – What are the primary themes of an author?
  – What are the primary themes of a research domain?
  – When did a new topic enter a research domain?
• Provides more data than word counts
  – 100s of topics can be extracted.
  – Underlying data (topics, volume, and page) is available
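To make the idea concrete, here is a minimal sketch of topic modeling in plain Python, using collapsed Gibbs sampling for latent Dirichlet allocation (LDA). The slides do not say which algorithm or toolkit HTRC uses, so the choice of LDA, the function name `fit_lda`, the hyperparameters, and the four toy "pages" are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for word, z in zip(doc, topics):
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample in proportion to (doc likes topic) * (topic likes word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy pages, invented for this sketch (not HathiTrust data).
pages = [
    "letters household house letters family".split(),
    "illustrations volume literature volume edition".split(),
    "house letters household family house".split(),
    "literature illustrations edition volume literature".split(),
]
doc_topic, topic_word = fit_lda(pages, num_topics=2)
# Unary plus drops words whose count fell to zero during sampling.
top_words = [[w for w, _ in (+counts).most_common(3)] for counts in topic_word]
print(top_words)
```

The returned `doc_topic` table is the kind of "underlying data" the slide refers to: for each page it records how many tokens were assigned to each topic, which is more than a word count alone provides.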
Themes for Authors
Two topics with identical centralities (e.g., Dickens) but separate themes
More strongly focused on book (illustrations, volume, literature)
More strongly focused on author himself (letters, household, house)
Ted Underwood, Univ of Illinois
Digging into philosophy of science
Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
Colin Allen, IU
The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply topic modeling on an 86-volume subset
• Using iPy Notebook
• Of the set of topics, choose topic 16 as best
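The selection step above can be sketched as a keyword filter followed by a pruning pass. The slides do not say how the 86-volume subset was chosen, so the title-based exclusion below is an assumption, and the function names and the four toy volume records are hypothetical.

```python
# Keywords taken from the slide; everything else here is illustrative.
KEYWORDS = ["darwin", "romanes", "anthropomorphism", "comparative psychology"]


def matches_keywords(text, keywords=KEYWORDS):
    """Case-insensitive substring match against any keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def select_volumes(volumes, keywords=KEYWORDS, exclude_terms=("course catalog",)):
    """Keep volumes that match a keyword and whose title is not excluded."""
    selected = []
    for volume in volumes:
        searchable = f"{volume['title']} {volume['text']}"
        if not matches_keywords(searchable, keywords):
            continue
        # Assumed pruning rule: drop titles that look like course catalogs.
        if any(term in volume["title"].lower() for term in exclude_terms):
            continue
        selected.append(volume["id"])
    return selected


# Hypothetical records, invented for this sketch.
toy_volumes = [
    {"id": "v1", "title": "Essays on animal minds",
     "text": "Romanes on comparative psychology"},
    {"id": "v2", "title": "College Course Catalog 1901",
     "text": "Lectures on Darwin and comparative psychology"},
    {"id": "v3", "title": "Notes on instinct",
     "text": "A critique of anthropomorphism in animal studies"},
    {"id": "v4", "title": "A history of bridges",
     "text": "Stone arches and iron spans"},
]
subset = select_volumes(toy_volumes)
print(subset)
```

In this toy run the course catalog matches the keywords but is pruned by title, which mirrors the "uninteresting books" problem the slide describes.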
Volumes most similar to topic 16
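One simple way to produce such a list is to rank volumes by how much weight each gives to the chosen topic. The slides do not state which similarity measure was used, so ranking by topic weight is an assumption, and the volume identifiers and weights below are invented for illustration.

```python
def rank_volumes_by_topic(volume_topic_weights, topic_id, top_n=3):
    """Return (volume_id, weight) pairs sorted by weight on one topic."""
    scored = [
        (volume_id, weights.get(topic_id, 0.0))
        for volume_id, weights in volume_topic_weights.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical per-volume topic weights (topic id -> proportion).
toy_weights = {
    "vol-a": {16: 0.62, 3: 0.10},
    "vol-b": {16: 0.05},
    "vol-c": {2: 0.90},
    "vol-d": {16: 0.41},
}
ranking = rank_volumes_by_topic(toy_weights, topic_id=16)
print(ranking)
```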
Copyright: A Reality
Full text download is limited by both size and copyright
HTRC solution to fully flexible text mining research on the entire HT digital repository: HTRC Data Capsule
Funded by Alfred P. Sloan Foundation; in collaboration with Atul Prakash, University of Michigan
Questions driving HTRC Data Capsule
• Non-consumptive use: can framework provide safe handling of large amounts of protected data?
• Openness: can framework support user-contributed analysis without resorting to code walkthroughs prior to acceptance?
• Large-scale and low cost: can protections be extended to utilization of large-scale national (public) computational resources?
HTRC Data Capsules
• Trusts text mining researcher to not deliberately leak repository data
• Prevents malware acting on user’s behalf from leaking data.
• V1.0 limits analysis to running within single VM
HTRC Data Capsule Architectural Components
[Diagram components: Registry Services, worksets; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; Researcher (SSH, research results)]
Upon run, Secure Capsule controls I/O behind the scenes
HTRC Data Capsule interaction
• Researcher requests new VM of type X
• Image instance is created
• Researcher installs tools onto VM through window on her desktop (setup)
• Final location of results is registry (Registry Services, worksets)
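The interaction can be pictured as a small state machine in which output is restricted while an analysis runs. The Python sketch below is a hypothetical model, not HTRC code: the class, stage names, and the rule that only the registry accepts writes during a run are assumptions drawn from the slide's statements that the capsule controls I/O and that results end up in the registry.

```python
from enum import Enum, auto


class Stage(Enum):
    REQUESTED = auto()            # researcher requests a new VM of type X
    INSTANCE_CREATED = auto()     # image instance is created
    TOOLS_INSTALLED = auto()      # researcher installs tools on the VM
    RUNNING = auto()              # analysis runs; capsule controls I/O
    RESULTS_IN_REGISTRY = auto()  # final location of results is the registry


STAGES = list(Stage)  # Enum iteration follows definition order


class CapsuleSession:
    """Hypothetical model of one researcher session (illustration only)."""

    def __init__(self, vm_type):
        self.vm_type = vm_type
        self.stage = Stage.REQUESTED
        self.registry = []

    def advance(self):
        """Move to the next stage; stages cannot be skipped or repeated."""
        position = STAGES.index(self.stage)
        if position + 1 >= len(STAGES):
            raise RuntimeError("session already finished")
        self.stage = STAGES[position + 1]
        return self.stage

    def write(self, destination, payload):
        """Allow output only to the registry, and only while running."""
        if self.stage is not Stage.RUNNING:
            raise RuntimeError("no analysis is running")
        if destination != "registry":
            raise PermissionError(f"write to {destination!r} blocked")
        self.registry.append(payload)


session = CapsuleSession(vm_type="X")
session.advance()  # INSTANCE_CREATED
session.advance()  # TOOLS_INSTALLED
session.advance()  # RUNNING
session.write("registry", {"summary": "toy result"})
try:
    session.write("external-host", "page text")
    leaked = True
except PermissionError:
    leaked = False
session.advance()  # RESULTS_IN_REGISTRY
print(session.stage, leaked)
```

Blocking every destination except the registry is the simplest stand-in for the stated goal of preventing malware acting on the user's behalf from leaking data; a real capsule would enforce this below the VM, not in application code.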
HTRC secure data capsule: view from researcher desktop
Thanks to our sponsors
HTRC goes beyond “full text searchable database”. Security has to be a top concern.
scholarly research