dd 2012 02 29 - Archive


Transcript of dd 2012 02 29 - Archive

Page 1: dd 2012 02 29 - Archive

Digital Humanities at Scale: HathiTrust Research Center

Beth Plale, Co-Director, HathiTrust Research Center

Professor, School of Informatics and Computing, Indiana University

Page 2: dd 2012 02 29 - Archive

New questions require computational access to a large corpus

Investigate the ways in which concepts of philosophy are used in physics through
– Extracting argumentative structure from a large dataset using a mixture of automated and social computing techniques

– Capture evidence for the conjecture that the availability of such analyses will enable innovative interdisciplinary research

– Digging into Data 2012 award, Colin Allen, IU

Page 3: dd 2012 02 29 - Archive

New questions, cont.

Document, through software text-analysis techniques, the appearance, frequency, and context of terms, concepts, and usages related to human rights in a selection of English-language novels.
– Ronnie Lipschutz of UCSC is currently doing this analysis on one of Jane Austen's books. He'd like to extend the work to encompass a far larger corpus.

Page 4: dd 2012 02 29 - Archive

New questions, cont.

Identify all 18th-century published books in the HathiTrust corpus, and apply topic modeling to create consistent overall subject metadata
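
One way to illustrate the topic-modeling step (this sketch is not from the slides; gensim, the token lists, and all parameter values are illustrative assumptions, not the tools named in the talk):

```python
# Minimal sketch, assuming volumes have already been OCR'd and tokenized into
# lists of words; the library (gensim), num_topics, and passes are illustrative
# choices, not values prescribed by the slides.
from gensim import corpora, models

tokenized_volumes = [
    ["philosophy", "natural", "experiment", "observation"],
    ["novel", "society", "rights", "liberty"],
    # ... one token list per 18th-century volume
]

dictionary = corpora.Dictionary(tokenized_volumes)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_volumes]

lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary,
                      num_topics=20, passes=5)

# The topic-word distributions could then seed consistent subject metadata.
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)
```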

Page 5: dd 2012 02 29 - Archive

GOOGLE DIGITAL HUMANITIES AWARDS RECIPIENT

INTERVIEWS REPORT PREPARED FOR THE HATHITRUST RESEARCH CENTER

VIRGIL E. VARVEL JR., ANDREA THOMER

CENTER FOR INFORMATICS RESEARCH IN SCIENCE AND SCHOLARSHIP

UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Fall 2011

Page 6: dd 2012 02 29 - Archive

The study

• Dr. John Unsworth, a representative of HTRC, distributed invitations to participate in this study via email to 22 researchers given Google Digital Humanities Research Awards.

• Interviews were conducted via telephone, Skype®, or face-to-face, and all were audio recorded. All participants agreed to the IRB permission statement via email.

• A semi-structured interview protocol was developed with input from HTRC to elicit responses from participants on the primary goals of the project.

Page 7: dd 2012 02 29 - Archive

Select findings

• Optical Character Recognition
– Steps should be taken to improve OCR quality if and when possible

– Scalability of scanned-image viewing is necessary for OCR reference and correction

– Metadata should expose the quality of the OCR

Page 8: dd 2012 02 29 - Archive

Select findings

• "Would like better metadata about text languages, particularly in multi-text documents and on language by sections within text. Automatic language identification functions would be helpful, but human-created metadata is preferred, particularly for documents with low OCR quality."

• "primary issue was retrieving the bibliographic records in usable form, unparsed by Google. […] process took 10 months to design the queries and get the data."

Page 9: dd 2012 02 29 - Archive

HathiTrust Research Center: dedicated to the provision of computational access to a comprehensive body of published works for scholarship and education

Page 10: dd 2012 02 29 - Archive

• HathiTrust is a large corpus providing an opportunity for new forms of computational investigation.
• The bigger the data, the less able we are to move it to a researcher's desktop machine.
• Future research on large collections will require that computation move to the data, not vice versa.

Page 11: dd 2012 02 29 - Archive

Goal of HTRC

• HTRC will provide a persistent and sustainable structure to enable original and cutting-edge research.

• Stimulate the development in the community of new functionality and tools to enable new discoveries that would not be possible without the HTRC.

Page 12: dd 2012 02 29 - Archive

Goal, cont.

• Leverage data storage and computational infrastructure at Indiana University and UIUC.

• Provision a secure computational and data environment for scholars to perform research using the HathiTrust Digital Library.

• The Center will break new ground, allowing scholars to fully utilize the content of the HathiTrust Library while preventing intellectual property misuse within the confines of current U.S. copyright law.

Page 13: dd 2012 02 29 - Archive


[Diagram: Description of Application Space in HTRC, prepared by Jiaan. Layers from top to bottom: User → Applications (Tag Cloud, Entities Timeline, Text Summarizer, Readability Test, Term Usage/Concept Trace, Sentiment Tracking, Topic Modeling, Advanced Search, Network Graph, Author/document/keyword relationship, Simple Statistic, Classification, Tracking Trend, e.g. tracking a certain topic such as human rights) → Basic Application Units (NLP PoS, NLP Tokenizer, NLP Sentence Detector, NLP Sentence Tokenizer, NLP Named Entity, Token Filter, Text Extractor, Concatenate Text, Naive Bayes, Decision Tree, Latent Semantic Analysis, Search, Metadata Access, Semantic Relation Metadata) → Basic Operations (Open, Read, Seek, Close) → File System API.]

• Analysis on 10,000,000+ volumes of the HathiTrust digital repository
• Founded 2011
• Working with OCR
• Large-scale data storage and access
• HPC and Cloud

Type of data (public domain and copyrighted works); estimated initial size: 300-500 TB

Solr indexes                 36 TB (3 indexes)
File system rsync            12 TB
Fast volume access store     30 TB
Versions of collection (5)   120 TB
Volume store indexes         100 TB

HathiTrust Research Center


Page 14: dd 2012 02 29 - Archive


Corpus Usage Patterns

[Diagram: example volumes (chapter openings, numbered pages, tables of contents) illustrating three access patterns: access by chapter, access by page, and access by special contents (table of contents, index, glossary).]


Page 15: dd 2012 02 29 - Archive

HTRC Timeline

• Phase I: an 18-month development cycle
– Began 01 July 2011

– Demo of capability June 2012 (12-month mark)

• Phase II: broad availability of the resource, begins 01 January 2013

Page 16: dd 2012 02 29 - Archive

Governance

• HTRC Executive Management Team: Beth Plale (IU), chair; Robert McDonald (IU), Marshall Scott Poole (UIUC), J. Stephen Downie (UIUC), John Unsworth (Brandeis Univ)

• Advisory board
• MOUs guide the IU-UIUC interaction and the HTRC-HT interaction

• Laine Farley (California Digital Library), of the HT Executive Committee, is liaison to HTRC

• Google Public Domain agreement – in the process of signing (IU and UIUC executing individually)

Page 17: dd 2012 02 29 - Archive


What is it architecturally?

• Web services architecture and protocols

• Registry of services and algorithms

• Solr full-text indexes

• noSQL store as the volume store

• Large-scale computing

• OpenID authentication

• Portal front end, programmatic access

• SEASR mining algorithms

Page 18: dd 2012 02 29 - Archive


[Architecture diagram. Front ends: web portal, desktop SEASR client, and programmatic access (e.g., Blacklight) through the HTRC Data API v0.1. Middle tier: agent framework with agent instances, WSO2 registry (services, collections, data capsule images), SEASR analytics service, Meandre orchestration, task deployment, CI logon (NCSA), and access control (e.g., Grouper). Data layer: Solr index, volume store (Cassandra), page/volume tree (file system), and the HathiTrust corpus mirrored from the University of Michigan via rsync. Compute back ends: FutureGrid, NCSA local and HPC resources, Penguin on Demand, and non-consumptive data capsules.]

Page 19: dd 2012 02 29 - Archive

One access point: through SEASR

Page 20: dd 2012 02 29 - Archive


Query = title:Abraham Lincoln AND publishDate:1920


SEASR: workflow used to generate a tag cloud

Page 21: dd 2012 02 29 - Archive

Query = author:Withers, Hartley

Page 22: dd 2012 02 29 - Archive

The workflow invokes the HTRC Solr index and the HTRC Data API.

Page 23: dd 2012 02 29 - Archive

HTRC Solr index

• The Solr Data API 0.1 test version is available. It:
– Preserves all query syntax of the original Solr,

– Prevents the user from making modifications,

– Hides the host machine and port number that HTRC Solr is actually running on,

– Creates an audit log of requests, and

– Provides a filtered term vector for words starting with a user-specified letter

• A test version of the service will soon be available

Page 24: dd 2012 02 29 - Archive

HTRC Solr index: most-used API patterns

• ===== query
• http://coffeetree.cs.indiana.edu:9994/solr/select/?q=ocr:war
• ===== faceted search (see the worked example below)
• http://coffeetree.cs.indiana.edu:9994/solr/select/?q=*:*&facet=on&facet.field=genre
• ===== get frequency and offset of words starting with a letter
• http://coffeetree.cs.indiana.edu:9994/solr/getfreqoffset/inu.32000011575976/w
• ===== banned modification request:
• http://localhost:8983/solr/update?stream.body=<delete><query>id:298253</query></delete>&commit=true
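
As a worked illustration (not part of the original slides) of the faceted-search pattern above, the sketch below issues that query over HTTP and prints the genre facet counts. It assumes the coffeetree endpoint is reachable and passes through standard Solr parameters such as wt=json and rows.

```python
# Minimal sketch: issue the faceted-search pattern above against the HTRC Solr
# proxy and print facet counts. Assumes the endpoint shown on this slide is
# reachable and returns a standard Solr JSON response (wt=json).
import json
import urllib.parse
import urllib.request

BASE = "http://coffeetree.cs.indiana.edu:9994/solr/select/"

params = {
    "q": "*:*",
    "facet": "on",
    "facet.field": "genre",
    "wt": "json",   # ask Solr for JSON instead of the default XML
    "rows": "0",    # facet counts only, no documents
}
url = BASE + "?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Solr returns facet counts as a flat [value, count, value, count, ...] list.
counts = data["facet_counts"]["facet_fields"]["genre"]
for value, count in zip(counts[::2], counts[1::2]):
    print(f"{value}\t{count}")
```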

Page 25: dd 2012 02 29 - Archive

Requirement: run big jobs on large-scale (free or nearly free) compute resources

Page 26: dd 2012 02 29 - Archive


[Diagram: Data Capsules VM cluster. Scholars connect over Remote Desktop or VNC to a secure VM (a secure data capsule), submit secure capsule map/reduce Data Capsule images to the FutureGrid computation cloud, and receive and review results; the capsules draw on the HTRC volume store and index.]

Page 27: dd 2012 02 29 - Archive


HTRC block store

[Diagram: an example experiment over 5M volumes (4 TB of uncompressed data), with a block abstraction at volume size or larger. A user buffer holds secondary data the user submits for use in the computation. A MapReduce head node partitions the 4 TB evenly amongst 1000 map nodes; it is trusted because it runs as user HTRC. The user buffer is copied to each map node, so each map node receives a 4 GB chunk plus the user buffer (read only). Each map node is the single source of writes to the reduce step, at less than 1 GB per map node. Map nodes run as Data Capsule nodes on HTRC FutureGrid, with provenance capture through the Karma provenance tool.]
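
To make the partitioning arithmetic above concrete, here is an illustrative sketch (not HTRC code): 4 TB split evenly across 1000 map nodes is roughly 4 GB per node, each map task returns a small partial result, and a single reduce step merges them. The volume lists, the term-counting map function, and the user buffer contents are hypothetical.

```python
# Illustrative sketch only (not HTRC code): the partitioning scheme described
# above, with 4 TB of volume data split evenly across 1000 map nodes and each
# map task producing a small (< 1 GB) result that a single reduce step merges.
from collections import Counter

TOTAL_BYTES = 4 * 1024**4                      # 4 TB uncompressed corpus
NUM_MAP_NODES = 1000
CHUNK_BYTES = TOTAL_BYTES // NUM_MAP_NODES     # ~4 GB per map node

def partition(volumes, num_nodes=NUM_MAP_NODES):
    """Assign whole volumes to map nodes round-robin (block >= volume size)."""
    chunks = [[] for _ in range(num_nodes)]
    for i, vol in enumerate(volumes):
        chunks[i % num_nodes].append(vol)
    return chunks

def map_task(volume_texts, user_buffer):
    """Runs inside one data capsule node: e.g., count terms of interest."""
    counts = Counter()
    for text in volume_texts:
        for word in text.split():
            if word in user_buffer:            # user-supplied term list
                counts[word] += 1
    return counts                              # small result, well under 1 GB

def reduce_task(partial_counts):
    """Single reduce step: merge the per-node counts."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total
```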

Page 28: dd 2012 02 29 - Archive


Page 29: dd 2012 02 29 - Archive


Page 30: dd 2012 02 29 - Archive

Creating Functionality around Non-consumptive Research

Key notions

Page 31: dd 2012 02 29 - Archive

Notion of a Proxy

• "Researcher" in the Google Book Settlement definition means a human

• An avatar is a virtual life form acting on behalf of a person

• A computer program acts on behalf of a person
• A computer program (i.e., a proxy) must be able to read the Google texts; otherwise computational analysis is impossible to carry out.

• So non-consumption applied to books applies to human consumption.

Page 32: dd 2012 02 29 - Archive

Non-consumptive research – implementation definition

• No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the collection of copyrighted works to reassemble pages from the collection.

• The definition disallows collusion between users, or accumulation of material over time. It differentiates the human researcher from the proxy, which is not a user. Users are human beings.

Page 33: dd 2012 02 29 - Archive

Notion of Algorithm

• Computational analysis is accomplished through algorithms
– An algorithm carries out one coherent analysis task: sort a list of words, compute word frequency for a text (see the sketch after this list)

• A researcher's computational analysis often requires running a sequence of algorithms. An important distinction for implementing non-consumptive research is "who owns the algorithm?"
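
A minimal sketch (not from the slides) of the kind of single-task algorithm named above: computing word frequency for one text. The tokenization rule is an illustrative choice.

```python
# Minimal sketch of a single-task analysis algorithm: word frequency for a text.
import re
from collections import Counter

def word_frequency(text: str) -> Counter:
    """Lowercase, tokenize on letter runs, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Example: the ten most frequent words in a volume's OCR text.
# print(word_frequency(open("volume.txt").read()).most_common(10))
```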

Page 34: dd 2012 02 29 - Archive

Infrastructure for computational analysis

• When needing to support computation over a 10+ million volume corpus, algorithms must be co-located with the data.

• That is, algorithms must be located where the repository is located, and not on the user's desktop.

• When the computational analysis is to be non-consumptive, there is likely one location for the data.

Page 35: dd 2012 02 29 - Archive

Who owns the algorithm?

• HTRC owns the algorithms:
– uses the Software Environment for the Advancement of Scholarly Research (SEASR) suite of algorithms

– we are examining the security requirements of users, algorithms, and data

Page 36: dd 2012 02 29 - Archive

User owns and submits their algorithms

• HTRC recently received funding from the Alfred P. Sloan Foundation to prototype a "data capsule framework" that provisions for non-consumptive research.

• Founded on the principle of "trust but verify": the informatics-savvy humanities scholar is given freedom to experiment with new algorithms on protected information, but technological mechanisms are in place to prevent undesirable behavior (leakage).

Page 37: dd 2012 02 29 - Archive

Non-consumptive, user-owned algorithms infrastructure; requirements:

• Implements non-consumptive research

• Openness – users are not limited to using a known set of algorithms

• Efficiency – it is not possible to analyze algorithms for conformance prior to running them

• Low cost and scale – runs at large scale and at low cost to the scholarly community of users

• Long-term value – adoption for other purposes

Page 38: dd 2012 02 29 - Archive


[Diagram repeated from Page 13: Description of Application Space in HTRC.]

Categories of algorithms. Can fair use be determined based on the categorization of an algorithm? Or is all computational use fair use?


Page 39: dd 2012 02 29 - Archive

Are algorithm results fair use?

• Center supplied
– Easier, because we know the category of the algorithm

• User supplied
– HTRC is not examining the code, so this is an open question

Page 40: dd 2012 02 29 - Archive

Parting philosophy

• Finally, the results of computational research that conforms to the restrictions of non-consumptive research must belong to the researcher

Page 41: dd 2012 02 29 - Archive

How to Engage

• Building partnerships with researchers and research communities is a key goal of the HathiTrust Research Center

• HTRC can give technical advice to researchers as they look for funding opportunities involving access to research data

• Upcoming "Fix the OCR and Metadata Shortage Community Challenge": help us address a couple of key weaknesses of the HT corpus