Plale HathiTrust El Colegio de Mexico May2014

HathiTrust and HTRC: the changing Digital Library. El Colegio de Mexico | 20.May.14. Beth Plale, @bplale. Professor, School of Informatics and Computing; Director, HathiTrust Research Center, Indiana University. Tweet us @HathiTrust #HTRC. HATHI TRUST RESEARCH CENTER

description

HathiTrust digital library and analytics with HTRC. The data, the uses.

Transcript of Plale HathiTrust El Colegio de Mexico May2014

Page 1: Plale HathiTrust El Colegio de Mexico May2014

HathiTrust and HTRC: the changing Digital Library

El Colegio de Mexico | 20.May.14

Beth Plale, @bplale
Professor, School of Informatics and Computing
Director, HathiTrust Research Center, Indiana University

Tweet us - @HathiTrust #HTRC

HATHI TRUST RESEARCH CENTER

Page 2: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

HathiTrust  

• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia

http://www.hathitrust.org/htrc  

http://www.hathitrust.org  

→ Distinguished from

Page 3: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Page 4: Plale HathiTrust El Colegio de Mexico May2014

Take a look at details of the HathiTrust collection

Page 5: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Content  

• Books and journals
– Pilots around images, audio, born-digital

• Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
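The percentages on this slide follow directly from the volume counts. A quick check, assuming the three counts listed are the whole collection at the time of the talk (the function name `source_shares` is illustrative):

```python
# Volume counts per digitization source, as listed on the slide.
volumes_by_source = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}


def source_shares(counts):
    """Return each source's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = source_shares(volumes_by_source)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The three counts sum to 10,495,916 volumes, and the computed shares agree with the 96.8%, 2.9%, and 0.3% shown on the slide.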

Page 6: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Content  Sources  

Page 7: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Content  Package  

Page 8: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Metadata  

• Bibliographic
• Structural
• Rights
• Administrative (preservation)
• Holdings

Page 9: Plale HathiTrust El Colegio de Mexico May2014

HathiTrust Repository Organization

Page 10: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

HathiTrust Repository Organization

Page 11: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

File  System  

Page 12: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Content distribution

Page 13: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Content distribution

Not  public  domain  outside  available  

Page 14: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

→ The HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpora through computational tools, and time-based analysis

→ The restricted nature of HT content suggests a need for new forms of access that preserve the intimate nature of research investigation while honoring restrictions

→ Paradigm: computation moves to the data (not vice versa)

Page 15: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Mission of HT Research Center

• Research arm of HathiTrust
• Goal: enable researchers worldwide to carry out computational investigation of the HT repository by:
– Developing a model for access: the 'workset'
– Developing tools that facilitate research by the digital humanities and informatics communities
– Developing secure cyberinfrastructure that allows computational investigation of the entire copyrighted and public domain HathiTrust repository

• Established: July 2011
• Collaborative effort of Indiana University and University of Illinois

   

Page 16: Plale HathiTrust El Colegio de Mexico May2014

HTRC  system    

Complexity  hiding  interface  

The  complexity  

Tabular  info  

Statistical plots

Spatial plots

Request  

Page 17: Plale HathiTrust El Colegio de Mexico May2014

   

Complexity hiding interface

   

Page 18: Plale HathiTrust El Colegio de Mexico May2014

Workset  builder  

Page 19: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

HTRC Timeline

• Phase I: development, 01 Jul 2011 – 31 Mar 2013
– HTRC software and services release v1.0: http://sourceforge.net/p/htrc/code/

• Phase II: outreach, 01 Apr 2013 – present
– 2nd HTRC UnCamp, Sep '13

Attendees of UnCamp'13

Page 20: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Access to copyrighted materials: HTRC Data Capsule

A secure computing framework that:
• Trusts that the researcher will not deliberately leak repository data, but
• Prevents malware acting on the user's behalf from leaking data.

Enforces:
• Non-consumptive use: framework provides safe handling of large volumes of protected data
• Openness: framework supports user-contributed analysis tools (that is, does not limit uses to a known set of algorithms)
• Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance
• Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers

Page 21: Plale HathiTrust El Colegio de Mexico May2014

HTRC Secure Capsule Architectural Components

[Diagram. Components: VM Image Manager, VM Image Store, VM Image Builder, VM Manager, VM instance, Secure Capsule cluster, Registry (services, worksets). Labels on the researcher side: Researcher, SSH, Research results.]

Page 22: Plale HathiTrust El Colegio de Mexico May2014

HTRC Secure Capsule Architecture

[Diagram. Components: VM Image Manager, VM Image Store, VM Image Builder, VM Manager, VM instance, Registry (services, worksets). Labels on the researcher side: Researcher, SSH, Research results.]

Annotations on the diagram (callouts are numbered 1 to 4 on the slide):
• Researcher requests new VM of type X
• Image instance is created
• Researcher installs tools onto the VM through a window on her desktop
• Upon run, Secure Capsule controls I/O behind the scenes
• Final location of results is the registry

Page 23: Plale HathiTrust El Colegio de Mexico May2014


HTRC  secure  data  capsule:  view  from  researcher  desktop  

Page 24: Plale HathiTrust El Colegio de Mexico May2014

EXAMPLES OF RESEARCH CARRIED OUT THROUGH HATHI TRUST RESEARCH CENTER

• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts

Page 25: Plale HathiTrust El Colegio de Mexico May2014

GENDER  IDENTIFICATION  OF  HTRC  AUTHORS  BY  NAMES    

Stacy  Kowalczyk,  Asst.  Professor,  Dominican  University  Zong  Peng,  HTRC,  Indiana  University  

Ref: talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013

Page 26: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Gender Identification of Text

• Question investigated: Can we use author names in bibliographic records to identify gender?

• 2.6 million bibliographic records
– Extracted personal author data
– MARC fields 100 and 700, subfields a, b, c, d

• 606,437 unique personal author strings
• Bibliographic data is not fielded the way patent names are
• Relying on standard cataloging practice
– Last name, first name middle name, titles/honorifics, dates
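Because the author data arrives as one unfielded string, the cataloging pattern above has to be taken apart in code. A minimal sketch, assuming a comma-separated string in the "last name, first name middle name, titles/honorifics, dates" order; the `HONORIFICS` set is a small illustrative subset and `parse_author` is a hypothetical helper, not the project's code. The example string is taken from the name list later in the talk.

```python
import re

# Small illustrative subset of titles; the project's actual list is not given in the slides.
HONORIFICS = {"sir", "bart", "baron", "count", "duke", "lady", "mrs", "miss"}

# A birth year followed by a hyphen and an optional death year, e.g. "1856-1924" or "1856-".
DATES = re.compile(r"\b\d{4}-(?:\d{4})?")


def parse_author(raw):
    """Split 'Last, First Middle, titles, dates' into labeled parts."""
    dates_match = DATES.search(raw)
    dates = dates_match.group(0) if dates_match else None
    text = DATES.sub("", raw)
    parts = [p.strip() for p in text.split(",") if p.strip()]
    surname = parts[0] if parts else ""
    given, titles = [], []
    # Everything after the surname is either a title token or a given-name token.
    for part in parts[1:]:
        for token in part.split():
            if token.rstrip(".").lower() in HONORIFICS:
                titles.append(token.rstrip("."))
            else:
                given.append(token)
    return {"surname": surname, "given": given, "titles": titles, "dates": dates}


print(parse_author("Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924"))
```

The sketch only works when the string follows the cataloging convention; strings that omit the comma after the surname would need extra handling.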

Page 27: Plale HathiTrust El Colegio de Mexico May2014

Why interesting to HTRC?

Introduces a new source of metadata, from sources with varying authority

Raises questions:
1) How should community-contributed metadata be distinguished from more authoritative sources?
2) How should variability of quality, even within a single contribution, be conveyed to the community?

Page 28: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Authors vs Names

• Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856-
• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
• Methuen, Algernon, 1856-1924
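The list shows why the number of unique name strings overstates the number of authors. As a rough illustration (not the method used in the study), the sketch below reduces each string to a crude matching key by lowercasing, dropping digits and punctuation, and removing title tokens. The `TITLE_TOKENS` set and `match_key` function are assumptions made for this example.

```python
import re

# The ten name strings from the slide, all referring to one author.
NAME_STRINGS = [
    "Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924",
    "Methuem, Algernon",
    "Methuen Algernon",
    "Methuen Marshall, Sir, bart., 1856-",
    "Methuen, A. Sir, 1856-1924",
    "Methuen, A. Sir, bart., 1856-1924",
    "Methuen Marshall, Sir bart 1856-1924",
    "Methuen, Algernon Methuen Marshall, Sir, 1856-1924",
    "Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924",
    "Methuen, Algernon, 1856-1924",
]

# Title tokens to ignore when matching (assumed; just enough for this example).
TITLE_TOKENS = {"sir", "bart"}


def match_key(raw):
    """Lowercase, keep letters only, and drop title tokens."""
    letters_only = re.sub(r"[^a-z\s]", " ", raw.lower())
    tokens = [t for t in letters_only.split() if t not in TITLE_TOKENS]
    return " ".join(tokens)


keys = {match_key(s) for s in NAME_STRINGS}
print(len(NAME_STRINGS), "strings ->", len(keys), "keys")
for key in sorted(keys):
    print(" ", key)
```

Under these assumptions the ten strings collapse to five keys. The misspelling "Methuem", the initial-only form, and the variants with different given names still stay apart, so simple normalization narrows the problem without solving it.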

Page 29: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Sources of Data

• The Virtual International Authority File (VIAF)
– Hosted by OCLC
• Harvested names from multiple data sources
– Census bureau
– Baby name sites
• EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
– Developed an extensive list of European names
• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc.
– Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
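A lookup over these sources could be sketched as follows. The tables here are tiny made-up stand-ins for the harvested lists, and checking titles before first names is an assumption; the slides do not state the order in which the sources were consulted. The four return values mirror the categories reported in the results.

```python
# Hypothetical, tiny lookup tables for illustration only.
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FIRST_NAME_GENDER = {
    "algernon": "male",
    "john": "male",
    "mary": "female",
    "leslie": "ambiguous",  # appears on both male and female name lists
}


def guess_gender(given_names, titles):
    """Return 'female', 'male', 'ambiguous', or 'unknown' for one author."""
    # Titles and honorifics are checked first (an assumed precedence).
    for title in titles:
        t = title.rstrip(".").lower()
        if t in FEMALE_TITLES:
            return "female"
        if t in MALE_TITLES:
            return "male"
    # Otherwise fall back to the first given name that the name list knows.
    for name in given_names:
        gender = FIRST_NAME_GENDER.get(name.rstrip(".").lower())
        if gender:
            return gender
    return "unknown"


print(guess_gender(["Algernon"], ["Sir"]))
print(guess_gender(["A."], []))
```

An initial-only name with no title falls through to "unknown", which is one plausible source of the unidentified strings.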

Page 30: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Initial Gender Results

• Approximately 80% of name strings have initial gender identification
– Female: 59,365 (10%)
– Male: 425,994 (70%)
– Unknown: 114,204 (19%)
– Ambiguous: 5,965 (less than 1%)
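The percentages can be checked against the counts. In the sketch below the denominator is the sum of the four categories, which comes to 605,528. That is slightly below the 606,437 unique strings reported earlier in the talk, and the slides do not explain the difference.

```python
# Counts per category, as listed on the slide.
gender_counts = {
    "female": 59_365,
    "male": 425_994,
    "unknown": 114_204,
    "ambiguous": 5_965,
}

total = sum(gender_counts.values())
shares = {label: 100 * n / total for label, n in gender_counts.items()}
identified = shares["female"] + shares["male"]

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
print(f"female + male: {identified:.1f}% of {total:,} strings")
```

Rounded to whole percentages, the shares match the slide: about 10% female, 70% male, 19% unknown, and under 1% ambiguous, with female plus male giving the "approximately 80%" headline figure.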

Page 31: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

Results by Data Source

Against the whole set of name strings
• VIAF: 19% hit rate
• Web Names: 54% hit rate
• Patents Names: 8%

Page 32: Plale HathiTrust El Colegio de Mexico May2014

Colin Allen, Jamie Murdock
Cognitive Science, Indiana University

Ref: talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013

Page 33: Plale HathiTrust El Colegio de Mexico May2014

The InPhO project is instructive because it demonstrates an interaction sequence between a researcher and his/her corpus that is nuanced, multistep, and multi-modal.

The HTRC cyberinfrastructure must be able to handle such a nuanced form of interaction between a researcher and their texts.

Page 34: Plale HathiTrust El Colegio de Mexico May2014

Digging into philosophy of science

• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts

• Use topic modeling to identify the volumes, and pages within these volumes, that are "rich" in a chosen topic

• Use a semi-formal discourse analysis technique to identify key arguments in selected pages, to incrementally expose and represent argument structures

Page 35: Plale HathiTrust El Colegio de Mexico May2014

The  How  

• 1315 volumes from HTRC selected using keyword search for 'darwin', 'romanes', 'anthropomorphism', and 'comparative psychology'

• Set contains lots of uninteresting books, e.g., college course catalogs

• Apply LDA on an 86-volume subset
• Using IPython Notebook

Page 36: Plale HathiTrust El Colegio de Mexico May2014

LDA  topic  modeling  

• LDA (Latent Dirichlet Allocation) uses a Bayesian updating method to generate a set of "topics": probability distributions over the set of terms in a corpus

• Number of topics is a parameter in the modeling technique

• Method finds the set of topics that is best able to reproduce the term distributions in documents belonging to the corpus

• Documents may be whole volumes, chapters, articles, single pages, even individual sentences; it is the modeler's choice
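To make the points above concrete, here is a small pure-Python LDA fit. The slides do not say which inference method or library the project used, so this sketch uses collapsed Gibbs sampling, one common way to fit LDA. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (theta, phi, vocab): per-document
    topic distributions, per-topic term distributions, and the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # times term v is assigned to topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k overall
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + V * beta) for n in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab


# Toy corpus: two "documents" about animal minds, two resembling course catalogs.
docs = [
    "animal mind instinct animal reason".split(),
    "instinct animal mind habit".split(),
    "course catalog credit semester course".split(),
    "semester credit course tuition".split(),
]
theta, phi, vocab = lda_gibbs(docs, num_topics=2)
for k, dist in enumerate(phi):
    top_terms = sorted(zip(dist, vocab), reverse=True)[:3]
    print(f"topic {k}:", [term for _, term in top_terms])
```

Here `num_topics` is the parameter the slide refers to, each row of `phi` is one topic's distribution over terms, and each row of `theta` gives a document's mix of topics. What counts as a document (volume, page, or sentence) is decided entirely by how `docs` is built.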

Page 37: Plale HathiTrust El Colegio de Mexico May2014

Volume level topic modeling on 'anthropomorphism' yields set of topics

Page 38: Plale HathiTrust El Colegio de Mexico May2014

... Of the set of topics, choose '16' as best

Page 39: Plale HathiTrust El Colegio de Mexico May2014

Volumes  most  similar  to  topic  16  

Page 40: Plale HathiTrust El Colegio de Mexico May2014

Repeat  LDA  at  page  level  

Page 41: Plale HathiTrust El Colegio de Mexico May2014

Topic  model  at  page  level  for  topics  anthropomorphism,  animal,  and  psychology  

Page 42: Plale HathiTrust El Colegio de Mexico May2014

Words  sorted  by  similarity  

Page 43: Plale HathiTrust El Colegio de Mexico May2014

Pick  top  3:  topics  16,  10,  26  
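The slides do not say how topics were scored against the terms of interest. One simple possibility, shown here purely as an assumption, is to rank topics by the probability mass they place on the query terms. The topic ids and probabilities below are invented toy values, not results from the study.

```python
def top_topics(topic_word, query_terms, n=3):
    """Rank topic ids by total probability mass on the query terms."""
    scores = {
        topic_id: sum(dist.get(term, 0.0) for term in query_terms)
        for topic_id, dist in topic_word.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Invented toy topic-term probabilities (only a few terms per topic shown).
toy_topic_word = {
    0: {"anthropomorphism": 0.04, "animal": 0.09, "psychology": 0.05},
    1: {"animal": 0.08, "psychology": 0.03},
    2: {"animal": 0.05, "psychology": 0.02},
    3: {"course": 0.10, "credit": 0.07},
    4: {"animal": 0.01, "species": 0.06},
}

query = ["anthropomorphism", "animal", "psychology"]
print(top_topics(toy_topic_word, query))
```

A distance between topic distributions would be an equally reasonable scoring choice; the ranking step stays the same either way.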

Page 44: Plale HathiTrust El Colegio de Mexico May2014

Show  documents  of  topics  10,  16,  26  

Page 45: Plale HathiTrust El Colegio de Mexico May2014

Drop  to  sentence  level  

• Select three books with highest aggregate of 20-40 topic-relevant pages for more precise analysis

• Manually augment argument analysis
– Remodeling of three volumes at sentence level
– Training other methods using human analysis plus sentence similarity
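The volume selection step can be sketched as counting topic-relevant pages per volume and keeping the top three. The relevance threshold, the volume ids, and the page scores below are hypothetical; the slides do not state how a page was judged topic-relevant.

```python
def select_volumes(page_scores, threshold=0.5, n=3):
    """Return the n volumes with the most pages scoring at or above threshold.

    page_scores maps a volume id to a list of per-page topic weights.
    Ties are broken by volume id so the result is deterministic.
    """
    relevant = {
        vol: sum(1 for score in scores if score >= threshold)
        for vol, scores in page_scores.items()
    }
    ranked = sorted(relevant, key=lambda vol: (-relevant[vol], vol))
    return [(vol, relevant[vol]) for vol in ranked[:n]]


# Hypothetical per-page weights for the chosen topic.
toy_page_scores = {
    "vol-a": [0.9, 0.7, 0.1, 0.6],
    "vol-b": [0.2, 0.8],
    "vol-c": [0.1, 0.0],
    "vol-d": [0.6, 0.6, 0.7],
}

print(select_volumes(toy_page_scores))
```

The selected volumes would then be re-modeled with each sentence treated as a document, as the second bullet describes.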

Page 46: Plale HathiTrust El Colegio de Mexico May2014

Promising  early  results  …  

Page 47: Plale HathiTrust El Colegio de Mexico May2014

Scholarly Commons User Support Service

• Develop training materials
• Educational workshops
• Tool and workset creation
• Collaborate with librarians and DH centers at HT institutions
• Assist researchers in HTRC text data mining research projects
• Based at University of Illinois Library

 


Page 48: Plale HathiTrust El Colegio de Mexico May2014

Scholarly Commons User Support

• Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers.

• Physically located in the University of Illinois Library's Scholarly Commons.

• Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist, who will assist with the development of training and outreach initiatives in support of researchers working with the HathiTrust Research Center and HathiTrust digital library affiliates who seek to start their own HTRC research services.

• Effort involves planning, implementation, and continuous development of training materials, educational workshops, potential tools, and outreach activities in support of the usage of HTRC tools and datasets.

Page 49: Plale HathiTrust El Colegio de Mexico May2014

Thanks  to  sponsors  

Page 50: Plale HathiTrust El Colegio de Mexico May2014

#HTRC    @HathiTrust  

http://www.hathitrust.org/htrc  

http://www.hathitrust.org