Edanz Journal Selector, A Prototype based on Solr/Nutch/Hadoop: Presented by Liang Shen, European...

Post on 12-Jul-2015


Edanz Journal Selector: A Prototype Based on Solr/Nutch/Hadoop

Liang SHEN

@shenzhuxi
Web Developer, European Bioinformatics Institute
Drupal/Solr

Edanz Journal Selector (2011)

So many journals!

DEMO

Open Access

•  By National Center for Biotechnology Information, U.S. National Library of Medicine
•  Approximately 26,000 records are included in the PubMed journal lists

Feeds: JournalTOCs
•  21,498 journals from 1,677 publishers
•  Institute for Computer Based Learning
•  Heriot-Watt University

Springer
•  Springer Metadata API
    •  Provides metadata for over 5 million online documents
•  Springer Open Access API
    •  Provides metadata, full-text content, and images for over 80,000 open access articles
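As a rough illustration of how journal metadata might be pulled from the Springer Metadata API, the sketch below only builds a request URL and makes no network call. The endpoint, the `journal:` constraint, and the `p`/`s`/`api_key` parameter names are assumptions based on the public API as commonly documented, not details from the slides; `build_metadata_query` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Springer Metadata API (JSON output); not from the slides.
SPRINGER_METADATA_URL = "http://api.springer.com/metadata/json"


def build_metadata_query(journal_title, api_key, page_size=10, start=1):
    """Build a metadata query URL for one journal (hypothetical helper)."""
    params = {
        "q": f'journal:"{journal_title}"',  # assumed constraint syntax
        "p": page_size,                      # assumed: results per page
        "s": start,                          # assumed: 1-based start index
        "api_key": api_key,                  # placeholder key supplied by the caller
    }
    return f"{SPRINGER_METADATA_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_metadata_query("Genome Biology", "YOUR_KEY"))
```

The returned URL could then be fetched with any HTTP client and the JSON records written to storage for indexing.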

Open Source Stack

•  Infrastructure: Amazon Web Service
•  Data processing: Hadoop/Hive
•  Index: Solr/Lucene
•  Web service: Drupal
•  Piwik

[Architecture diagram: Feeds, API, Web, HDFS, Index]

Springer Journal Selector

Chinese

Japanese

Scalability
•  Shards
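A minimal sketch of what a sharded query can look like, assuming Solr's classic distributed search, where the `shards` request parameter lists the cores to fan out to. The host names, the `journals` core, and the `abstract` field are hypothetical; the slides do not describe the actual shard layout.

```python
from urllib.parse import urlencode


def build_sharded_query(coordinator, shards, query, rows=10):
    """Build a Solr /select URL that fans the query out to several shards.

    coordinator and each shard are 'host:port/solr/core' strings (hypothetical).
    """
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(shards),  # Solr merges results from every listed shard
    }
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return f"http://{coordinator}/select?{urlencode(params, safe=':/,')}"


if __name__ == "__main__":
    shard_list = ["solr1:8983/solr/journals", "solr2:8983/solr/journals"]
    print(build_sharded_query(shard_list[0], shard_list, "abstract:kinase"))
```

Any node can act as the coordinator; it forwards the query to each shard and merges the ranked results.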

Internet vs. Intranet

Re-think after 3 years

Don't use Hadoop for less than 5 TB of data

Thanks! Liang Shen