SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite:...
![Page 1: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/1.jpg)
SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital
Libraries by Crawling the Web
Pradeep Teregowda*, Isaac Councill#, Juan Fernandez*, Shuyi Zheng*, Madian Khabsa*, C. Lee Giles*
* Pennsylvania State University
![Page 2: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/2.jpg)
SeerSuite
A framework for building digital libraries.
Reliable – around the clock service with minimal downtime
Robust – continue providing services, even while some components are constrained.
Scalable – support increasing user requests and documents.
Flexible (modular), Portable (across operating systems).
Features
Automatic acquisition of new documents by focused web crawling
Full text indexing
Autonomous citation indexing, linking documents through citations.
Automatic metadata extraction for each document.
MyCiteSeer for personalization.
New features in development, e.g.
Table extraction and search
Algorithm extraction and search
![Page 3: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/3.jpg)
Outline
Evolution – A brief discussion of history, features, advances.
Architecture – Description of the components and modules of SeerSuite.
Workflow – The steps in adding documents.
Deployment – SeerSuite as CiteSeerx: deployment, interface, federation and usage.
![Page 4: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/4.jpg)
Digital Libraries
Digital libraries (DLs) continue to grow and be used
Cyberinfrastructure for scientists and academics
Google Scholar is very popular & to some invaluable
Publisher collections – ACM portal, Scopus, etc.
Library of Congress (NDLP)
Document acquisition
Author submissions – RePEc (economics), arXiv (physics)
Web harvesting (crawler based) – CiteSeerX (mostly computer science) crawls author homepages, not publishers
Google Scholar – considerable data acquired from publishers.
![Page 5: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/5.jpg)
SeerSuite Architecture
Web Application (View, Controllers)
Data Storage (Index, Database, Repository)
Metadata Extraction (Extraction, Ingestion, DOI)
![Page 6: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/6.jpg)
Architecture Details
Web Applications
Built using the Java Spring framework,
JSP and JavaScript (Dojo, MooTools) for presentation.
Servlets/Controllers
Data Storage
Repository (files)
Index (fast search)
Database (graph, metadata)
Extraction and Ingestion
PDF to text conversion (PDFBox, TET).
Converted documents are filtered.
![Page 7: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/7.jpg)
Architecture Details
Extraction and Ingestion
Support Vector Machines (SVM) for document metadata,
Conditional Random Fields (CRF) for citation extraction.
DOI – Unique internal identification of documents
Crawler
Heritrix with a Java Message Service based system over ActiveMQ.
Maintenance
Keep the graph, index, services and external links updated.
![Page 8: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/8.jpg)
Workflow
![Page 9: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/9.jpg)
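Focused Crawling
[Diagram: a seed (www.psu.edu) and user submissions feed the focused crawler, which fetches relevant pages such as giles.ist.psu.edu/publications and leaves pages such as http://uninterestingplace.edu not visited; the output is labelled Crawl-M.]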
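The focused crawling step can be illustrated with a small sketch. This is not Heritrix or the SeerSuite crawler: the relevance rule (keep `.edu` hosts, skip a blocked host), the function names and the `fetch_links` callback are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def is_relevant(url, allowed_suffixes=(".edu",),
                blocked_hosts=("uninterestingplace.edu",)):
    """Hypothetical relevance rule: follow academic hosts that are not blocked."""
    host = urlparse(url).hostname or ""
    if host in blocked_hosts:
        return False
    return host.endswith(allowed_suffixes)


def focused_crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl that only visits URLs passing is_relevant.

    fetch_links(url) is supplied by the caller and returns the outgoing
    links of a page. Returns (visited pages, PDF links found).
    """
    frontier = deque(seeds)
    visited, pdfs = set(), []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited or not is_relevant(url):
            continue  # irrelevant pages are never fetched
        visited.add(url)
        for link in fetch_links(url):
            if link.lower().endswith(".pdf"):
                if link not in pdfs:
                    pdfs.append(link)  # candidate document for extraction
            elif link not in visited:
                frontier.append(link)
    return visited, pdfs
```

Seeds and user-submitted URLs would both be passed in `seeds`; in a real deployment the collected PDF links would be handed to the conversion and extraction steps.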
![Page 10: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/10.jpg)
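Conversion, Filtering and Metadata Extraction
[Diagram: documents from Crawl-M are converted from PDF to text and filtered; the filtered text goes to ParsCit (CRF), which produces citations and contexts, and to HeaderParser (SVM), which produces the header.]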
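A minimal sketch of how the filter and the two extractors might be chained. The filter heuristic (a minimum length plus a "References" or "Bibliography" heading) is an assumption, and `parse_header` and `parse_citations` are placeholders for tools such as the SVM header parser and ParsCit, which are not called here.

```python
import re

# Assumed heuristic: an academic paper has a reference section heading.
REFERENCE_HEADING = re.compile(
    r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE
)


def looks_like_paper(text, min_chars=500):
    """Hypothetical filter applied to the converted text."""
    return len(text) >= min_chars and REFERENCE_HEADING.search(text) is not None


def extract_metadata(text, parse_header, parse_citations):
    """Run both extractors on text that passes the filter, else return None."""
    if not looks_like_paper(text):
        return None
    return {
        "header": parse_header(text),        # title, authors, abstract, ...
        "citations": parse_citations(text),  # citation strings and contexts
    }
```

Passing the extractors in as callables keeps the sketch independent of any particular extraction library.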
![Page 11: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/11.jpg)
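Ingestion
[Diagram: the header and the citations and contexts from Crawl-M go to the XML Builder; the resulting XML is ingested after a checksum-based duplicate check, a DOI is obtained from the DOI database (DOIDB), and the document is stored in the database and repository.]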
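The duplicate check and identifier assignment during ingestion can be sketched as follows. The use of SHA-1, the in-memory dictionary standing in for the database, and the `doc-N` identifier format are assumptions for illustration, not the SeerSuite implementation.

```python
import hashlib


class Ingester:
    """Assigns one internal identifier per distinct file content."""

    def __init__(self):
        self.by_checksum = {}  # checksum -> identifier (stand-in for the database)
        self.next_id = 1

    def ingest(self, file_bytes):
        """Return (identifier, is_new) for the given file content."""
        checksum = hashlib.sha1(file_bytes).hexdigest()
        if checksum in self.by_checksum:
            # Same bytes seen before: reuse the identifier, store nothing.
            return self.by_checksum[checksum], False
        identifier = f"doc-{self.next_id}"  # hypothetical identifier format
        self.next_id += 1
        self.by_checksum[checksum] = identifier
        return identifier, True
```

A checksum only catches byte-identical copies; near-duplicates such as a preprint and its published version need a separate comparison on metadata.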
![Page 12: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/12.jpg)
Maintenance: Indexing
[Diagram: metadata from the database and the document text are combined into a document that is used to update the index.]
![Page 13: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/13.jpg)
Deployment: CiteSeerx
Off-the-shelf hardware
x86 based servers, DAS storage
Linux
Red Hat Cluster Suite (GNBD/GFS)
Tomcat platform
Web applications / interfaces (OAI/API)
Database
MySQL RDBMS
Indexing
Solr
![Page 14: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/14.jpg)
User Interface
Several interface views
Search
− Access to the full text of all documents,
− citations,
− Authors.
− Ranked by user criterion.
Document Summary
− Presents document metadata,
− Citations
− Citation graphs,
− Links to copies
− Links to other bibliography sources.
Citation Relationships
− Co-citations
− Active bibliography
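Both citation relationships can be computed from the citation graph. The sketch below assumes the graph is a dictionary mapping each citing document to the set of documents it cites, and it reads "active bibliography" as bibliographic coupling (documents that share references); the function names are illustrative.

```python
from collections import Counter


def co_citations(cites, doc):
    """Count how often each other document is cited together with doc."""
    counts = Counter()
    for cited in cites.values():
        if doc in cited:
            counts.update(c for c in cited if c != doc)
    return counts


def active_bibliography(cites, doc):
    """Count the references each other document shares with doc."""
    mine = cites.get(doc, set())
    counts = Counter()
    for other, cited in cites.items():
        shared = len(mine & cited)
        if other != doc and shared:
            counts[other] = shared
    return counts


cites = {"A": {"X", "Y"}, "B": {"X", "Y", "Z"}, "C": {"X"}}
print(co_citations(cites, "X"))         # Y is co-cited twice, Z once
print(active_bibliography(cites, "A"))  # B shares two references, C one
```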
![Page 15: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/15.jpg)
Search
[Screenshot callouts: Search Bar, Result, Criterion]
![Page 16: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/16.jpg)
Document Summary
[Screenshot callouts: Citations, Document Details, Downloads and External Links, BibTeX, Citation Graph, MyCiteSeer Launch Points]
![Page 17: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/17.jpg)
Citation Relationships
Citation Relationship - Co-Citation
![Page 18: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/18.jpg)
MyCiteSeer Interface
A personal portal space for users
Track and manage
− User defined collections
− Tags
− Search queries
Correct document metadata.
Monitor documents.
Generate API keys.
Planned features
New interface
More extensive metadata.
![Page 19: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/19.jpg)
MyCiteSeer
Menu
![Page 20: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/20.jpg)
Other Interfaces: OAI-PMH
Programmatic Access – metadata is always in high demand.
A low-barrier mechanism that was already supported by CiteSeer
Extend the existing framework to support OAI.
Servlets with DAOs replace CGI with an embedded database: a more efficient and simpler implementation.
OAI-2 with Dublin Core format.
Many harvesters available for OAI-2.
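A harvester for such an interface needs little code. The sketch below builds a standard OAI-PMH `ListRecords` request and reads Dublin Core titles from a response; the base URL is a placeholder, not the actual CiteSeerx endpoint, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"  # Dublin Core title tag


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    A resumption token is an exclusive argument in OAI-PMH, so it replaces
    metadataPrefix when continuing a partial list.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


def titles(xml_text):
    """Return every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC_TITLE)]


# "http://example.org/oai2" is a hypothetical repository address.
print(list_records_url("http://example.org/oai2"))
```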
![Page 21: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/21.jpg)
API
API is central to programmatic access to SeerSuite.
Exposes relationships and data elements.
Implements a REST based service providing access to
Document metadata (docid)
Authors (aid),
Citations (cid),
Keywords and citation contexts are also provided.
Built using the Jersey library (JAX-RS)
Uses MyCiteSeer
Control access to API.
Limits number of queries per day.
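A per-key daily limit of this kind can be sketched in a few lines. The class name, the in-memory counter and the explicit `today` parameter are assumptions for illustration; the actual service is built in Java with Jersey and keeps keys in MyCiteSeer.

```python
import datetime


class DailyQuota:
    """Allows each API key a fixed number of queries per calendar day."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}  # (api_key, date) -> queries used that day

    def allow(self, api_key, today=None):
        """Record one query and return True, or return False if over the limit."""
        today = today or datetime.date.today()
        used = self.counts.get((api_key, today), 0)
        if used >= self.limit:
            return False
        self.counts[(api_key, today)] = used + 1
        return True
```

Keying the counter on the date means the quota resets on its own at midnight; old entries would need periodic cleanup in a long-running service.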
![Page 22: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/22.jpg)
Federation of Services
CiteSeerx provides services not part of SeerSuite
Consequence of constant research and development.
Infrastructure shared with SeerSuite – Web app framework, Data storage: Database, Repository.
Service examples:
Table search – from TableSeer
Disambiguated author search
Future services: Algorithm search, Figure search, Citation recommendation, etc.
![Page 23: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/23.jpg)
Table Search
Table extraction
Table caption and content
Table search
Ingestion of extracted tables
− Database and Index.
Link table with document
Index
Separate from document index.
Other infrastructure part of SeerSuite
Template for newer services
[Diagram callouts: Embedded table, Document]
![Page 24: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/24.jpg)
Disambiguated Author Search
Author Disambiguation
Essential to identify and attribute records accurately.
− Which M. Johnson to cite?
Algorithms constantly in development – DBSCAN and LASVM
Uses co-authorship, header information (address, affiliation)
Upcoming method includes Random Forests and is online.
Separate index.
Other infrastructure part of SeerSuite
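To show how density-based clustering separates records that share a name, here is a plain DBSCAN over author records. It is a sketch only: a Jaccard distance over co-author and affiliation features stands in for the learned LASVM similarity, and the records, feature strings and thresholds are made up.

```python
def jaccard_distance(a, b):
    """Jaccard distance between feature sets; two empty sets count as distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point; may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
        cluster += 1
    return labels


# Four hypothetical records that all carry the name "M. Johnson".
records = [
    {"coauthor:a. smith", "affil:psu"},
    {"coauthor:a. smith", "coauthor:b. lee", "affil:psu"},
    {"coauthor:c. wong", "affil:mit"},
    {"coauthor:c. wong", "coauthor:d. kim", "affil:mit"},
]
labels = dbscan(records, eps=0.5, min_pts=2, dist=jaccard_distance)
```

Each resulting label corresponds to one presumed real-world author; records labelled -1 would be left unassigned.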
![Page 25: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/25.jpg)
Usage – Traffic
2 million hits on average every day.
Images and JavaScript dominate.
Downloads and Document summaries are popular.
Search has the highest variation.
MyCiteSeer receives little traffic (< 1% of total.)
[Chart: Traffic, 6/01/2009 to 4/26/2010; vertical axis from 0 to 6.0E+6; series: Download, Other, Search, Summary.]
![Page 26: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/26.jpg)
Usage – Country Distribution
Traffic from all over the globe.
US dominates
Germany, China, India, Taiwan, UK are other sources of traffic.
Most of the external referrals are from search engines – Google, Google Scholar, Yahoo, Bing.
[Chart: Traffic by Country Distribution; country codes shown: PL, MY, CH, RU, NL, IR, AU, BR, ES, IT, KR, JP, CA, FR, GB, IN, CN, DE, TW, US.]
![Page 27: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/27.jpg)
Collaboration
SeerSuite is a collaborative effort
Collaborators (no mirrors)
− University of Arkansas, National University of Singapore, King Saud University host independent copies of CiteSeerx.
Research directions
User interface
Metadata extraction and ranking
Information aggregation
Entity disambiguation
Trend monitoring
Citation recommendations
CiteSeerx data available upon request (rsync)
Documents, databases, anonymized logs.
Data sharing – Cornell, CMU, MIT, University College London, NSWC, others.
![Page 28: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/28.jpg)
Lessons Learned
Multi-tier architecture and open source applications can be used to build scalable, reliable and robust services.
Need for virtualization – cost effective.
Data requests – building APIs is important.
Federated services make adopting new services possible.
Metadata extraction – always room for improvement
Optimizations implemented allow better performance.
Several improvements such as UI and performance enhancements possible
Heavily used but not heavily implemented (SeerSuite)
![Page 29: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/29.jpg)
Conclusions and Summary
Overview of SeerSuite
Architecture, Workflow, Deployment, UI, other interfaces including OAI, API
Federation of services
Table search
Author disambiguation
Others planned
Analysis of usage of CiteSeerx
Collaboration
Lessons Learned
Download SeerSuite !
![Page 30: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/30.jpg)
Availability of Code
Released under the Apache License, Version 2.0.
Code for SeerSuite and related software is available on SourceForge: http://sourceforge.net/projects/citeseerx
Virtual machine with a deployment of SeerSuite: http://singularity.ist.psu.edu:8080/seerlab.html
Support by the research group at Penn State
![Page 31: SeerSuite: Developing a Scalable and Reliable Application … · 2019-02-25 · SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries](https://reader034.fdocuments.in/reader034/viewer/2022042115/5e91ea910fbe4c4ebe70a449/html5/thumbnails/31.jpg)
Q & A