ChemXSeer: Digital library tools, features, and crawling characteristics
Edward A. Fox
Professor, Computer Science, Virginia Tech
Blacksburg, VA 24061 USA
[email protected] http://fox.cs.vt.edu
and
Sagnik Ray Choudhury
Ph.D. Student, College of Information Science and Technology, Penn State, USA
[email protected]
13 Jan. 2014 -- QU Library, Doha, Qatar
Outline
• Acknowledgments
• Introduction
• ELISQ
• Technology
HTTP://WWW.QU.EDU.QA/
HTTP://WWW.TAMU.EDU/ HTTP://WWW.PSU.EDU/ HTTP://WWW.VT.EDU/
Funding provided through the ELISQ project: Electronic Library Institute - SeerQ
Sponsored by Qatar University Library
HTTP://qnl.qa
Acknowledgments
• Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University
• Dr. Rashid Alammari, Dean, College of Engineering, Qatar University
• Dr. Moumen Hasnah, Director of Academic Research, Qatar University
• Dr. Imad Bachir, Qatar University Library Director
• Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University
• Prof. Ramazan Kahraman, Head of the Department of Chemical Engineering, Qatar University
Additional Thanks
QScience – providing collection:
Christopher J. Leonard, Editorial Director
Paul Coyne, CTO
US National Science Foundation (recent and current grants to Fox):
• IIS-1319578
• IIS-0916733
• DUE-0840719
• OCI-1032677
• plus those to PSU, TAMU
Introduction
• Digital libraries have emerged since 1991.
• Now each major publisher has its own digital library; many others exist too.
• Related systems include:
  • Institutional repositories, e.g., at QU
  • Content & courseware management systems
• Research and development funding of hundreds of millions of dollars has led to powerful tailored systems, such as for chemical information.
Information Life Cycle
[Figure: cycle of stages: Authoring, Modifying, Organizing, Indexing, Storing, Retrieving, Distributing, Networking, Retention / Mining, Accessing, Filtering, Using, Creating]
[Figure: taxonomy of digital library services]
Infrastructure Services
• Repository-Building
  • Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
  • Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
• Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
Information Satisfaction Services
• Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
ELISQ (Electronic Library Institute - SeerQ): Project Team
Qatar University, Qatar: Mohammed Samaka (Ph.D., Co-Lead PI)
Sumaya Ali S A Al-Maadeed (Ph.D., PI)
Myrna Tabet
Asad Nafees
Tahseena Moideen
This project was made possible by NPRP Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation).
Virginia Tech, USA:
Edward Fox (Ph.D., Lead-PI)
Tarek Kanan
Penn State University, USA:
C. Lee Giles (Ph.D., PI)
Sagnik Ray Choudhury
Texas A&M, USA:
Richard Furuta (Ph.D., PI)
Hamed Alhoori
Consultants:
John Impagliazzo (Ph.D., Key Investigator)
Susan Lukesh (Ph.D.)
Carole Thompson
Qatar National Library, Qatar:
Claudia Lux (PI)
Krishna RoyChowdhury
Postdoc - TBA
ELISQ Project (1 of 2)
Project Objectives/Aims
A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities.
Leverage the crawling engine from Penn State's SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines and all types of government information.
ELISQ Project (2 of 2)
Project Objectives/Aims (continued)
B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society.
Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.
Crawler (Heritrix) (for search engines & Web archives)
• A Web crawler starts with a list of URLs to visit, called the seeds.
• On each of those pages, it identifies all the hyperlinks,
• adds them to the list of URLs to visit, and
• recursively visits the pages they point to,
• according to a set of policies.
• It prioritizes its downloads, since some pages change often.
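The crawl loop described above can be sketched in a few lines of Python. This is only an illustration, not Heritrix code: the `crawl` function, the `PAGES` dictionary (a tiny in-memory stand-in for the Web), and the "stay on allowed hosts" policy are all assumptions made for the example, and the frontier is a plain first-in-first-out queue rather than a prioritized one.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed_hosts, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    frontier = deque(seeds)   # the list of URLs still to visit
    seen = set(seeds)         # every URL ever queued
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Assumed policy: stay on allowed hosts, never queue a URL twice.
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical pages, so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="http://other.example/x">X</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.org/b": "no links here",
}

order = crawl(["http://example.org/"], PAGES.get, {"example.org"})
print(order)
```

Replacing the queue with a priority queue keyed on how often a page changes would be one way to add the prioritization mentioned in the last bullet.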
Selected SeerSuite Instantiations
• CiteSeerX (http://citeseerx.ist.psu.edu)
  • A scientific literature digital library and search engine
• ChemXSeer (http://chemxseer.ist.psu.edu)
  • Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools
• ArchSeer (http://archseer.ist.psu.edu/)
  • Archeology literature
• TableSeer
CiteSeerX: http://citeseerx.ist.psu.edu
• 3 M documents
• Millions of files
• 60 M citations
• 3 to 6 M authors
• 2 to 4 M hits per day
• 100K documents added monthly
• 800K individual users
• Several Tbytes
• CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in computer science
• Converts PDF to text
• Automatically extracts OAI metadata and other data
• Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
• The software is open source and can be used to build other such tools
SeerSuite
• Toolkit used to build search engines and digital libraries
  • CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc.
• Built on commercial-grade open source tools (Solr/Lucene)
• Penn State expertise: automated specialized metadata extraction
• Supports research in
  • Indexing and search
  • Data mining & structures
  • Information and knowledge extraction
  • Social networks: name/entity disambiguation
  • Scientometrics/informetrics
  • Systems engineering
  • User interface design (HCI = human-computer interaction)
  • Software engineering and management
SeerSuite is not Google
• Metadata (as in library catalogs) as well as content
• Sets of collections, rather than the Web as a whole
  • Provided by a curator (e.g., publisher, museum)
  • Provided by user submissions
  • Or collected by focused 'crawling'
• Tailored services, rather than the same for everyone
  • Browsing using categories, preserving, adding value
  • Based on studying user requirements, e.g., chemists
• Working with entities, rather than just words
  • Citations, tables, figures, names, chemical formulas
  • Using knowledge bases, machine learning, artificial intelligence
Questions for Us?
• http://elisq.qu.edu.qa/
• http://fox.cs.vt.edu
Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang, James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan, Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray Choudhury
Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences and Technology
Pennsylvania State University, University Park, PA, USA
Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
Current support: Dow Chemical
http://chemxseer.ist.psu.edu
Talk Overview
● Challenges and Motivation
● Functionalities
  – Fulltext Search
  – Author Search
  – Table Search
  – Figure Search
  – Expertise Search
  – Chemical Name and Formula Tagging
  – Chemical Name and Formula Search
● Summary
Based on cyberinfrastructure for CiteSeerX
Built on Solr/Lucene, SeerSuite, other OSS
ChemXSeer RSC
ChemXSeer Fulltext Search
ChemXSeer Author Search
ChemXSeer Table Search
• Tables are widely used to present experimental results or statistical data in scientific documents.
• Existing search engines treat tabular data as regular text.
  – Structural information and semantics are not preserved.
• We automatically identify tables and extract table metadata in XML.
Table Metadata Representation:
• Environment metadata: (document specifics: type, title, …)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (caption, footnote, …)
• Layout metadata: (number of rows, columns, headers, …)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)
Y. Liu et al., AAAI 2007, JCDL 2007.
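One way to picture the six metadata categories is as a small record serialized to XML. The sketch below is an assumption-laden illustration: the `TableMetadata` class, the element names, and the sample values are invented for the example and do not reproduce the schema actually used by ChemXSeer or TableSeer.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    environment: dict   # document specifics: type, title, ...
    frame: dict         # borders: left, right, top, bottom, ...
    affiliated: dict    # caption, footnote, ...
    layout: dict        # number of rows, columns, headers, ...
    cells: list = field(default_factory=list)   # rows of cell values
    cell_type: str = "numeric"                  # numeric, symbolic, hybrid, ...


def to_xml(meta):
    """Serialize one table's metadata to an XML string (hypothetical schema)."""
    root = ET.Element("table")
    for section in ("environment", "frame", "affiliated", "layout"):
        node = ET.SubElement(root, section)
        for key, value in getattr(meta, section).items():
            ET.SubElement(node, key).text = str(value)
    content = ET.SubElement(root, "cells", type=meta.cell_type)
    for row in meta.cells:
        row_node = ET.SubElement(content, "row")
        for value in row:
            ET.SubElement(row_node, "cell").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample table.
meta = TableMetadata(
    environment={"title": "Example paper"},
    frame={"top": "yes", "bottom": "yes"},
    affiliated={"caption": "Table 1. Rate constants"},
    layout={"rows": 2, "columns": 2},
    cells=[["pH", "k"], ["7.0", "0.12"]],
    cell_type="hybrid",
)
xml_text = to_xml(meta)
print(xml_text)
```

Keeping caption, layout, and cell values in separate fields is what lets a table search engine match a query against, say, captions only, which plain full-text indexing cannot do.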
Sample Table Metadata Extracted File
ChemXSeer Table Search
ChemXSeer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures. No search engine allows searching on figures and their data in chemical documents. Tools that automate data extraction from figures and allow search on them can:
• Increase our understanding of key concepts of papers.
• Provide data for automatic comparative analyses.
• Enable regeneration of figures in different contexts.
• Enable search for documents with figures containing specific experimental results.
X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
Our Contribution
ChemXSeer Name and Formula Extraction and Search
• Extraction and search of chemical names and formulae in scientific documents has been shown to be very useful.
• Extraction and search on chemical names is hard:
  – Many new chemical molecules are created every day, so any dictionary-based name recognizer will eventually fail.
  – Names need to be segmented to get semantically meaningful sub-terms such as "methyl", "ethyl" and "alcohol" from "methylethyl alcohol".
• Identifying formulae is hard:
  – "… YSI 5301, Yellow Springs, OH, USA …" (non-formula)
  – "… such as hydroxyl radical OH, superoxide O2- …" (formula)
• For searching, formulae cannot be treated as text.
  – Domain knowledge (formula identification)
  – Structural knowledge (substructure finding and search)
B. Sun et al., WWW 2007, WWW 2008, TOIS
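A small example shows why formulae cannot be matched as plain strings: "CH3COOH" and "C2H4O2" are different text but the same composition. The sketch below parses a formula into element counts and compares those instead. It is a simplification under stated assumptions: the `composition` and `has_subformula` helpers are hypothetical, element symbols are not validated against the periodic table, charges are ignored, and comparing counts is far weaker than the structural (substructure) search mentioned above.

```python
import re
from collections import Counter

# An element with an optional count, an opening parenthesis,
# or a closing parenthesis with an optional multiplier.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def composition(formula):
    """Return element counts for a formula such as 'Ca(OH)2'."""
    stack = [Counter()]
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if not match:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        element, count, open_paren, close_paren, multiplier = match.groups()
        if element:
            stack[-1][element] += int(count or 1)
        elif open_paren:
            stack.append(Counter())
        else:
            if len(stack) == 1:
                raise ValueError(f"unbalanced parenthesis in {formula!r}")
            group = stack.pop()
            for symbol, n in group.items():
                stack[-1][symbol] += n * int(multiplier or 1)
        pos = match.end()
    if len(stack) != 1:
        raise ValueError(f"unbalanced parenthesis in {formula!r}")
    return dict(stack[0])


def has_subformula(formula, query):
    """True if `formula` contains at least the atoms listed in `query`."""
    have, want = composition(formula), composition(query)
    return all(have.get(symbol, 0) >= n for symbol, n in want.items())


print(composition("CH3COOH") == composition("C2H4O2"))
print(has_subformula("CH3COOH", "OH"), has_subformula("CH4", "OH"))
```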
Chemical Entity Extraction and Tagging
● Name tagging
  – Each chemical name can be a phrase
  – Examples
    ● "... Determination of lactic acid and ..."
    ● "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
● Formula tagging
  – Each formula is a single term
  – Example
    ● "... such as hydroxyl radical OH, superoxide ..."
  – Non-formula example
    ● "... YSI 5301, Yellow Springs, OH, USA ..."
● Tagging examples
  – Name tagging: "... of <name-type>lactic acid</name-type> and ..."
  – Formula tagging: "... radical <formula-type>OH</formula-type>, superoxide ..."
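The tagged output format above can be reproduced with a toy tagger. This sketch is not the ChemXSeer tagger: a two-entry name dictionary (`NAME_LEXICON`) and a hand-written context rule (`CONTEXT_CUES`) are hypothetical stand-ins for whatever model the real system uses, chosen only so that the "OH" examples from the slide come out as shown.

```python
import re

NAME_LEXICON = ["lactic acid", "promecarb"]        # hypothetical tiny dictionary
CONTEXT_CUES = {"radical", "superoxide", "ion"}    # hypothetical cue words
# A candidate formula: one or more element-like symbols with optional counts.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*)+\b")


def tag(text):
    """Wrap known names and context-supported formulas in tags."""
    for name in NAME_LEXICON:
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, lambda m: f"<name-type>{m.group(0)}</name-type>", text)

    def tag_formula(match):
        # Look at the two words before the candidate for a chemistry cue.
        preceding = re.findall(r"[A-Za-z]+", match.string[:match.start()])[-2:]
        if CONTEXT_CUES & {word.lower() for word in preceding}:
            return f"<formula-type>{match.group(0)}</formula-type>"
        return match.group(0)

    return FORMULA.sub(tag_formula, text)


print(tag("... of lactic acid and ..."))
print(tag("... such as hydroxyl radical OH, superoxide ..."))
print(tag("... YSI 5301, Yellow Springs, OH, USA ..."))
```

The third call leaves "OH" untagged because nothing near it suggests chemistry, which is the disambiguation problem the slide points out; a fixed cue list like this one would not generalize.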
Online Chemical Entity Tagger
● We have an open source chemical name and formula tagger and a web-based interface for evaluation.
● The interface takes a PDF file as input and returns the text of the PDF with names or formulas tagged.
Online Chemical Entity Tagger: Chemical Name Tagging Example
● Results on a sample PDF.
● Some chemical formulas are erroneously identified as chemical names (loss of precision).
● High recall (most chemical names are identified).
Online Chemical Entity Tagger: Chemical Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas are not identified (loss of recall).
● High precision (words identified as formulas are actual formulas).
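The precision and recall trade-off in the two tagging examples can be made concrete with a short calculation. The entity sets below are invented for illustration and are not results from the sample PDF.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted entity set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical name tagging: every true name is found, plus one formula by mistake.
gold_names = {"lactic acid", "promecarb", "ethanol", "acetone"}
tagged_names = {"lactic acid", "promecarb", "ethanol", "acetone", "H2O"}
name_scores = precision_recall(tagged_names, gold_names)

# Hypothetical formula tagging: everything tagged is right, but half are missed.
gold_formulas = {"OH", "O2-", "H2O", "CO2"}
tagged_formulas = {"OH", "H2O"}
formula_scores = precision_recall(tagged_formulas, gold_formulas)

print(name_scores, formula_scores)
```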
Chemical Name Indexing and Search
● Segmentation-based index scheme
  – Used for indexing chemical names
  – First segment a chemical name hierarchically, and then index the substring at each node if it is frequent.
  – acetaldoxime -> aldoxime -> oxime
  – A search for "oxime" returns all of them, depending on the ranking function.
  – This cannot be done in ordinary text search.
● Index schemes:
  – Which tokens to index?
  – Indexing all subsequences generates a very large index.
  – "but" is a morpheme in "butane", but not in "nembutal".
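The idea of indexing only meaningful sub-terms can be sketched as follows. Note the substitution: the slide describes indexing substrings that are frequent, whereas this example uses a small hand-made morpheme list (`MORPHEMES`, hypothetical) to decide where a name may be split. The `segment` and `build_index` helpers are likewise invented for the illustration.

```python
from collections import defaultdict

MORPHEMES = {"acet", "ald", "oxime", "but", "ane", "meth", "eth", "yl"}  # hypothetical


def segment(name):
    """Split `name` into known morphemes, or return [name] if that is impossible."""
    def split(rest):
        if not rest:
            return []
        for end in range(len(rest), 0, -1):      # prefer longer morphemes
            if rest[:end] in MORPHEMES:
                tail = split(rest[end:])
                if tail is not None:
                    return [rest[:end]] + tail
        return None

    return split(name) or [name]


def build_index(names):
    """Map each indexed sub-term to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        parts = segment(name)
        for i, part in enumerate(parts):
            index[part].add(name)                # the morpheme itself
            index["".join(parts[i:])].add(name)  # suffix starting at a segment boundary
    return dict(index)


index = build_index(["acetaldoxime", "aldoxime", "butane", "nembutal"])
print(sorted(index["oxime"]))
print(sorted(index["but"]))
```

Because "nembutal" cannot be split into listed morphemes, it is indexed only as a whole, so a search for "but" finds "butane" but not "nembutal".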
Example Formula Search
http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeer)
• Built on top of millions of papers in CiteSeerX.
• A similar system was developed for Dow Chemical.
• Can find experts in "polymer chemistry" or the expertise of "Linus Pauling".
• Finds an expert based on their publications.
• Many approaches: keyphrases, citations, download count, affiliation.
Treeratpituk, Chen, JCDL'13
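A publication-based expert finder can be sketched with two of the signals listed above, keyphrases and citations. Everything here is an assumption for illustration: the paper records, the author names, and the scoring formula (one point per matching paper plus a logarithmic citation bonus) are invented and are not the method of the cited work.

```python
import math
from collections import Counter, defaultdict

# Hypothetical paper records.
papers = [
    {"authors": ["A. Author", "B. Writer"],
     "keyphrases": {"polymer chemistry", "catalysis"}, "citations": 9},
    {"authors": ["A. Author"],
     "keyphrases": {"polymer chemistry"}, "citations": 0},
    {"authors": ["C. Scholar"],
     "keyphrases": {"table extraction"}, "citations": 99},
]


def find_experts(topic, papers):
    """Rank authors by their papers on `topic`, weighted by citations."""
    scores = defaultdict(float)
    for paper in papers:
        if topic in paper["keyphrases"]:
            weight = 1.0 + math.log10(1 + paper["citations"])
            for author in paper["authors"]:
                scores[author] += weight
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def expertise_of(author, papers):
    """Count how often each keyphrase appears across an author's papers."""
    counts = Counter()
    for paper in papers:
        if author in paper["authors"]:
            counts.update(paper["keyphrases"])
    return dict(counts)


ranking = find_experts("polymer chemistry", papers)
print(ranking)
print(expertise_of("A. Author", papers))
```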
Future Work
Lots of interesting work to do! Few computer/machine learning scientists involved.
• Acquisitions - more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
  • spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features