Semantic Scholar: Document Analysis at Scale

Miles Crawford, Director of Engineering, Semantic Scholar · Allen Institute for Artificial Intelligence

Page 1: Miles Crawford, Director of Engineering Semantic Scholar · Allen Institute for Artificial Intelligence Mosaic Common Sense Knowledge and Reasoning Aristo Machine Reading and Question

Semantic Scholar: Document Analysis at Scale

Miles Crawford, Director of Engineering

Page 2:

Outline

● Introduction to the Allen Institute for Artificial Intelligence and Semantic Scholar

● Research at Semantic Scholar

● Creating www.semanticscholar.org

● Other resources for researchers

Page 3:

Introduction to AI2 and S2

Page 4:

“AI for the common good”

Page 5:

Allen Institute for Artificial Intelligence

● Mosaic: Common Sense Knowledge and Reasoning

● Aristo: Machine Reading and Question Answering

● Euclid: Math and Geometry Comprehension

● AllenNLP: Deep Semantic NLP Platform

● PRIOR: Visual Reasoning

● Semantic Scholar: AI-Based Academic Knowledge

Page 6:

Semantic Scholar: Vision & Strategy

Semantic Scholar makes the world's scholarly knowledge easy to survey and consume.

Page 7:

Semantic Scholar: Vision & Strategy

Scale: Attract and retain a significant and sustainable share of academic search traffic.

Differentiation: S2 is dramatically better at surveying, extracting, and helping researchers consume the most relevant information from the world's research.

Impact on research with AI: Work towards a “Wright Brothers” moment for research through research on novel AI techniques that are prototyped with millions of active users.

Page 8:

semanticscholar.org

Page 9:

Research at Semantic Scholar

Page 10:

Research at S2: Three Levels of Analysis

Paper, Relationships, Macro

Page 11:

Paper: Extract meaningful structures

Peters et al. ACL 2017 -- Semi-supervised Sequence Tagging with Bidirectional Langua…

Ammar et al. SemEval 2017 -- Semi-supervised End-to-end Entity and Relation Extrac…

Siegel et al. JCDL 2018 -- Extracting Scientific Figures with Distantly Supervised Neural…

Figures

Tables

Topics: Omniglot, Backpropagation, Neural Networks

Relations: C-peptide [contraindicated with] Diabetes Mellitus

Results: The addition of MbPA reaches a test perplexity of 29.2 which is, to the authors’ knowledge, state-of-the-art at time of writing.

Page 12:

Paper: Extract meaningful structures

Page 13:

Relationships: Establishing Connections

Ammar et al. NAACL 2018 -- Construction of the Literature Graph in Semantic Scholar

Wang et al. BioNLP 2018 -- Ontology Alignment in the Biomedical Domain Using Entit…

Bhagavatula et al. NAACL 2018 -- Content-Based Citation Recommendation

uses method

Discovered KB

should cite

UMLS

Ontology Matching

Page 14:

Relationships: Establishing Connections

Page 15:

Relationships: Establishing Connections

Page 16:

Macro: Trends in research

Kang et al. NAACL 2018 -- A Dataset of Peer Reviews (PeerRead): Collection, Insights...

Feldman et al. arXiv 2018 -- Citation Count Analysis for Papers with Preprints

pre-publishing vs. citation rates

understanding peer reviews

quantify demographic bias in clinical trials

Page 17:

2018 Publications

● A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications. Kang, Ammar, Dalvi, van Zuylen, Kohlmeier, Hovy, Schwartz. NAACL 2018.

● Content-Based Citation Recommendation. Bhagavatula, Feldman, Power, Ammar. NAACL 2018.

● Construction of the Literature Graph in Semantic Scholar. Ammar, Groeneveld, Bhagavatula, …, Etzioni. NAACL 2018.

● Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context. Wang, Bhagavatula, Neumann, Lo, Wilhelm, Ammar. BioNLP 2018.

● Extracting Scientific Figures with Distantly Supervised Neural Networks. Siegel, Lourie, Power, Ammar. JCDL 2018.

● Neural Relation Extraction using a Combination of Full Supervision and Distant Supervision. Beltagy, Lo, Ammar. arXiv preprint 2018.

● Citation Count Analysis for Papers with Preprints. Feldman, Lo, Ammar. arXiv preprint 2018.

Page 18:

Bringing it All to Production

Page 19:

semanticscholar.org: Scale and Reach

● Comprehensive literature graph to explore papers, authors, topics.

● View and search across 40MM+ Computer Science & BioMedical academic papers.

● Expansion to all of science in 2019.

● Topic extraction covering 350k+ topics, 230MM+ mentions, and 6.7MM+ relationships.

● Used by over 2.2MM unique users per month.

● Global adoption.

Page 20:

Architecture Overview

[Architecture diagram; labeled components:]

● Data Acquisition: Web Crawler (The Internet); Crawler with metadata sources DBLP, PubMed, Springer, etc.

● Machine Learning Models: Figures, Entities

● Pipeline: Junk Filter, Create Citation Graph, Join Extracted Data, De-duplicate & Cluster, Create Search Index

● Elasticsearch

● Web Application: API, Node Server, React Application

● User’s Browser

Page 21:

Normalizing Paper Records

[Diagram labels: Data Acquisition (Crawler, PubMed, Springer, etc.); Sourced Paper; Amazon S3]

Example Sourced Paper records:

● Title: N/A; Authors: N/A; PDF: a2ho-4.pdf; etc...

● Title: Findings of...; Authors: A, B, C; PDF: findings.pdf; etc...

● Title: Analyzing...; Authors: A, B, C; PDF: N/A; etc...
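The idea of normalizing differently sourced records into one shape could be sketched as follows. This is a minimal illustration, not the production schema: the `SourcedPaper` class, the `normalize` helper, the `source` names, and the treatment of "N/A" as a missing value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourcedPaper:
    """Hypothetical common record shape shared by every source."""
    source: str
    title: Optional[str]
    authors: Tuple[str, ...]
    pdf: Optional[str]


def _clean(value) -> Optional[str]:
    """Map empty strings and 'N/A' placeholders to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text in ("", "N/A") else text


def normalize(source: str, raw: dict) -> SourcedPaper:
    """Convert one raw record (field names assumed) into a SourcedPaper."""
    authors_raw = _clean(raw.get("Authors"))
    authors = tuple(a.strip() for a in authors_raw.split(",")) if authors_raw else ()
    return SourcedPaper(source, _clean(raw.get("Title")), authors, _clean(raw.get("PDF")))


# A crawled PDF with no metadata, and a publisher record with no PDF.
crawled = normalize("crawler", {"Title": "N/A", "Authors": "N/A", "PDF": "a2ho-4.pdf"})
feed = normalize("publisher-feed", {"Title": "Analyzing...", "Authors": "A, B, C", "PDF": "N/A"})
```

Representing missing fields as `None` rather than placeholder strings keeps later steps, such as de-duplication, from treating two "N/A" titles as equal.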

Page 22:

Distributed Paper Resolution

Pipeline: Junk Filter, Create Citation Graph, Join Extracted Data, De-duplicate & Cluster, Create Search Index

Records before resolution:

● Title: Analyzing Scientific Documents; Author: John Doe

● Title: Analyzing scientific documents; Author: J. Doe

● Title: Analyze Scientific Documents; Author: Jane Smith

● Title: Deep Learning for NLP; Author: Jane Smith

Records after resolution, grouped by title block:

Title Block: analyzscientificdocument

● Title: Analyzing Scientific Documents; Author: John Doe

● Title: Analyze Scientific Documents; Author: Jane Smith

Title Block: deeplearningfornlp

● Title: Deep Learning for NLP; Author: Jane Smith
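A rough sketch of this blocking-then-clustering step is shown here. It is an illustration under stated assumptions, not the Semantic Scholar implementation: the suffix-stripping rule in `title_block` and the author comparison in `authors_compatible` are invented for the example, and the key it produces for "Deep Learning for NLP" differs from the one on the slide.

```python
import re
from collections import defaultdict

SUFFIXES = ("ing", "es", "e", "s")  # assumed, deliberately crude stemming


def title_block(title: str) -> str:
    """Blocking key: lowercase words, strip one suffix each, concatenate."""
    stems = []
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        for suffix in SUFFIXES:
            if len(word) > 4 and word.endswith(suffix):
                word = word[: -len(suffix)]
                break
        stems.append(word)
    return "".join(stems)


def authors_compatible(a: str, b: str) -> bool:
    """Same last name and same first initial (hypothetical rule)."""
    pa, pb = a.replace(".", "").split(), b.replace(".", "").split()
    return pa[-1].lower() == pb[-1].lower() and pa[0][0].lower() == pb[0][0].lower()


def resolve(records):
    """Group records by title block, then cluster within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[title_block(rec["title"])].append(rec)
    clusters = []
    for block in blocks.values():  # each block is independent of the others
        block_clusters = []
        for rec in block:
            for cluster in block_clusters:
                if authors_compatible(cluster[0]["author"], rec["author"]):
                    cluster.append(rec)
                    break
            else:
                block_clusters.append([rec])
        clusters.extend(block_clusters)
    return clusters


records = [
    {"title": "Analyzing Scientific Documents", "author": "John Doe"},
    {"title": "Analyzing scientific documents", "author": "J. Doe"},
    {"title": "Analyze Scientific Documents", "author": "Jane Smith"},
    {"title": "Deep Learning for NLP", "author": "Jane Smith"},
]
clusters = resolve(records)
```

Because records only need to be compared within their own title block, the blocks can be handed to separate workers, which is presumably what makes the resolution step easy to distribute.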

Page 23:

Distributed Citation/Reference Matching

Pipeline: Junk Filter, Create Citation Graph, Join Extracted Data, De-duplicate & Cluster, Create Search Index

Title Block: analyzscientificdocument

● Title: Analyzing Scientific Documents; Author: John Doe

● Title: Analyze Scientific Documents; Author: Jane Smith

Title Block: deeplearningfornlp

● Title: Deep Learning for NLP; Author: Jane Smith

Citing paper:

● Title: Structured Content Extraction; Author: John Doe

References:

1. Analyzing Scientific Documents, Doe et al., SCIDOCA 2016

2. Deep Learning with NLP, Smith, NAACL 2017

3. Structural Heuristics in Academic Documents, Oscar, SCIDOCA 2016

?
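One way to picture reference matching is to compare each reference string against the resolved papers and accept the best candidate above a similarity threshold. The sketch below assumes fuzzy title comparison with Python's standard `difflib.SequenceMatcher`, restricted to papers whose author shares the reference's last name; the helper names and the 0.75 threshold are illustrative choices, not details given in the talk.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(title.lower().split())


def last_name(author: str) -> str:
    """'Doe et al.' -> 'doe'; 'John Doe' -> 'doe'."""
    return author.replace("et al.", "").strip().split()[-1].lower()


def match_reference(ref, papers, threshold=0.75):
    """Return the best-matching paper, or None if nothing clears the threshold."""
    best, best_score = None, threshold
    for paper in papers:
        if last_name(paper["author"]) != last_name(ref["author"]):
            continue  # cheap filter before the string comparison
        score = SequenceMatcher(
            None, normalize(ref["title"]), normalize(paper["title"])
        ).ratio()
        if score >= best_score:
            best, best_score = paper, score
    return best


papers = [
    {"title": "Analyzing Scientific Documents", "author": "John Doe"},
    {"title": "Analyze Scientific Documents", "author": "Jane Smith"},
    {"title": "Deep Learning for NLP", "author": "Jane Smith"},
]
references = [
    {"title": "Analyzing Scientific Documents", "author": "Doe et al."},
    {"title": "Deep Learning with NLP", "author": "Smith"},
    {"title": "Structural Heuristics in Academic Documents", "author": "Oscar"},
]
matches = [match_reference(ref, papers) for ref in references]
```

In this toy run the third reference has no candidate in the corpus and stays unmatched, which is one plausible reading of the "?" on the slide.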

Page 24:

Scalable ML Models

Pipeline: Junk Filter, Create Citation Graph, Join Extracted Data, De-duplicate & Cluster, Create Search Index

Model Task System

Docker containers on autoscaling host groups:

● Figure Extraction v1.0 (several containers)

● Figure Extraction v1.1

● Topic Location v1.3

● Key Results v1.0

● Etc...

Table of extraction inputs and results per model:

● Paper 1: abc; Paper 2: def; Paper 3: (empty); Paper 4: (empty)

● Paper 1: (empty); Paper 2: (empty); Paper 3: (empty); Paper 4: (empty)

● Paper 1: abc; Paper 2: (empty); Paper 3: (empty); Paper 4: (empty)

● Paper 1: abc; Paper 2: def; Paper 3: hij; Paper 4: (empty)

● Etc...

Notifications for new or changed papers
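The bookkeeping behind such a model task system might look like the following in-memory sketch. Everything here is hypothetical: the `Model` and `TaskSystem` classes, keying results by (model name, version, paper id), and clearing results when a paper changes are assumptions chosen to illustrate the slide, and the real system runs models in Docker containers rather than as Python callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Model:
    name: str
    version: str
    run: Callable[[str], str]  # stand-in for a containerized extractor


@dataclass
class TaskSystem:
    models: List[Model]
    papers: Dict[str, str] = field(default_factory=dict)
    results: Dict[Tuple[str, str, str], str] = field(default_factory=dict)

    def notify(self, paper_id: str, text: str) -> None:
        """Handle a 'new or changed paper' notification."""
        if self.papers.get(paper_id) != text:
            self.papers[paper_id] = text
            # Results computed from the old content are stale; drop them.
            for key in [k for k in self.results if k[2] == paper_id]:
                del self.results[key]

    def pending(self):
        """Every (model, paper) pair that has no stored result yet."""
        return [
            (m, pid)
            for m in self.models
            for pid in self.papers
            if (m.name, m.version, pid) not in self.results
        ]

    def run_pending(self) -> int:
        work = self.pending()
        for model, pid in work:
            self.results[(model.name, model.version, pid)] = model.run(self.papers[pid])
        return len(work)


system = TaskSystem([
    Model("figure-extraction", "1.0", lambda text: text.upper()),
    Model("key-results", "1.0", lambda text: str(len(text))),
])
system.notify("paper-1", "abc")
system.notify("paper-2", "def")
first_run = system.run_pending()    # two models x two papers
second_run = system.run_pending()   # nothing left to do
system.notify("paper-1", "abcd")    # changed paper: both models rerun
third_run = system.run_pending()
system.models.append(Model("figure-extraction", "1.1", lambda text: text.lower()))
fourth_run = system.run_pending()   # new version backfills every paper
```

Keying results by model version means a new version starts with an empty column and fills it in over time, which matches the partially filled tables on the slide.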

Page 25:

Other Resources

Page 26:

Open Research Corpus

● Full Corpus of Papers:

○ Some data is restricted due to licensing/copyright conditions.

○ Record-per-line JSON archive, 36GB.

○ http://labs.semanticscholar.org/corpus/

● API:

○ Lightweight single paper or author access.

○ JSON over REST API.

○ Link resolver for easy linking to Semantic Scholar.

○ http://api.semanticscholar.org/
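A record-per-line JSON archive can be streamed one record at a time, so a 36GB file never has to fit in memory. The sketch below shows the pattern on an in-memory sample; the field names `id`, `title`, and `year` are assumptions for illustration and should be checked against the corpus documentation.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an opened corpus file; a real run would pass a file handle.
sample = io.StringIO(
    '{"id": "p1", "title": "Findings of...", "year": 2017}\n'
    '{"id": "p2", "title": "Analyzing...", "year": 2018}\n'
)
titles_2018 = [rec["title"] for rec in iter_records(sample) if rec.get("year") == 2018]
```

If the archive is distributed compressed, `gzip.open(path, "rt")` from the standard library returns a text handle that can be passed to `iter_records` unchanged.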

Page 27:

Use of Open Corpus as Training Set

An independent researcher created a system that embeds papers as vectors to support similarity measurements and recommendation generation.

https://github.com/Santosh-Gupta/Research2Vec
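As a generic illustration of how paper vectors support similarity and recommendation, the sketch below ranks papers by cosine similarity. It does not reproduce Research2Vec; the toy two-dimensional vectors and the `most_similar` helper are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_id: str, embeddings: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k papers closest to the query paper."""
    query = embeddings[query_id]
    scored = [(cosine(query, vec), pid) for pid, vec in embeddings.items() if pid != query_id]
    return [pid for _, pid in sorted(scored, reverse=True)[:k]]


embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
recommended = most_similar("a", embeddings, k=1)
```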

Page 28:

ArXiv Added a Bibliography via API

ArXiv used our API to fetch and re-display citation and reference lists on their pages.

Page 29:

Open Source Repositories

Cite-o-matic: Find relevant citations for drafts of papers.
https://github.com/allenai/citeomatic

Science Parse v1: CRF-based extraction of metadata from PDFs.
https://github.com/allenai/science-parse

Science Parse v2: LSTM-based extraction of metadata from PDFs.
https://github.com/allenai/spv2

Deep Figures: Extract figures and tables from papers.
https://github.com/allenai/deepfigures-open

Page 30:

What would you like to see?

Page 31:

Thank you!

[email protected]

http://allenai.org

https://semanticscholar.org

Page 32:

Appendix

Page 33:

1st dataset of peer reviews

Dongyeop Kang, Waleed Ammar, Bhavana Dalvi Mishra, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz. NAACL 2018

Page 34:

Prepublishing before a paper is accepted is associated with 65% more citations!

Sergey Feldman, Kyle Lo, Waleed Ammar. arXiv 2018

Page 35:

Gender bias in clinical trials

Page 36:

Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar. NAACL 2018

Cite-o-matic

Page 37:

Synthesize knowledge

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. NAACL 2018 (industry track)

Literature graph

Page 38:

Lucy Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm and Waleed Ammar. BioNLP 2018

Ontology alignment

Page 39:

Context-sensitive word representations

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. ACL 2017

Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, Russell Power. SemEval 2017

Page 40:

Figure extraction

Noah Siegel, Nicholas Lourie, Russell Power, Waleed Ammar. JCDL 2018.

Page 41:

Next steps

1. Extract Meaningful Structures

2018 2019

Page 42:

Next steps

1. Extract Meaningful Structures

2018 2019

Address more paper FAQs (e.g., PICO elements).

Page 43:

Next steps

1. Extract Meaningful Structures

2018 2019

Model salience of extracted structures at the document level.

Page 44:

Next steps

1. Extract Meaningful Structures

2018 2019

User’s implicit/explicit feedback.

Page 45:

Next steps

1. Extract Meaningful Structures

2018 2019

User’s implicit/explicit feedback.

Page 46:

Next steps

1. Extract Meaningful Structures

2018 2019

User’s implicit/explicit feedback.

Page 47:

Ganin and Lempitsky. ICML’15

Next steps

1. Extract Meaningful Structures

2018 2019

Adapt existing models to more domains.

Page 48:

Next steps

2. Establish Connections

2018 2019

Page 49:

Next steps

2. Establish Connections

2018 2019

More work on ontology matching and deduplication.

Page 50:

Next steps

2. Establish Connections

2018 2019

More work on author disambiguation.

Page 51:

Next steps

2. Establish Connections

2018 2019

Implement v1.0 of Knowledge Graph Querying.

Example query: find researchers who have first-author publications in ACL or NIPS since 2017, have worked on knowledge base completion or question answering since 2010, and are affiliated with a university in the US.
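The example query can be read as three conditions joined by "and". The sketch below evaluates them over plain in-memory records; the `Researcher` and `Publication` classes, their fields, and the sample data are hypothetical and say nothing about how the planned query system would store or index the graph.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Publication:
    venue: str
    year: int
    first_author: bool
    topics: Set[str]


@dataclass
class Researcher:
    name: str
    affiliation_type: str  # e.g. "university" or "industry"
    country: str
    publications: List[Publication]


VENUES = {"ACL", "NIPS"}
TOPICS = {"knowledge base completion", "question answering"}


def matches(r: Researcher) -> bool:
    """Apply the three conditions from the example query."""
    first_author_recent = any(
        p.first_author and p.venue in VENUES and p.year >= 2017 for p in r.publications
    )
    on_topic = any(p.year >= 2010 and bool(p.topics & TOPICS) for p in r.publications)
    us_university = r.affiliation_type == "university" and r.country == "US"
    return first_author_recent and on_topic and us_university


researchers = [
    Researcher("A", "university", "US", [
        Publication("ACL", 2018, True, {"question answering"}),
    ]),
    Researcher("B", "industry", "US", [
        Publication("NIPS", 2017, True, {"knowledge base completion"}),
    ]),
    Researcher("C", "university", "US", [
        Publication("ACL", 2015, True, {"question answering"}),
    ]),
]
hits = [r.name for r in researchers if matches(r)]
```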

Page 52:

Next steps

3. Macro Analysis

2018 2019

Page 53:

Next steps

3. Macro Analysis

2018 2019

Continue analyzing biases in clinical trials.

Page 54:

Next steps

3. Macro Analysis

2018 2019

Study influence of government vs. industry funding.

Page 55:

Next steps

3. Macro Analysis

2018 2019

Study evolution and influence of research communities.