
2019-02-19

Bradley Allen, Chief Architect, Elsevier

From Content Publishing to Data Solutions via Machine Learning

Presentation to Los Angeles Machine Learning Meetup

Twenty-three years ago

“It’s hard to imagine a sweeter business than publishing academic journals. The editorial content is contributed free of charge by scholars desperate to publish to get tenure. School libraries are automatic customers—professors insist on it. ... Is the party over? It may be nearing its end. The Internet is closing in.”

- Forbes, December 18, 1995

Read this · Search this · Do this

Cell

Fundamentals

Gray’s Anatomy

ScienceDirect

Scopus

ClinicalKey

Reaxys

Sherpath

Mendeley

Knovel

Today: from content publishing to data solutions

Our five main customer segments

• Clinicians: ‘You could use this treatment to save a life’
• Researchers: ‘This article answers your questions’
• Governments: ‘This is the research to invest in’
• Pharmaceutical companies: ‘This is the cancer treatment you should pursue’
• Nursing students: ‘This is the area you need to improve to qualify’

The challenges our customers face

• Global research spend is growing every year [1]
− Predicted spend: $1.9TN on research in 2016, up 3.4% from 2015
• Researchers lack the tools they need to be effective [2]
− Studies: 70-80% of research asks the wrong questions or cannot be reproduced
• Life-saving drugs are expensive to develop [3]
− $2.5BN median pharmaceutical spend per drug
− 1/20 success rate of drugs
• Health providers cannot save lives without the best information [4]
− Preventable medical errors: third largest cause of death in the US
− Heart disease 611k, cancer 585k, medical error 225k, respiratory illness 149k

Sources: 1. Industrial Research Institute 2. The Lancet 3. Tufts 4. World Health Organization

The assets we have at hand

Content
• Chemistry database: 500m published experimental facts
• User queries: 13m monthly users on ScienceDirect
• Books: 35,000 published books
• Drug database: 100% of drug information from pharmaceutical companies, updated daily
• Research: 16% of the world’s research data and articles published by Elsevier

Technology
• 1,000 technologists employed by Elsevier
• Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events
• Machine reading: 475m facts extracted from ScienceDirect
• Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations
• Semantic enhancement: knowledge on 50m chemicals captured as 11B facts

How we think about delivering data solutions

• Determine the question (including use case and personae)
• Describe the data that needs to be produced to address the question
• If we have that data, reuse it
• If not, use the data we have to create it
• If we don’t have data we need, acquire what we’re missing

From Justin O’Beirne, “Google Maps’ Moat – How far ahead of Apple Maps is Google Maps?”, 2017-12. Retrieved from https://www.justinobeirne.com/google-maps-moat on 2018-05-31.


Breaking it down into eight simple steps

• Market Definition: Determine target market personae & product features

• Use Case Definition: Describe tasks performed by personae yielding use cases

• Data & Query Specification: Describe data schemas & features to support use cases

• Knowledge Delivery: Deliver query & visualisation of data

• Data Enhancement: Extract entities, attributes & relations, map entities to ontologies & taxonomies

• Data Linking: Link extracted entities with other entities in existing enterprise data

• Knowledge Graph Construction: Store mapped & linked data for access & discovery

• Data Acquisition: Acquire content & data in multiple formats from multiple sources
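The Data Enhancement, Data Linking and Knowledge Graph Construction steps can be pictured with a very small sketch. This is only an illustration under stated assumptions: the taxonomy entries, the concept identifiers, the `mentions` predicate and the `TripleStore` class are all hypothetical, and a naive substring match stands in for real entity extraction.

```python
# Hypothetical mini-taxonomy: surface form -> concept identifier.
TAXONOMY = {
    "pulmonary embolism": "concept:pulmonary_embolism",
    "dyspnea": "concept:dyspnea",
    "dyspnoea": "concept:dyspnea",  # spelling variant mapped to the same concept
}


def extract_entities(text):
    """Data Enhancement: map taxonomy surface forms found in the text to concepts."""
    lowered = text.lower()
    return sorted({concept for form, concept in TAXONOMY.items() if form in lowered})


class TripleStore:
    """Knowledge Graph Construction: a toy in-memory (subject, predicate, object) store."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def ingest(store, doc_id, text):
    """Data Linking: connect each extracted concept to the document that mentions it."""
    for concept in extract_entities(text):
        store.add(doc_id, "mentions", concept)


store = TripleStore()
ingest(store, "doc:1", "Recent dyspnoea is highly suggestive of pulmonary embolism.")
ingest(store, "doc:2", "Dyspnea has many causes.")

# Knowledge Delivery: which documents mention the dyspnea concept?
print(store.query(p="mentions", o="concept:dyspnea"))
```

Because both spelling variants map to one concept identifier, the query returns both documents even though they use different surface forms.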


Knowledge graphs make it all hang together

“I really believe that the key battleground in any industry is that of its knowledge graph. Google has it for media/advertising, Netflix has it for filmed entertainment, Uber has it for inner city transportation, Facebook has it across social media as well as messaging and the multiples speak for themselves.”

Tony Askew, Founder/Partner at REV (personal communication, September 29, 2016)

The role that machine learning (ML) plays

• Our goal is to drive business by enabling better outcomes through:
− Delivery of timely, appropriate advice for decision making & problem solving
− Enhanced discovery and query over massive amounts of information
• We plan to achieve this by using ML to build knowledge graphs that enable the rapid development of data solutions
− Implementing entity/object extraction, relation extraction, entity disambiguation, classification, and sentiment analysis
− Based on the scientific & medical literature, experimental data, and the data exhaust associated with the practice of scientific communication & medical practice

Breaking down our ML efforts

• Early wins
− Deployed systems adding value to existing products and solutions
• Roofshots
− Task-specific use of ML to improve discoverability, knowledge delivery
• Practicalities
− Human-in-the-loop NLP pipelines augmented with ML components to scale entity and relation extraction, entity linking for knowledge graph construction
• Moonshots
− Use of multi-task learning architectures to develop a general-purpose approach to question answering from the scientific and medical literature and from experimental data
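One common way to put a human in the loop of an extraction pipeline is to accept high-confidence model output automatically and queue the rest for curators. The talk does not say how its pipelines do this, so the sketch below is an assumption throughout: the `Extraction` record, the `route` function and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    subject: str
    relation: str
    obj: str
    confidence: float  # model score in [0, 1]


def route(extractions, threshold=0.9):
    """Split extractions into an auto-accepted list and a human-review queue."""
    accepted, review = [], []
    for extraction in extractions:
        if extraction.confidence >= threshold:
            accepted.append(extraction)
        else:
            review.append(extraction)
    # Curators see the least certain items first.
    review.sort(key=lambda e: e.confidence)
    return accepted, review


candidates = [
    Extraction("Pulmonary Embolism", "has Clinical Finding", "Dyspnea", 0.97),
    Extraction("Pulmonary Embolism", "has Clinical Finding", "Chest pain", 0.71),
    Extraction("Pulmonary Embolism", "has Clinical Finding", "Tachycardia", 0.42),
]
accepted, review = route(candidates)
print([e.obj for e in accepted])
print([e.obj for e in review])
```

Curator decisions on the review queue can then be fed back as training data, which is one way such a pipeline scales over time.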

Early win: Recognizing decision graphs in medical content

• ClinicalKey is Elsevier’s flagship medical reference search product
• Clinicians prefer “answers” in the form of tables or flowcharts
− Eliminates need to page through retrieved content to find actionable information
• ClinicalKey provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation
• Solution: automatically classify images in medical content corpus at index time
• Benefits: lower cost and improved user experience

Early win: Recognizing decision graphs in medical content

• Perfect fit for a transfer learning approach
− Input to the classifier is an image and output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph
− Image dataset is augmented by producing variations of the training images by rotating, flipping, transposing, jittering, etc.
− Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
− VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
− Last layer is a multinomial logistic regression (or softmax) classifier
• Model trained on 10,167 images with a 70/30 train/test split
• Achieves 93% test set accuracy
− Evaluated image + caption text model but did not get a big performance boost
• Searchable image base used to support training set and model development
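Two pieces of the approach above are easy to show in miniature: augmenting an image by flipping, transposing and rotating it, and a softmax (multinomial logistic regression) layer that turns a feature vector into one of the 8 classes. The sketch is pure Python and is not the deployed model. Images are nested lists, the feature vector stands in for the output of the frozen pre-trained layers, and the weights are made-up values chosen only so the example has a clear answer.

```python
import math

CLASSES = ["Photo", "Radiological", "Data graphic", "Illustration",
           "Microscopy", "Flowchart", "Electrophoresis", "Medical decision graph"]


def flip_horizontal(img):
    return [row[::-1] for row in img]


def flip_vertical(img):
    return [row[:] for row in img[::-1]]


def transpose(img):
    return [list(col) for col in zip(*img)]


def rotate90(img):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the original image plus four label-preserving variants."""
    return [img, flip_horizontal(img), flip_vertical(img), transpose(img), rotate90(img)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Softmax layer over features produced by the frozen, pre-trained layers."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=probs.__getitem__)
    return CLASSES[best], probs


image = [[1, 2], [3, 4]]
print(len(augment(image)), "training variants from one image")

# Hypothetical 2-dimensional features and hand-set weights favouring "Flowchart".
features = [1.0, 0.0]
weights = [[0.0, 0.0] for _ in CLASSES]
weights[CLASSES.index("Flowchart")] = [2.0, 0.0]
biases = [0.0] * len(CLASSES)
label, probs = classify(features, weights, biases)
print(label)
```

In a transfer learning setup only the weights of this final layer (and any other replaced layers) are learned from the new images; the reused layers keep what they learned on ImageNet.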

Early win: Generating topic pages from scientific content

Take a ScienceDirect article

Take a taxonomy

Find occurrences of concepts

Definition

Snippet(s)
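The flow on this slide (article, taxonomy, concept occurrences, then a definition and snippets) can be sketched as follows. This is a simplified illustration, not the production method: the taxonomy, the sample text, the regex sentence splitter and the "concept is/are ..." definition heuristic are all assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_topic_pages(article_text, taxonomy):
    """taxonomy maps a concept label to its surface forms (hypothetical structure)."""
    sentences = split_sentences(article_text)
    pages = {}
    for concept, forms in taxonomy.items():
        alternatives = "|".join(re.escape(form) for form in forms)
        mention = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        # Heuristic: a sentence starting "<concept> is/are ..." is a definition candidate.
        defining = re.compile(r"^(?:the\s+)?(?:" + alternatives + r")\s+(?:is|are)\b",
                              re.IGNORECASE)
        hits = [s for s in sentences if mention.search(s)]
        if not hits:
            continue
        definition = next((s for s in hits if defining.search(s)), None)
        snippets = [s for s in hits if s != definition]
        pages[concept] = {"definition": definition, "snippets": snippets}
    return pages


article = (
    "Glaucoma is a group of eye diseases that damage the optic nerve. "
    "Raised intraocular pressure is a major risk factor for glaucoma. "
    "The optic nerve carries signals from the retina to the brain."
)
taxonomy = {"Glaucoma": ["glaucoma"], "Optic nerve": ["optic nerve"]}

pages = build_topic_pages(article, taxonomy)
for concept, page in pages.items():
    print(concept, "|", page["definition"], "|", len(page["snippets"]), "snippet(s)")
```

A concept with mentions but no definition-like sentence still gets a page of snippets, with the definition left empty for another article or a curator to supply.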


Roofshot: Extracting clinically useful relationships from medical content


Pulmonary Embolism → has Clinical Finding → Dyspnea

Three clinical symptoms were considered to be highly suggestive of PE: recent dyspnoea, recent chest pain and unusual tachycardia >75/min.

• CNN implemented in Keras
• Input: relationship labels and syntax paths linking relation arguments, semantic tagging
• Embed in 64-dimensional space (like Word2vec)
• Compute 1-dimensional convolution to learn path structure
• Perform final softmax activation to predict one of N relations
• Semantic analysis using FPE annotation
• Syntactic analysis with spaCy on Apache Spark
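To make the data flow of such a network concrete, here is a forward pass written in plain Python rather than Keras: embed each token of the syntax path, slide a 1-dimensional convolution over the sequence, max-pool, and apply a softmax over the relation labels. Only the 64-dimensional embedding comes from the slide. The kernel width, filter count, label set, path tokens and the random (untrained) weights are assumptions, so the printed probabilities carry no meaning beyond showing the shapes involved.

```python
import math
import random

EMBED_DIM = 64       # embedding size given on the slide
KERNEL_WIDTH = 3     # assumed convolution window (not stated in the talk)
NUM_FILTERS = 8      # assumed number of convolution filters
RELATIONS = ["has Clinical Finding", "no relation"]  # hypothetical label set (N = 2)
PAD = "<pad>"


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PathCNN:
    """Untrained 1-D CNN over a token path; weights are random placeholders."""

    def __init__(self, vocab, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        tokens = [PAD] + sorted(set(vocab) - {PAD})
        self.index = {tok: i for i, tok in enumerate(tokens)}
        self.embeddings = rand_matrix(len(tokens), EMBED_DIM)
        self.filters = rand_matrix(NUM_FILTERS, KERNEL_WIDTH * EMBED_DIM)
        self.dense = rand_matrix(len(RELATIONS), NUM_FILTERS)

    def forward(self, path):
        # Pad short paths so at least one convolution window fits.
        padded = list(path) + [PAD] * max(0, KERNEL_WIDTH - len(path))
        # 1. Embed each token (unknown tokens fall back to the padding vector).
        vectors = [self.embeddings[self.index.get(tok, 0)] for tok in padded]
        # 2. 1-D convolution with ReLU over every window of KERNEL_WIDTH tokens.
        feature_maps = []
        for start in range(len(vectors) - KERNEL_WIDTH + 1):
            window = [x for vec in vectors[start:start + KERNEL_WIDTH] for x in vec]
            feature_maps.append([max(0.0, dot(f, window)) for f in self.filters])
        # 3. Global max pooling: keep each filter's strongest response.
        pooled = [max(column) for column in zip(*feature_maps)]
        # 4. Dense layer plus softmax over the relation labels.
        logits = [dot(weights, pooled) for weights in self.dense]
        return softmax(logits)


# Hypothetical dependency path between the two relation arguments.
PATH = ["DISEASE", "pobj", "of", "prep", "suggestive", "nsubj", "symptoms", "appos", "FINDING"]
model = PathCNN(vocab=PATH)
probs = model.forward(PATH)
print(dict(zip(RELATIONS, probs)))
```

Training would adjust the embeddings, filters and dense weights against labeled paths; a framework such as Keras handles that part.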

Roofshot: Assistants for the interpretation of pathology imagery

What we need: Annotated Raw Images

Notice the multiple subependymal nodules in fig 3.

What we have: Images with their captions


Practicality: Building continuous modeling and quality control into our deployment workflows

Practicality: Training our development squads

Objective
• Provide data and software engineers with baseline understanding of ML/DS concepts and techniques

Key Results
• Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Yellow Belt course from ~5% to 80%
• Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Green Belt course from ~2% to 50%
• Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Blue Belt course from <1% to 20%

Judo belt level and course
• Yellow: Python for Data Science and Machine Learning Bootcamp
• Green: Data Analysis with Pandas and Python
• Blue: One of: Scala and Spark for Big Data and Machine Learning, or Spark and Python for Big Data with PySpark
• Brown: Deep Learning with Tensorflow

Practicality: Evaluating and selecting technologies

Task, then technology: examples

• Market Definition
− Web analytics platforms: Adobe Analytics
− Sales, marketing & CRM platforms: Salesforce
• Use Case Definition
− Enterprise wikis: Confluence
− Collaboration platforms: Slack
• Data & Query Specification
− Notebook-based data science platforms: Databricks, Jupyter, RunKit
• Data Acquisition
− Web & LOD crawlers: Nutch
− ETL tools: Talend, AWS Glue, Apache Spark
− Workflow automation frameworks: Activiti, Apache Airflow, AWS Mechanical Turk
− Log-based data processing frameworks: Apache Flume, Logstash, Kafka, AWS Redshift
• Data Enhancement
− NLP & ML packages & services: Ad hoc string or regular expression matching using standard language libraries, FPE, MedScan, OpenNLP, NLTK
− Ontology, taxonomy & data management tools: PoolParty, TopBraid Composer, Gitlab, Github
• Data Linking
− Entity linking packages & services: Ad hoc string and regular expression matching using standard language libraries, NLP/ML/DL algorithms
• Knowledge Graph Construction
− Graph stores (including RDF & property graph stores): GraphDB, JanusGraph, DataStax Graph, Neo4J, AWS Neptune
− Search engines: Solr, ElasticSearch, managed search services
− Linked data REST servers: Nginx, Apache, AWS Lambda
• Knowledge Delivery
− Data visualization & query applications: Kibana, Tableau, D3
− Web application frameworks: Angular, React, Express

Moonshot: matrix factorization for relation extraction

[Figure: universal schema pipeline. Labels shown: Content, Open Information Extraction, Taxonomy Triple Extraction, Entity Resolution, Matrix Construction, Universal schema (surface form relations, structured relations), Factorization model, Matrix Factorization, Matrix Completion, Predicted relations, Curation, Knowledge graph]

• Inputs: 14M articles from ScienceDirect; 920K concepts from EMMeT
• Fact counts shown in the figure: 3.3M facts, 475M facts, 49M facts
• Example matrix (axes labeled entity pairs, surface form relations, structured relations): p = 83, r = 176; 83 x 176 sparse binary-valued matrix with 366 entries
• Latent factor matrix: 83 x 176 real-valued matrix with 14,608 entries

Example surface form relations:
• glaucoma developed many years after chronic inflammation of uveal tract
• glaucoma develop following chronic inflammation of uveal tract
• glaucoma can appear soon in family history of glaucoma
• glaucoma can appear soon in age over 40
• glaucoma the risk of functional visual field loss
• glaucoma contributing causes of functional visual field loss
• glaucoma contributed to functional visual field loss
• glaucoma is considered the second leading cause of functional visual field loss
• glaucoma remains the second leading cause of functional visual field loss

Example extracted triples:
• diseases 2791370 glaucoma have been documented to cause contact dermatitis 3815093 diseases
• diseases 2791370 glaucoma is assessed through evaluation 5415395 qualifier
• diseases 2791370 glaucoma progresses more rapidly than primary open-angle glaucoma 8247149 diseases
• diseases 2791370 glaucoma recommend treatment 5216597 procedures
• diseases 2791370 glaucoma supports the assumption that oxidative stress 8184588 diseases
• diseases 2791370 glaucoma is the death of retinal ganglion cells 8002088 anatomy
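The idea of completing a sparse binary matrix of entity pairs against relations can be tried on a toy scale. The sketch below is a simplified stand-in, not the model used in the talk: it assumes rows are entity pairs and columns are relations, treats unobserved training cells as negatives, and fits a rank-2 logistic factorization by stochastic gradient descent. The row names, the two `STRUCT:` columns, the held-out cells and every hyperparameter are hypothetical; three column names reuse surface forms from the slide.

```python
import math
import random

ROWS = ["pair_A", "pair_B", "pair_C", "pair_D"]  # placeholder entity pairs
COLS = [
    "contributed to",                             # surface form relation
    "is considered the second leading cause of",  # surface form relation
    "STRUCT:causes",                              # hypothetical structured relation
    "developed many years after",                 # surface form relation
    "STRUCT:follows",                             # hypothetical structured relation
]
# Cells observed as true (row index, column index).
OBSERVED = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 3), (2, 4), (3, 3)}
# Cells hidden from training; their scores are the matrix-completion output.
HELD_OUT = {(1, 2), (3, 4)}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def factorize(n_rows, n_cols, observed, held_out,
              rank=2, epochs=300, lr=0.1, reg=0.01, seed=0):
    """Fit row and column latent factors with a logistic loss and L2 penalty."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)
             if (i, j) not in held_out]
    for _ in range(epochs):
        rng.shuffle(cells)
        for i, j in cells:
            target = 1.0 if (i, j) in observed else 0.0
            error = target - sigmoid(sum(a * b for a, b in zip(U[i], V[j])))
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (error * v - reg * u)
                V[j][k] += lr * (error * u - reg * v)
    return U, V


def score(U, V, i, j):
    """Predicted probability that relation j holds for entity pair i."""
    return sigmoid(sum(a * b for a, b in zip(U[i], V[j])))


U, V = factorize(len(ROWS), len(COLS), OBSERVED, HELD_OUT)
for i, j in sorted(HELD_OUT):
    print(ROWS[i], "|", COLS[j], "|", round(score(U, V, i, j), 3))
```

Here `pair_B` shares its surface form relations with `pair_A`, so the held-out structured relation that `pair_A` has should score higher for `pair_B` than the one seen only with `pair_C`. Predictions of this kind are what a curation step would review before they enter the knowledge graph.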

Moonshot: Question answering (QA) as THE Problem

“Question answering (QA) is a complex natural language processing task which requires an understanding of the meaning of a text and the ability to reason over relevant facts. Most, if not all, tasks in natural language processing can be cast as a question answering problem …”

• Ankit Kumar, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. The 33rd International Conference on Machine Learning (ICML 2016), 2016

Moonshot: Question answering (QA) as THE Problem

• Currently evaluating BERT (specifically, BioBERT)
• Transformer-style architectures seem to be unreasonably effective
• However, their efficacy with scientific and medical content is unknown
• Early results are inconclusive as reproducibility is challenging

The hidden opportunity

• People that want to do ML need lots and lots of high-quality content & data
− The more the better
• Algorithms for ML are commodities
• Data for ML is expensive
• Commissioning, authoring and curating data for ML is a new digital revenue opportunity

My working hypothesis

As machines become increasingly capable of general-purpose language understanding, the burden of effort in building machine intelligence will shift from writing software to curating content.

Scholarly publishing in the next twenty years

[Figure labels: Publishing content for people; Publishing data for machines; context; knowledge; content; model; learning architecture; data; GPU]

Stepping back: A personal perspective on AI

• We’re in an era of tremendous expectations
− The Deep Learning Era (2010-?)
• I’ve (we’ve) been here before
− The Expert Systems Era (1984-1992)
• How is it different now and what does that mean for our ambitions?

Some things are the same

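• Expensive hardware: Lisp Machines (Expert Systems Era) vs. Cloud TPUs (Deep Learning Era)
• Expensive people: Knowledge engineers (Expert Systems Era) vs. Data scientists (Deep Learning Era)
• Geostrategic calls to action: Japan (Expert Systems Era) vs. China (Deep Learning Era)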

Some things are dramatically different

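• # of computers connected to the Internet: 10^2 (Expert Systems Era) vs. 10^9 (Deep Learning Era)
• FLOPS: 10^9 (Expert Systems Era) vs. 10^14 (Deep Learning Era)
• Bits/USD: 10^4 (Expert Systems Era) vs. 10^9 (Deep Learning Era)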

We’re in a world where it is much, much easier to build and field AI applications...

• A global culture of technology
• (Relative) homogeneity of hardware and software platforms and packages
• Ubiquity of networked hardware
• Open source software packages and platforms
• Open access to published results
• Increased reproducibility and transferability of results
− Though not everywhere

... but we’re beginning to see hints of trouble

• The huge strides of the last decade have been in machine perception, but not in machine reasoning
• While AI systems are cheaper to build and easier to field, they are often still brittle
− Self-driving cars and unjustified levels of trust in automated systems
− Adversarial examples
• Learning from data is yielding unanticipated consequences
− The politics and economics of dataset creation
− Biased models from biased data
• On top of this, we’re witnessing the end of the Moore’s Law era
− Progress through faster, cheaper hardware may be much slower to come
• And, oh yeah... IBM Watson

Should we be worried?

• The Expert Systems Era was perceived as a failure...
− Fielded systems were brittle, hard to maintain, and often didn’t address real customer needs
− Projected short-term benefits for businesses did not live up to the hype
• ... but was it really?
− If it works, it isn’t AI anymore: Ed Feigenbaum’s story about the Pulmonary Function Advisory Expert System
• So no, but we need to be aware that AI applications can involve hard problems that will take many business cycles to solve
− Speech recognition from 1970 to today
− Robot walking from Marc Raibert’s Leg Lab to Boston Dynamics
• Artificial General Intelligence: a problem for another generation, perhaps one in another century

No, but we should be focused and deliberate

• Always work backwards from real customer needs to define an application
− Did I mention IBM Watson?
• Deal with the challenges of data acquisition and quality first
− Be proactive with respect to fairness and privacy
• Design applications to mitigate brittleness
− Start simple
− Augmentation before automation
− Human-in-the-loop
− Alter the problem and environment to make the problem more manageable
• Leverage your differentiating strengths
− Our leadership in editorial processes for high quality trusted content
− Our deep subject matter expertise

Summary

• Knowledge graphs are a way of sharing knowledge between people and machines, and the battleground on which dominance in markets will be established
• We’re using machine learning to build knowledge graphs for use by researchers, medical professionals, and students
• There is much work underway and a lot yet to be done

Thank you

@bradleypallen
