2019-02-19 Bradley Allen, Chief Architect, Elsevier
From Content Publishing to Data Solutions via Machine Learning
Presentation to Los Angeles Machine Learning Meetup
Twenty-three years ago
“It’s hard to imagine a sweeter business than publishing academic journals. The editorial content is contributed free of charge by scholars desperate to publish to get tenure. School libraries are automatic customers—professors insist on it. ... Is the party over? It may be nearing its end. The Internet is closing in.”
- Forbes, December 18, 1995
Read this, Search this, Do this
Cell
Fundamentals
Gray’s Anatomy
ScienceDirect
Scopus
ClinicalKey
Reaxys
Sherpath
Mendeley
Knovel
Today: from content publishing to data solutions
Clinicians: ‘You could use this treatment to save a life’
Researchers: ‘This article answers your questions’
Governments: ‘This is the research to invest in’
Pharmaceutical companies: ‘This is the cancer treatment you should pursue’
Nursing students: ‘This is the area you need to improve to qualify’
Our five main customer segments
• Global research spend is growing every year. [1]
− Predicted spend: $1.9TN on research in 2016, up 3.4% from 2015
• Researchers lack the tools they need to be effective. [2]
− Studies: 70-80% of research asks the wrong questions or cannot be reproduced
• Life-saving drugs are expensive to develop. [3]
− $2.5BN median pharmaceutical spend per drug
− 1/20 success rate of drugs
• Health providers cannot save lives without the best information. [4]
− Preventable medical errors: third largest cause of death in the US
− Heart Disease: 611k; Cancer: 585k; Medical Error: 225k; Respiratory Illness: 149k
Sources: 1. Industrial Research Institute 2. The Lancet 3. Tufts 4. World Health Organization
The challenges our customers face
The assets we have at hand
Content
• Chemistry database: 500m published experimental facts
• User queries: 13m monthly users on ScienceDirect
• Books: 35,000 published books
• Drug Database: 100% of drug information from pharmaceutical companies updated daily
• Research: 16% of the world’s research data and articles published by Elsevier
Technology
• 1,000 technologists employed by Elsevier
• Machine learning: Over 1,000 predictive models trained on 1.5 billion electronic health care events
• Machine reading: 475m facts extracted from ScienceDirect
• Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations
• Semantic Enhancement: Knowledge on 50m chemicals captured as 11B facts
How we think about delivering data solutions
Determine the question (including use case and personae)
Describe the data that needs to be produced to address the question
If we have that data, reuse it
If not, use the data we have to create it
If we don’t have data we need, acquire what we’re missing
From Justin O’Beirne, “Google Maps’ Moat – How far ahead of Apple Maps is Google Maps?”, 2017-12. Retrieved from https://www.justinobeirne.com/google-maps-moat on 2018-05-31.
Breaking it down into eight simple steps
1. Market Definition: Determine target market personae & product features
2. Use Case Definition: Describe tasks performed by personae yielding use cases
3. Data & Query Specification: Describe data schemas & features to support use cases
4. Data Acquisition: Acquire content & data in multiple formats from multiple sources
5. Data Enhancement: Extract entities, attributes & relations, map entities to ontologies & taxonomies
6. Data Linking: Link extracted entities with other entities in existing enterprise data
7. Knowledge Graph Construction: Store mapped & linked data for access & discovery
8. Knowledge Delivery: Deliver query & visualisation of data
Knowledge graphs make it all hang together
“I really believe that the key battleground in any industry is that of its knowledge graph. Google has it for media/advertising, Netflix has it for filmed entertainment, Uber has it for inner city transportation, Facebook has it across social media as well as messaging and the multiples speak for themselves.”
Tony Askew, Founder/Partner at REV (personal communication, September 29, 2016)
The role that machine learning (ML) plays
• Our goal is to drive business by enabling better outcomes through:
− Delivery of timely, appropriate advice for decision making & problem solving
− Enhanced discovery and query over massive amounts of information
• We plan to achieve this by using ML to build knowledge graphs that enable the rapid development of data solutions
− Implementing entity/object extraction, relation extraction, entity disambiguation, classification, and sentiment analysis
− Based on the scientific & medical literature, experimental data, and the data exhaust associated with the practice of scientific communication & medical practice
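To make the idea of a knowledge graph concrete, here is a minimal sketch of one stored as subject-predicate-object triples with wildcard pattern matching. The `TripleStore` class and the example triples are illustrative assumptions, not Elsevier's implementation; production systems would use a graph store of the kind named later in the talk.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])


class TripleStore:
    """Hypothetical in-memory triple store, for illustration only."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.object == obj)
        )


# Illustrative triples of the kind relation extraction might produce.
kg = TripleStore()
kg.add("Pulmonary Embolism", "has Clinical Finding", "Dyspnea")
kg.add("Pulmonary Embolism", "has Clinical Finding", "Chest Pain")
kg.add("Glaucoma", "has Clinical Finding", "Visual Field Loss")

# Query: which clinical findings are linked to pulmonary embolism?
findings = [
    t.object
    for t in kg.match(subject="Pulmonary Embolism", predicate="has Clinical Finding")
]
print(findings)
```

Entity extraction and disambiguation decide what the nodes are, relation extraction supplies the edges, and a query such as the one above is what a data solution would run against the result.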
Breaking down our ML efforts
• Early wins
− Deployed systems adding value to existing products and solutions
• Roofshots
− Task-specific use of ML to improve discoverability, knowledge delivery
• Practicalities
− Human-in-the-loop NLP pipelines augmented with ML components to scale entity and relation extraction, entity linking for knowledge graph construction
• Moonshots
− Use of multi-task learning architectures to develop a general-purpose approach to question answering from the scientific and medical literature and from experimental data
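One common way to put a human in the loop is to accept high-confidence model output automatically and queue the rest for curators. The sketch below assumes that pattern; the `Extraction` record, the `route` function, and the 0.9 threshold are hypothetical and are not described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    text: str
    label: str
    confidence: float  # model score in [0, 1]


def route(extractions, threshold=0.9):
    """Split model output into auto-accepted items and a human review queue."""
    accepted, review_queue = [], []
    for item in extractions:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            review_queue.append(item)
    # Reviewers see the least certain items first.
    review_queue.sort(key=lambda e: e.confidence)
    return accepted, review_queue


# Made-up candidate extractions with made-up scores.
candidates = [
    Extraction("dyspnea", "ClinicalFinding", 0.97),
    Extraction("PE", "Disease", 0.62),
    Extraction("tachycardia", "ClinicalFinding", 0.88),
]
accepted, review_queue = route(candidates)
print([e.text for e in accepted], [e.text for e in review_queue])
```

Corrections made by reviewers can then be fed back as training data, which is one way such a pipeline scales over time.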
Early win: Recognizing decision graphs in medical content
• ClinicalKey is Elsevier’s flagship medical reference search product
• Clinicians prefer “answers” in the form of tables or flowcharts
− Eliminates need to page through retrieved content to find actionable information
• ClinicalKey provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation
• Solution: automatically classify images in medical content corpus at index time
• Benefits: lower cost and improved user experience
Early win: Recognizing decision graphs in medical content
• Perfect fit for transfer learning approach
− Input to the classifier is an image and output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph
− Image dataset is augmented by producing variations of the training images by rotating, flipping, transposing, jittering, etc.
− Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
− VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
− Last layer is a multinomial logistic regression (or softmax) classifier
• Model trained on 10,167 images with a 70/30 train/test split
• Achieves 93% test set accuracy
− Evaluated image + caption text model but did not get a big performance boost
• Searchable image base used to support training set and model development
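Two pieces of this recipe are easy to show in isolation: the augmentation transforms and the softmax head. The sketch below uses pure Python with nested lists as toy grayscale images. The feature vector stands in for the output of the frozen pre-trained network, and the weights are random rather than trained, so the predicted label carries no meaning. All function names here are assumptions for illustration.

```python
import math
import random

CLASSES = ["Photo", "Radiological", "Data graphic", "Illustration",
           "Microscopy", "Flowchart", "Electrophoresis", "Medical decision graph"]


def flip_horizontal(img):
    return [row[::-1] for row in img]


def transpose(img):
    return [list(col) for col in zip(*img)]


def rotate90(img):
    """Rotate clockwise by 90 degrees: reverse the rows, then transpose."""
    return transpose(img[::-1])


def jitter(img, rng, amount=0.05):
    """Add small uniform noise to each pixel, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.uniform(-amount, amount))) for p in row]
            for row in img]


def augment(img, rng):
    """Return the original image plus four variations of it."""
    return [img, flip_horizontal(img), transpose(img), rotate90(img), jitter(img, rng)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, biases):
    """Multinomial logistic regression head: one weight vector per class."""
    logits = [sum(w * x for w, x in zip(ws, features)) + b
              for ws, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


rng = random.Random(0)
image = [[0.0, 0.5], [1.0, 0.25]]  # toy 2x2 grayscale "image"
variants = augment(image, rng)

features = [0.2, 0.7, 0.1]  # stand-in for features from the frozen network
weights = [[rng.gauss(0, 1) for _ in features] for _ in CLASSES]  # untrained
biases = [0.0] * len(CLASSES)
label, probs = predict(features, weights, biases)
```

In transfer learning only the head's weights are fitted to the new labelled images, which is why a training set of about ten thousand images can be enough.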
Early win: Generating topic pages from scientific content
Take a ScienceDirect article
Take a taxonomy
Find occurrences of concepts
Definition
Snippet(s)
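The steps above can be sketched as plain string matching: split an article into sentences, look for taxonomy labels in each one, and keep the matching sentences as snippets. This is a simplified illustration under stated assumptions. The toy article text, the two-concept taxonomy, and the "is a/an/the" definition heuristic are all made up here, and the talk does not say how definitions are actually selected.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_snippets(article_text, taxonomy):
    """Map each taxonomy concept to the sentences mentioning one of its labels."""
    sentences = split_sentences(article_text)
    snippets = {}
    for concept, labels in taxonomy.items():
        alternatives = "|".join(re.escape(label) for label in labels)
        pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
        hits = [s for s in sentences if pattern.search(s)]
        if hits:
            snippets[concept] = hits
    return snippets


def pick_definition(concept_snippets):
    """Assumed heuristic: prefer a sentence shaped like '<term> is a ...'."""
    for sentence in concept_snippets:
        if re.search(r"\bis (a|an|the)\b", sentence, re.IGNORECASE):
            return sentence
    return concept_snippets[0]


# Toy inputs, not taken from ScienceDirect or any Elsevier taxonomy.
article = (
    "Glaucoma is a group of eye diseases that damage the optic nerve. "
    "Raised intraocular pressure is a major risk factor. "
    "Patients with glaucoma may develop visual field loss."
)
taxonomy = {
    "Glaucoma": ["glaucoma"],
    "Intraocular pressure": ["intraocular pressure", "IOP"],
}

snippets = find_snippets(article, taxonomy)
definition = pick_definition(snippets["Glaucoma"])
print(definition)
```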
Roofshot: Extracting clinically useful relationships from medical content
Pulmonary Embolism → has Clinical Finding → Dyspnea
“Three clinical symptoms were considered to be highly suggestive of PE: recent dyspnoea, recent chest pain and unusual tachycardia >75/min.”
Roofshot: Extracting clinically useful relationships from medical content
• CNN implemented in Keras
• Input: relationship labels and syntax paths linking relation arguments, semantic tagging
• Embed in 64-dimensional space (like Word2vec)
• Compute 1-dimensional convolution to learn path structure
• Perform final softmax activation to predict one of N relations
• Semantic analysis using FPE annotation
• Syntactic analysis with spaCy on Apache Spark
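The architecture described above (embed, convolve along the path, softmax) can be illustrated with a forward pass in pure Python. This is not the Keras model from the talk. The weights are random and untrained, and the filter count, kernel size, ReLU activation, global max pooling, relation label set, and example path tokens are all assumptions of this sketch. Only the 64-dimensional embedding comes from the slide.

```python
import math
import random

EMBED_DIM = 64     # embedding size named on the slide
NUM_FILTERS = 8    # assumption for this sketch
KERNEL_SIZE = 3    # assumption for this sketch
RELATIONS = ["has Clinical Finding", "has Treatment", "no relation"]  # hypothetical


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PathCNN:
    """Forward pass only; weights are random, so predictions are meaningless."""

    def __init__(self, vocab, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.embeddings = matrix(len(vocab), EMBED_DIM)
        # Each filter spans KERNEL_SIZE consecutive token embeddings.
        self.filters = matrix(NUM_FILTERS, KERNEL_SIZE * EMBED_DIM)
        self.out_weights = matrix(len(RELATIONS), NUM_FILTERS)

    def forward(self, path):
        vectors = [self.embeddings[self.index[tok]] for tok in path]
        # Pad short paths with zero vectors so at least one window exists.
        while len(vectors) < KERNEL_SIZE:
            vectors.append([0.0] * EMBED_DIM)
        pooled = []
        for f in self.filters:
            responses = []
            # 1-D convolution: slide the filter along the path, apply ReLU.
            for start in range(len(vectors) - KERNEL_SIZE + 1):
                window = [x for vec in vectors[start:start + KERNEL_SIZE] for x in vec]
                responses.append(max(0.0, sum(w * x for w, x in zip(f, window))))
            # Global max pooling keeps each filter's strongest response.
            pooled.append(max(responses))
        logits = [sum(w * p for w, p in zip(ws, pooled)) for ws in self.out_weights]
        return softmax(logits)


# Made-up dependency path between two tagged arguments (not real spaCy output).
path = ["DISEASE", "pobj", "of", "prep", "suggestive", "appos", "FINDING"]
model = PathCNN(vocab=sorted(set(path)))
probs = model.forward(path)
prediction = RELATIONS[max(range(len(probs)), key=probs.__getitem__)]
```

Convolving over the path lets each filter respond to a short run of dependency labels and words wherever it occurs between the two arguments.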
Roofshot: Assistants for the interpretation of pathology imagery
What we need: Annotated raw images
Notice the multiple subependymal nodules in fig 3.
What we have: Images with their captions
Practicality: Building continuous modeling and quality control into our deployment workflows
Practicality: Training our development squads
Objective: Provide data and software engineers with baseline understanding of ML/DS concepts and techniques
Key Results:
• Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Yellow Belt course from ~5% to 80%
• Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Green Belt course from ~2% to 50%
• Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Blue Belt course from <1% to 20%
Judo Belt Level: Course
• Yellow: Python for Data Science and Machine Learning Bootcamp
• Green: Data Analysis with Pandas and Python
• Blue: One of: Scala and Spark for Big Data and Machine Learning, or Spark and Python for Big Data with PySpark
• Brown: Deep Learning with Tensorflow
Task | Technology | Examples
Market Definition | Web analytics platforms | Adobe Analytics
Market Definition | Sales, marketing & CRM platforms | Salesforce
Use Case Definition | Enterprise wikis | Confluence
Use Case Definition | Collaboration platforms | Slack
Data & Query Specification | Notebook-based data science platforms | Databricks, Jupyter, RunKit
Data Acquisition | Web & LOD crawlers | Nutch
Data Acquisition | ETL tools | Talend, AWS Glue, Apache Spark
Data Acquisition | Workflow automation frameworks | Activiti, Apache Airflow, AWS Mechanical Turk
Data Acquisition | Log-based data processing frameworks | Apache Flume, Logstash, Kafka, AWS Redshift
Data Enhancement | NLP & ML packages & services | Ad hoc string or regular expression matching using standard language libraries, FPE, MedScan, OpenNLP, NLTK
Data Enhancement | Ontology, taxonomy & data management tools | PoolParty, TopBraid Composer, GitLab, GitHub
Data Linking | Entity linking packages & services | Ad hoc string and regular expression matching using standard language libraries, NLP/ML/DL algorithms
Knowledge Graph Construction | Graph stores (including RDF & property graph stores) | GraphDB, JanusGraph, DataStax Graph, Neo4j, AWS Neptune
Knowledge Graph Construction | Search engines | Solr, Elasticsearch, managed search services
Knowledge Graph Construction | Linked data REST servers | Nginx, Apache, AWS Lambda
Knowledge Delivery | Data visualization & query applications | Kibana, Tableau, D3
Knowledge Delivery | Web application frameworks | Angular, React, Express
Practicality: Evaluating and selecting technologies
Moonshot: matrix factorization for relation extraction
p = 83, r = 176
83 x 176 sparse binary-valued matrix with 366 entries
Matrix labels: entity pairs, surface form relations, structured relations
[Pipeline diagram labels: Content, Taxonomy, Open Information Extraction, Triple Extraction, Entity Resolution, Surface form relations, Structured relations, Matrix Construction, Universal schema, Matrix Factorization, Factorization model, Matrix Completion, Predicted relations, Curation, Knowledge graph]
Figures shown on the diagram: 14M articles from ScienceDirect; 920K concepts from EMMeT; 3.3M facts; 475M facts; 49M facts
Example extracted relations for glaucoma:
• glaucoma developed many years after chronic inflammation of uveal tract
• glaucoma develop following chronic inflammation of uveal tract
• glaucoma can appear soon in family history of glaucoma
• glaucoma can appear soon in age over 40
• glaucoma the risk of functional visual field loss
• glaucoma contributing causes of functional visual field loss
• glaucoma contributed to functional visual field loss
• glaucoma is considered the second leading cause of functional visual field loss
• glaucoma remains the second leading cause of functional visual field loss
Latent factor matrix
p = 83, r = 176
83 x 176 real-valued matrix with 14,608 entries
Example relations for glaucoma with entity types and identifiers:
• diseases 2791370 glaucoma have been documented to cause contact dermatitis 3815093 diseases
• diseases 2791370 glaucoma is assessed through evaluation 5415395 qualifier
• diseases 2791370 glaucoma progresses more rapidly than primary open-angle glaucoma 8247149 diseases
• diseases 2791370 glaucoma recommend treatment 5216597 procedures
• diseases 2791370 glaucoma supports the assumption that oxidative stress 8184588 diseases
• diseases 2791370 glaucoma is the death of retinal ganglion cells 8002088 anatomy
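The matrix completion idea can be shown on a toy scale: rows are entity pairs, columns are relations (surface form and structured), and a low-rank factorization fills in a cell that was never observed. The sketch below is a simplification that treats every unobserved cell except one held-out cell as a negative and fits a logistic factorization by stochastic gradient descent. The talk does not specify the loss or sampling scheme, and the pairs, relations, and hyperparameters are made up.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(P, R, i, j):
    """Predicted probability that entity pair i stands in relation j."""
    return sigmoid(sum(a * b for a, b in zip(P[i], R[j])))


def factorize(cells, n_pairs, n_relations, rank=2, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit P (pairs x rank) and R (relations x rank) to the known 0/1 cells."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_pairs)]
    R = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_relations)]
    cells = list(cells)
    for _ in range(epochs):
        rng.shuffle(cells)
        for i, j, target in cells:
            err = target - score(P, R, i, j)
            for k in range(rank):
                p, r = P[i][k], R[j][k]
                # Gradient step on log-loss with L2 regularization.
                P[i][k] += lr * (err * r - reg * p)
                R[j][k] += lr * (err * p - reg * r)
    return P, R


# Columns: three surface form relations and one structured relation (hypothetical).
relations = ["contributed to", "is a leading cause of", "has_complication", "can appear soon in"]
# Rows: toy entity pairs. Row 1 shares surface forms with rows 0 and 4.
matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],  # cell (1, 2) is held out below, not treated as a negative
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
held_out = (1, 2)
cells = [(i, j, float(v))
         for i, row in enumerate(matrix)
         for j, v in enumerate(row)
         if (i, j) != held_out]

P, R = factorize(cells, n_pairs=len(matrix), n_relations=len(relations))
held_out_score = score(P, R, *held_out)
print(round(held_out_score, 3))
```

Because row 1 matches rows 0 and 4 on its known cells, one would expect the held-out structured relation to score high if training behaves as intended. Scoring every cell this way yields the dense real-valued matrix, and the high-scoring unobserved cells become candidate relations for curation.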
Moonshot: Question answering (QA) as THE Problem
“Question answering (QA) is a complex natural language processing task which requires an understanding of the meaning of a text and the ability to reason over relevant facts. Most, if not all, tasks in natural language processing can be cast as a question answering problem …”
• Ankit Kumar, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. The 33rd International Conference on Machine Learning (ICML 2016), 2016
Moonshot: Question answering (QA) as THE Problem
• Currently evaluating BERT (specifically, BioBERT)
• Transformer-style architectures seem to be unreasonably effective
• However, their efficacy with scientific and medical content is unknown
• Early results are inconclusive as reproducibility is challenging
The hidden opportunity
• People that want to do ML need lots and lots of high-quality content & data
− The more the better
• Algorithms for ML are commodities
• Data for ML is expensive
• Commissioning, authoring and curating data for ML is a new digital revenue opportunity
My working hypothesis
As machines become increasingly capable of general-purpose language understanding, the burden of effort in building machine intelligence will shift from writing software to curating content.
Scholarly publishing in the next twenty years
[Diagram: “Publishing content for people” and “Publishing data for machines”; labels: context, knowledge, content, data, learning architecture, model, GPU]
Stepping back: A personal perspective on AI
• We’re in an era of tremendous expectations
− The Deep Learning Era (2010-?)
• I’ve (we’ve) been here before
− The Expert Systems Era (1984-1992)
• How is it different now and what does that mean for our ambitions?
Some things are the same
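• Expensive hardware: Lisp Machines (Expert Systems Era), Cloud TPUs (Deep Learning Era)
• Expensive people: Knowledge engineers (Expert Systems Era), Data scientists (Deep Learning Era)
• Geostrategic calls to action: Japan (Expert Systems Era), China (Deep Learning Era)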
Some things are dramatically different
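• # of computers connected to the Internet: 10^2 (Expert Systems Era), 10^9 (Deep Learning Era)
• FLOPS: 10^9 (Expert Systems Era), 10^14 (Deep Learning Era)
• Bits/USD: 10^4 (Expert Systems Era), 10^9 (Deep Learning Era)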
We’re in a world where it is much, much easier to build and field AI applications...
• A global culture of technology
• (Relative) homogeneity of hardware and software platforms and packages
• Ubiquity of networked hardware
• Open source software packages and platforms
• Open access to published results
• Increased reproducibility and transferability of results
− Though not everywhere
... but we’re beginning to see hints of trouble
• The huge strides of the last decade have been in machine perception, but not in machine reasoning
• While AI systems are cheaper to build and easier to field, they are often still brittle
− Self-driving cars and unjustified levels of trust in automated systems
− Adversarial examples
• Learning from data is yielding unanticipated consequences
− The politics and economics of dataset creation
− Biased models from biased data
• On top of this, we’re witnessing the end of the Moore’s Law era
− Progress through faster, cheaper hardware may be much slower to come
• And, oh yeah... IBM Watson
Should we be worried?
• The Expert Systems Era was perceived as a failure...
− Fielded systems were brittle, hard to maintain, and often didn’t address real customer needs
− Projected short-term benefits for businesses did not live up to the hype
• ... but was it really?
− If it works, it isn’t AI anymore: Ed Feigenbaum’s story about the Pulmonary Function Advisory Expert System
• So no, but we need to be aware that AI applications can involve hard problems that will take many business cycles to solve
− Speech recognition from 1970 to today
− Robot walking from Marc Raibert’s Leg Lab to Boston Dynamics
• Artificial General Intelligence: a problem for another generation, perhaps one in another century
No, but we should be focused and deliberate
• Always work backwards from real customer needs to define an application
− Did I mention IBM Watson?
• Deal with the challenges of data acquisition and quality first
− Be proactive with respect to fairness and privacy
• Design applications to mitigate brittleness
− Start simple
− Augmentation before automation
− Human-in-the-loop
− Alter the problem and environment to make the problem more manageable
• Leverage your differentiating strengths
− Our leadership in editorial processes for high quality trusted content
− Our deep subject matter expertise
Summary
• Knowledge graphs are a way of sharing knowledge between people and machines, and the battleground on which dominance in markets will be established
• We’re using machine learning to build knowledge graphs for use by researchers, medical professionals, and students
• There is much work underway and a lot yet to be done
Thank you
@bradleypallen