NLP & ML Webinar

This webinar is being recorded

Transcript of NLP & ML Webinar

Page 2: NLP & ML Webinar

Natural Language Processing and Machine Learning: Beyond the Hype

A Pistoia Alliance Debates Webinar

Moderated by David Milward, Linguamatics

September 14, 2017

Page 4: NLP & ML Webinar

Poll Question 1: What role do you play in your company?

A. IT

B. Data scientist/bioinformatician

C. Clinical/bench scientist

D. Information professional

E. Other

Page 5: NLP & ML Webinar

© Pistoia Alliance

The Panel

David Milward, Ph.D., CTO, Linguamatics

David Milward is chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining, and a founder of Linguamatics. He has over 20 years' experience of product development, consultancy and research in natural language processing (NLP). After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.

Chengyi Zheng, Ph.D., NLP Specialist, Kaiser Permanente

Chengyi Zheng, PhD, is an NLP specialist at Kaiser Permanente Southern California. He has worked on over 30 research projects using electronic health record (EHR) data from millions of patients. He is the principal investigator of a CDC-funded study involving 5 health care institutions on using NLP in vaccine safety studies. He was the winner of the Kaiser Permanente predictive modeling competition. He took first place in the innovation competition (InnoCentive@Lilly) while serving as a biomedical informatics scientist at Eli Lilly. He was trained in computer science with a concentration on speech recognition. He will share some experiences of using NLP and machine learning on EHR data for outcomes prediction.

Eugene Myshkin, Ph.D., Senior Research Scientist, Clarivate

Eugene Myshkin, PhD, is a senior scientist in bioinformatics at Clarivate Analytics. He has over 15 years' experience in drug discovery, cheminformatics and bioinformatics. He has also been involved in a number of text mining projects, including mining of chemical reagents and antibodies from scientific literature.

September 14, 2017 NLP and ML

Page 6: NLP & ML Webinar


Agenda


• AI, NLP and ML (David)

• Using NLP and ML in clinical research (Chengyi)

• Network and pathway driven machine learning approaches to biomarker discovery and patient stratification (Eugene)

Page 7: NLP & ML Webinar

NLP, AI and Machine Learning

David Milward, PhD

CTO, Linguamatics

2017

Page 8: NLP & ML Webinar

Overview

AI (Artificial Intelligence)

NLP (Natural Language Processing)

− and its applications in life sciences

ML (Machine Learning) and DL (Deep Learning)

NLP to feed ML-based DS (Decision Support)

ML in NLP

[Diagram labels: AI, ML, DL, NLP, DS]

© 2017 Linguamatics


Page 14: NLP & ML Webinar

Artificial Intelligence (AI)

Artificial intelligence is intelligence exhibited by machines

The central goals of AI research include reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects

As machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, leading to the quip “AI is whatever hasn't been done yet”

Wikipedia


Page 15: NLP & ML Webinar

Natural Language Processing (NLP)

Processing of natural languages e.g. English, French, Chinese by computers

NLP is part of AI, but also key to other areas of AI e.g. providing decision support

− If 80% of knowledge is unstructured, we need NLP to get the right information to provide good suggestions

− Currently many AI projects are limited: they can only address questions where there is structured data

− Worse, they often use inappropriate structured data such as ICD billing codes for non-billing tasks

Page 16: NLP & ML Webinar

Find information however it is expressed


Different word, same meaning

cyclosporine

ciclosporin

Neoral

Sandimmune

Different expression, same meaning

Non-smoker

Does not smoke

Does not drink or smoke

Denies tobacco use

Different grammar, same meaning

5mg/kg of cyclosporine daily

5mg/kg/d of cyclosporine

cyclosporine 5mg/kg/day

Same word, different context

Diagnosed with diabetes

Family history of diabetes

No family history of diabetes

NLP
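
The "same word, different context" cases above can be illustrated with a few hand-written rules. This is only a minimal sketch: the cue patterns and the `classify_mention` function are hypothetical, and it is not a description of how Linguamatics I2E works.

```python
import re

# Hypothetical cue patterns; a real system would need a far larger lexicon.
NEGATION = re.compile(r"\b(no|non|denies|does not|without)\b")
FAMILY = re.compile(r"\bfamily history\b")

def classify_mention(sentence: str, concept: str) -> str:
    """Label a concept mention as 'patient', 'family', 'negated' or 'absent'."""
    text = sentence.lower()
    idx = text.find(concept.lower())
    if idx == -1:
        return "absent"
    before = text[:idx]  # only cues that precede the concept are considered
    if NEGATION.search(before):
        return "negated"  # checked first, so "No family history of ..." is negated
    if FAMILY.search(before):
        return "family"
    return "patient"

for s in ("Diagnosed with diabetes",
          "Family history of diabetes",
          "No family history of diabetes"):
    print(s, "->", classify_mention(s, "diabetes"))
```

Rules like these are brittle (for example, they ignore cues that follow the concept), which is part of why the same word cannot simply be matched as a keyword.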

Page 17: NLP & ML Webinar

Represent it in a standardized form

Concept | Text | Normalized Value
Diseases | breast cancer; carcinoma of the breast | Breast Neoplasm
Genes | Raf-1; Raf I | RAF1
Dates | 27th Feb 2014; 2014/02/27 | 20140227
Measurements | 0.2g; Two hundred milligrams | 200 mg
Mutations | Val 158 Met; Val by Met at codon 158 | V158M

Example relation: "nimesulide, a selective COX2 inhibitor, …" → nimesulide inhibits COX2 (Entrez Gene ID: 5743)
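
A few of the normalizations in the table can be sketched with the Python standard library. The function names, the partial amino-acid table and the handful of supported input layouts are assumptions made for illustration; production normalization relies on curated vocabularies rather than a few regular expressions.

```python
import re
from datetime import datetime

# Partial three-letter to one-letter amino acid table (illustration only).
AMINO_ACIDS = {"val": "V", "met": "M", "ala": "A", "gly": "G"}

def normalize_date(text: str) -> str:
    """Return YYYYMMDD for the two date layouts shown in the table."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())  # "27th" -> "27"
    for fmt in ("%d %b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def normalize_mass(text: str) -> str:
    """Return a mass in milligrams, e.g. '0.2g' -> '200 mg'."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(g|mg)\s*", text.lower())
    if not match:
        raise ValueError(f"unrecognized mass: {text!r}")
    value = float(match.group(1)) * (1000 if match.group(2) == "g" else 1)
    return f"{value:g} mg"

def normalize_mutation(text: str) -> str:
    """Return the short form, e.g. 'Val 158 Met' -> 'V158M'."""
    match = re.fullmatch(r"\s*([A-Za-z]{3})\s*(\d+)\s*([A-Za-z]{3})\s*", text)
    if not match:
        raise ValueError(f"unrecognized mutation: {text!r}")
    ref, pos, alt = match.groups()
    return f"{AMINO_ACIDS[ref.lower()]}{pos}{AMINO_ACIDS[alt.lower()]}"

print(normalize_date("27th Feb 2014"), normalize_mass("0.2g"), normalize_mutation("Val 158 Met"))
```

Free-text variants such as "Two hundred milligrams" or "Val by Met at codon 158" are deliberately not handled here; they need linguistic patterns rather than fixed layouts.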

Page 18: NLP & ML Webinar

From Bench to Bedside: NLP Provides Insight

Idea → Basic research → Clinical trials (Phase 1, Phase 2, Phase 3) → Regulatory approval → Patient care

Discovery, Development, Delivery

Business critical questions

What targets are involved in bone cancer?

What companies are patenting a particular technology?

What are the safety risks of my drug?

Where can I site my Phase 1, Phase 3 clinical study?

What are the clinical risks for my patients?

Page 19: NLP & ML Webinar

Direct access to the Unstructured


Weight ≥ 80kg

Below 60 years old

Reports after 2010

With mutation C677T

Cancer patients


Page 20: NLP & ML Webinar

Machine Learning

Machine Learning is used for AI in general and as a technique within NLP

3 main flavours:

− Supervised

− uses annotated data mapping between inputs and outputs

− Semi-supervised

− uses machine analysis but incorporates a human in the loop

− Unsupervised

− uses unannotated data, usually at very large scale.

Recent successes with deep learning approaches based on neural networks for supervised and unsupervised ML, e.g.

− Machine translation using parallel corpora

− Image classification in medicine

Page 21: NLP & ML Webinar

Using NLP to feed other AI

NLP provides access to the 80% of information in unstructured text

Provides a set of potential features to be used in e.g. ML models for Decision Support

Example: building risk models from RWD sets

− Predicting patients at risk of misusing opioid prescription drugs (AMIA November 2017)

− Features extracted by Linguamatics I2E from 8.9 million de-identified medical record full-text transcripts from RealHealthData

− SVM classifier trained on the features to flag patients at risk
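
As a rough illustration of how NLP output becomes input for a classifier, the sketch below turns per-patient sets of extracted concepts into binary feature vectors. The patient IDs and concept names are invented, and `build_feature_matrix` is a hypothetical helper; the actual feature set of the opioid study is not given in the slides.

```python
def build_feature_matrix(patient_concepts):
    """Map {patient_id: set of concepts} to (ids, vocabulary, binary rows)."""
    vocabulary = sorted({c for concepts in patient_concepts.values() for c in concepts})
    ids = sorted(patient_concepts)
    rows = [[1 if c in patient_concepts[pid] else 0 for c in vocabulary] for pid in ids]
    return ids, vocabulary, rows

# Invented example: concepts that an NLP query might have extracted per patient.
extracted = {
    "p1": {"chronic pain", "opioid prescription"},
    "p2": {"opioid prescription", "anxiety"},
    "p3": {"chronic pain"},
}

ids, vocabulary, rows = build_feature_matrix(extracted)
print(vocabulary)
for pid, row in zip(ids, rows):
    print(pid, row)
```

Each row, paired with a known outcome label, is the kind of input a supervised classifier such as an SVM would be trained on.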


Page 22: NLP & ML Webinar

Machine Learning in NLP

Supervised ML

− Requires large-scale, representative annotated documents

− Main paradigm for core NLP components

− For extraction patterns, used in academic systems but less commonly in commercial ones

Semi-supervised ML

− Useful for new tasks or data sets where there is no existing representative annotated data

− Useful where a task is initially ill-defined

− Puts a human in the loop judging suggestions from the machine learning

− Can provide good quality results quickly, e.g. to test whether a feature extracted by NLP is useful for an ML model

Unsupervised ML

− Uses large-scale unannotated data

− Key example is learning the meaning of a word via the context it keeps (word embeddings)
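
The idea of learning a word's meaning from the context it keeps can be shown with simple co-occurrence counts. This is a count-based stand-in for illustration, not a neural word-embedding model, and the three-sentence corpus is invented.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "patient takes cyclosporine daily",
    "patient takes ciclosporin daily",
    "patient denies tobacco use",
]
vecs = context_vectors(corpus)
same_drug = cosine(vecs["cyclosporine"], vecs["ciclosporin"])
different = cosine(vecs["cyclosporine"], vecs["tobacco"])
print(same_drug, different)
```

The two spellings of the drug never co-occur, yet they end up with near-identical vectors because they appear in the same contexts; no annotation was needed.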


Page 23: NLP & ML Webinar

Semi-Supervised ML Approaches

Similar distributions for words and syntactic constructions

Automatically discover what is in the data using an interactive, agile text mining platform such as Linguamatics I2E

A long tail of infrequent cases

− prioritize the more frequent constructions

− generalize to cover items in the tail


Zipf’s Law: the frequency of any word is inversely proportional to its rank in the frequency table
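
Zipf's Law can be checked on any token list by building the rank-frequency table. In the sketch below the token counts are invented and chosen to be exactly Zipfian, so that rank × frequency comes out constant; real corpora only follow the law approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]

# Invented counts: 12, 6, 4, 3 occurrences, i.e. proportional to 1/rank.
tokens = ["pain"] * 12 + ["swelling"] * 6 + ["erythema"] * 4 + ["tophus"] * 3

table = rank_frequency(tokens)
for rank, word, count in table:
    # Under Zipf's Law, rank * count stays roughly constant.
    print(rank, word, count, rank * count)
```

The long tail in the slide corresponds to the many high-rank rows with small counts, which is why prioritizing the frequent constructions first covers most of the data quickly.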

Page 24: NLP & ML Webinar

Semi-Supervised NLP using Linguamatics I2E


Page 25: NLP & ML Webinar

Summary

NLP is critical to success of many ML projects

− access to the unstructured text is key to using ML widely, not just where there is convenient structured data

Semi-supervised approaches to NLP provide an efficient way to capture features for ML projects


Page 26: NLP & ML Webinar

Poll Question 2: What is your company’s primary use for NLP?

A. Early Discovery/ Pre-clinical

B. Clinical

C. Real world data

D. Other

E. Don’t use NLP

Page 27: NLP & ML Webinar

Using NLP and ML in clinical research

Chengyi Zheng, PhD, MS

DEPARTMENT of Research & Evaluation

Page 28: NLP & ML Webinar

Two weeks of records for one patient in an EMR system (timeline, 10/6/2012 to 10/19/2012):

10/6/2012: Imaging: DEXA bone density
10/7/2012: Pt called
10/7/2012: Nurse called back
10/8/2012: Orthopedic office visit. Where: Medical Center, Department
10/8/2012: Progress notes: Reason for visit: Knee pain; Vital signs/BMI/Pain level/History; PE/Findings/Impression/A&P; Dx: ICD-9 code; Nurse exam note
10/9/2012: Lab
10/10/2012: Pre-op dental exam (ext)
10/10/2012: Surgery scheduled
10/11/2012: Office visit
10/11/2012: Rx prescribed
10/11/2012: Office visit: Sinus congestion, ankle itchy. Dx: 401.9 Essential Hypertension, 274.9 Gout, 461.9 Acute Sinusitis
10/12/2012: Picked up the Rx
10/13/2012: Pt missed appt.
10/13/2012: Telephone consult: Healthy bones PN
10/14/2012: Pt emailed: Drug adverse event
10/14/2012: Pt called: cancerous area
10/15/2012: EKG. Dx: Screening
10/15/2012: Ear wax wash
10/16/2012: Procedure: Remove skin
10/16/2012 - 10/19/2012: Hospitalization
10/18/2012: Pathology report out

5 Ws: What, Who, When, Where and Why

Membership length: 70% > 5 years, 50% >10 years.

Page 29: NLP & ML Webinar

5 Ws: What, Who, When, Where and Why

What

– What is the reason for the visit?

– What happened? (pain after a fall, pain after drinking a beer?)

Who

– Who is the caregiver? (primary physician, rheumatologist?)

– What do we know about this patient? (age, race, past medical history, etc.)

Where

– Where did this visit occur?

When

– When did the problem start?

Why

– Why did this problem happen? Possible causes?

Page 30: NLP & ML Webinar

Visual representation of KPSC research databases

Page 31: NLP & ML Webinar

Case study: Identify acute gout flare

Published methods to identify gout flares using claims data

– Clinical coding is unreliable: under-coding, over-coding, too general

– Medication is unreliable:

Drugs for gout maintenance

Drugs also used for other diseases (which share similar symptoms)

NLP has been used to:

– Identify study population and patient information

– Identify and extract clinical variables (genetic, biopsy, radiology)

– Evaluate patient status (disease progression, medication status)

Page 32: NLP & ML Webinar

Solution and challenges (NLP)

Challenges:

– Gout is a chronic disease which can be controlled but not cured

Signs and symptoms could appear in follow-up visits

Differentiate between acute and chronic status

– Gout population is generally older, with comorbidities that share similar symptoms

100+ types of arthritis (> 50 million people)

Joint pain, erythema, and swelling

– Documented information varies across clinical notes

Standard solutions:

– Each search query captures one set of information

– Each search query has its own sensitivity/specificity etc.

– Logic operators combine search results (union, join, etc.)

Difficult to optimize overall sensitivity/specificity etc.
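
The trade-off in combining query results with logic operators can be seen on a toy example. The note IDs, the two queries and their hits are all invented; the point is only that a union tends to raise sensitivity while an intersection tends to raise specificity.

```python
def sensitivity_specificity(predicted, truth, universe):
    """Compute (sensitivity, specificity) from sets of note IDs."""
    tp = len(predicted & truth)
    fn = len(truth - predicted)
    fp = len(predicted - truth)
    tn = len(universe - predicted - truth)
    return tp / (tp + fn), tn / (tn + fp)

notes = set(range(1, 11))          # ten hypothetical clinical notes
true_flares = {1, 2, 3, 4}         # notes that really describe a gout flare
query_symptoms = {1, 2, 3, 5, 6}   # hits of a hypothetical symptom query
query_treatment = {2, 3, 4, 7}     # hits of a hypothetical treatment query

union_result = sensitivity_specificity(query_symptoms | query_treatment, true_flares, notes)
intersection_result = sensitivity_specificity(query_symptoms & query_treatment, true_flares, notes)
print("union:", union_result)
print("intersection:", intersection_result)
```

With many queries and operators the number of possible combinations grows quickly, which is one way to read the slide's remark that the overall figures are difficult to optimize by hand.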


Page 33: NLP & ML Webinar

Mining vs. NLP & ML in clinical research

Steps:

1. Preliminary analysis, estimate feasibility

2. Develop plan, estimate cost

3. Seek permit (government vs. IRB)

4. Mine (mining equipment vs. NLP)

– Focus on completeness (high sensitivity)

– Shallow & deep mining (good specificity)

5. Refine (chemical process vs. ML)

– Improve purity (higher specificity)

6. Manual verification (optional)

7. Deliver to customer

“art and science combined”

“resource-heavy and time-consuming process”

Page 34: NLP & ML Webinar

Solution and challenges (NLP+ML)

Goal:

NLP focuses on sensitivity (information completeness)

– Separate ore from rock

ML focuses on improving specificity

– Improve purity without much loss of sensitivity

Solution:

NLP results as input features to the ML system

– Identify related signs and symptoms

– Identify temporal relationship (when and how long?)

– Identify disease association (related to any other disease?)

– Identify implicit and explicit mention of gout flare

– Identify treatment plan associated with disease onset


Page 35: NLP & ML Webinar

Overview of the system development steps

Study period: 1/1/2007 to 12/31/2010. Patients > 18 years, with a diagnosis of gout and on urate-lowering therapy. Within [-3,+12] months of index date, 599,317 clinical notes for 16,519 patients.

Page 36: NLP & ML Webinar

Overview of the NLP+ML system


Page 37: NLP & ML Webinar

Performance comparisons

(Bar charts; values in %, read from the chart data labels)

Clinical note level gout flare identification

Rater | Sensitivity | Specificity | PPV | NPV
Rheumatologist 1 | 81.1 | 95.4 | 88.3 | 92.2
Rheumatologist 2 | 90.9 | 97.3 | 93 | 96.5
NLP+ML | 84.8 | 92.2 | 81.1 | 93.9

Identify patients with ≥ 1 gout flares

Rater | Sensitivity | Specificity | PPV | NPV
Rheumatologist 1 | 98.5 | 92.9 | 97.1 | 96.3
Rheumatologist 2 | 97.1 | 92.9 | 97.1 | 92.9
NLP+ML | 98.5 | 96.4 | 98.5 | 96.4
ICD-9 | 88.2 | 89.3 | 95.2 | 75.8

Identify patients with ≥ 3 gout flares (refractory gout)

Rater | Sensitivity | Specificity | PPV | NPV
Rheumatologist 1 | 74.2 | 92.3 | 82.1 | 88.2
Rheumatologist 2 | 83.9 | 95.4 | 89.7 | 92.5
NLP+ML | 93.5 | 84.6 | 74.4 | 96.5
ICD-9 | 41.9 | 95.4 | 81.3 | 77.5
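
For readers less familiar with the four measures reported on this slide, the sketch below computes them from confusion-matrix counts. The counts are invented for illustration and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were found
        "specificity": tn / (tn + fp),  # share of non-cases correctly rejected
        "ppv": tp / (tp + fp),          # share of flagged items that are true cases
        "npv": tn / (tn + fn),          # share of unflagged items that are non-cases
    }

# Invented counts for 1,000 reviewed notes.
metrics = diagnostic_metrics(tp=90, fp=20, tn=880, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Unlike sensitivity and specificity, PPV and NPV depend on how common the condition is in the reviewed sample, so they are not directly comparable across populations with different prevalence.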

Page 38: NLP & ML Webinar

Results

Note level (gout flare, n= 599,317):

– NLP: 49,415 positive cases => ML: 18,869 positive cases

Patient level (with ≥ 3 flares, n=16,519):

– Number of patients: 1,402 (NLP+ML) vs. 516 (Claim)

– Sensitivity: 93.5% (NLP+ML) vs. 41.9% (Claim)

Impact:

– Identify refractory disease patients

– Estimate market size (KPSC / US population = 4.5/325 million =1.4%)

– Better disease management, improve quality of life, and help reduce healthcare resource use.

1,402 patients is more manageable than 16,519 patients
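
The market-size arithmetic on this slide can be written out as follows. The extrapolation step assumes that the KPSC cohort is representative of the US population, which is a strong simplification that the slide does not state; the variable names are illustrative.

```python
kpsc_members_millions = 4.5    # KPSC membership, from the slide
us_population_millions = 325   # US population, from the slide
refractory_patients_kpsc = 1402  # patients with >= 3 flares found by NLP+ML

share = kpsc_members_millions / us_population_millions
print(f"KPSC share of US population: {share:.1%}")

# Naive scale-up, valid only if KPSC patients resemble the US population.
us_estimate = refractory_patients_kpsc / share
print(f"Naive US-wide estimate: {us_estimate:,.0f} patients")
```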


Page 39: NLP & ML Webinar


ML in healthcare

Tremendous opportunities

Prediction: high utilizers, risk scores

Identification: cases, outcomes, social needs

Image recognition: pathology and radiology images

– Challenges (Data)

Data quality: dirty, missing data

Heterogeneous data: different systems

Structured, semi-structured and free text data

Image, scanned documents

Genetic and biobank data

– Challenges (People)

Who understands NLP, ML and healthcare

Who understands the complexity of healthcare data


Page 40: NLP & ML Webinar

Poll Question 3: How does your company primarily use machine learning in drug discovery?

A. Target prediction and repositioning

B. Biomarker discovery

C. Patient stratification

D. Other

E. We don’t use machine learning

Page 41: NLP & ML Webinar

Network and pathway driven machine learning approaches to biomarker discovery and patient stratification

Eugene Myshkin, PhD

September 2017

Page 42: NLP & ML Webinar

CLARIVATE ANALYTICS TEXT MINING

• Clarivate Analytics literature data feed

• Comprehensive coverage

– >20,000 journals

– Journal content mirrors: Current Contents; Web of Science; Biosis; International Pharmaceutical Abstracts; Derwent Drug File

– http://ip-science.thomsonreuters.com/cgi-bin/jrnlst/jlresults.cgi?PC=MASTER

• Latest information

– Updated with over 170,921 articles/month, or 2,051,051+ articles/year

• Full text, cover to cover searching of all journals

• Comprehensive synonym collections

• Controlled vocabulary management software to support mining

Page 43: NLP & ML Webinar

CLARIVATE ANALYTICS LIFE SCIENCES SOLUTIONS

Pharmacovigilance Literature Monitoring

Biological and Chemical Reagent Monitoring

Concepts in social media

Automated Curation of Clinical Data

Protein and Gene Variant Monitoring

Page 44: NLP & ML Webinar


USING NLP FOR MANUSCRIPT MATCHING

Analyze citation connections to place the publication in the right journal

Page 45: NLP & ML Webinar

FOCUS OF DRUG DISCOVERY: DRUG, TARGET, DISEASE

These associations can be obtained with NLP, but precision is a problem: a flood of false positives, and the need to hire people just to sort the true alerts from the false ones.

PITFALLS OF NLP FEATURES FOR ML

• 1-10 million features

• Feature vectors are binary and sparse

• Feature redundancy

• Feature selection takes a long time

Page 46: NLP & ML Webinar


METABASE MANUALLY ANNOTATED CONTENT

PUBLICATIONS

(209 for EGF-EGFR interaction)

• Manual annotation from publications

• Team of PhDs, MDs

• Advanced editorial systems

• Controlled vocabularies

• Multiple levels of QC

• More than 400 man-years invested

MOLECULAR INTERACTION

NETWORK:

PATHWAY

~ 1,500,000 molecular interactions

~ 3,000 pathways

Page 47: NLP & ML Webinar

INTEGRATED APPROACH

Pathway knowledge

Pathway-driven approaches

Statistical approaches

1. Target identification or repositioning

2. Biomarker discovery

3. Patient stratification

Page 48: NLP & ML Webinar


Drug toxic but beneficial

Drug toxic but NOT beneficial

Drug NOT toxic and beneficial

Drug NOT toxic and NOT beneficial

Patient stratification

“The most efficient and safe drug for a cohort of patients”

WHY DIFFERENT PATIENT RESPONSE?

Blockbuster strategy

“One drug for all patients”

New strategy is needed

Page 49: NLP & ML Webinar


—HOW CAN PATIENTS BE STRATIFIED?

Mechanism 1 Mechanism 2

Biomarkers Biomarkers

Biomarker – measurable molecular indicator of:

disease subtype/progress

drug efficacy

side effect/toxicity

• Identify subtypes resulting in multiple drug targets rather than one.

• A shift from the presumption of a disease to multiple diseases would reframe the drug development strategy

Page 50: NLP & ML Webinar


ORION BIONETWORKS

Orion Bionetworks (Cohen Veteran Biosciences) is an alliance of world leading organizations in patient care, computational modelling, translational research and patient advocacy that aims to develop open-source computational models for multiple sclerosis and improve upon existing analytical tools for model development.

~186 subjects with gene expression data and clinical parameters like time to relapse, etc

GOALS:

Understand the structure of the population based on the molecular data – identify cohorts of patients whose clinical course differs over time

Build stratification models

Identify new therapeutic targets

Page 51: NLP & ML Webinar


NETWORK/PATHWAY BASED METHODS FOR BIOMARKER DISCOVERY

Page 52: NLP & ML Webinar


1. PATHWAY IDENTIFICATION

— 56 pathways identified

• 136 genes

• 39/136 genes were present in multiple pathways

• 44/136 genes known MS biomarkers or drug targets (p = 5×10⁻⁶)

• individual expression values of each member gene were averaged into a combined z-score

• activity score association with time to relapse in a Cox proportional hazard model was calculated
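
One possible reading of the pathway activity score is sketched below: each gene is z-scored across subjects and the z-scores of a pathway's member genes are averaged per subject. The gene names and expression values are invented, the use of the population standard deviation is an assumption, and the Cox proportional hazards step is not shown.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize one gene's expression values across subjects."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma > 0

def pathway_activity(expression, pathway_genes):
    """expression: {gene: [value per subject]}; returns one score per subject."""
    per_gene = [zscores(expression[gene]) for gene in pathway_genes]
    return [mean(subject_scores) for subject_scores in zip(*per_gene)]

# Invented expression values for two genes measured in three subjects.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [10.0, 20.0, 30.0],
}
scores = pathway_activity(expression, ["GENE_A", "GENE_B"])
print(scores)
```

Averaging over member genes is what makes the score less sensitive to noise in any single gene, which fits the later observation that pathway activities were more robust than gene expression.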

Page 53: NLP & ML Webinar


2. PATIENTS CLUSTERING BY PATHWAYS

Clusters are significantly associated with time to relapse in the presence of important clinical covariates

Patients were clustered into groups by k-means clustering of their pathway activity profiles; k=3 gave the best separation of patient profiles.
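
A minimal k-means (Lloyd's algorithm) sketch on invented pathway activity profiles is given below. Starting the centroids at the first k points is a simplification chosen to keep the example deterministic; practical implementations use random restarts or smarter initialization, and the slides do not say which implementation was used.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with centroids initialized to the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Invented profiles: six patients, activity scores for two pathways.
profiles = [
    (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9),
    (-0.9, -1.0), (0.1, 0.0), (1.1, 1.0),
]
labels, centroids = kmeans(profiles, k=3)
print(labels)
```

Choosing k would normally involve comparing several values with a separation measure; that comparison is not shown here.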

Page 54: NLP & ML Webinar

3. CLASSIFICATION MODEL

– A K-Nearest Neighbor model was previously generated to predict risk groups 1-3 using all biomarkers

– Feature selection was performed by taking the variable importance calculated from the trained KNN model.

– Forward feature selection was then conducted using 10-fold CV, adding features to the model in order of their importance.

– Once this process was complete, the predictive performance was evaluated in terms of the ability of the model to separate the three risk groups

– Final feature set was applied to test data

Signature was reduced from 56 to 13 pathways, containing 65 genes
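
The forward selection procedure can be sketched as below, assuming the importance ranking is already available (the slides do not say how importance was derived from the KNN model). The data, the fold assignment by index and all function names are invented for illustration.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, features, k=3):
    """Majority vote among the k nearest training rows, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, features, folds=10, k=3):
    """Accuracy under k-fold cross-validation; fold = sample index modulo folds."""
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == fold]
        train_idx = [i for i in range(len(X)) if i % folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(
            knn_predict(train_X, train_y, X[i], features, k) == y[i] for i in test_idx
        )
    return correct / len(X)

def forward_select(X, y, ranked_features, folds=10, k=3):
    """Add features in importance order; keep the prefix with the best CV accuracy."""
    best_score, best_features, current = -1.0, [], []
    for feature in ranked_features:
        current = current + [feature]
        score = cv_accuracy(X, y, current, folds, k)
        if score > best_score:  # strict: ties keep the smaller feature set
            best_score, best_features = score, list(current)
    return best_features, best_score

# Invented data: 30 subjects in 3 risk groups; feature 0 separates the groups,
# feature 1 is unrelated noise.
y = [i // 10 for i in range(30)]
X = [[y[i] * 10 + (i % 5) * 0.1, (i * 7) % 11] for i in range(30)]

selected, score = forward_select(X, y, ranked_features=[0, 1])
print(selected, score)
```

Keeping only the shortest prefix that reaches the best cross-validated score mirrors the reduction of the signature to fewer pathways, although the study's exact stopping rule is not given.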

Page 55: NLP & ML Webinar

GENE ONLY MODEL WAS NOT ROBUST TO TEST DATA

PATHWAY BASED APPROACH GENE BASED APPROACH

Page 56: NLP & ML Webinar

CONCLUSIONS

– Signature differentiating between patient cohorts was reduced from 56 to 13 pathways

– This new signature contains 65 genes

– 13 biomarkers could stratify subjects into risk groups with statistically significant differences in time to relapse

– This was validated in test subjects, with results consistent with what was observed in the training cohort

– Pathway activities were more robust than gene expression

Page 57: NLP & ML Webinar

Poll Question 4: What is the greatest barrier to application of NLP/ML at your company?

A. Technical expertise

B. Access to data

C. Data quality

D. Management support/understanding

E. Other

Page 58: NLP & ML Webinar

Poll Question 5: Do you expect an increase in ML within Life Science in the next 2 years?

A. Yes

B. No

C. Don’t Know

Page 59: NLP & ML Webinar

Audience Q&A

Please use the Question function in GoToWebinar

Page 60: NLP & ML Webinar

The next Pistoia Alliance Debates Webinar:

Where will AI/Deep learning have an impact in Life Science & Health?

Moderator: Nick Lynch, with Sean Ekins, CEO, Collaborations Pharmaceuticals Inc; David Pearah, CEO, HDF Group; and Peter Henstock, Pfizer Research

Date: September 27, 2017

Check http://www.pistoiaalliance.org/pistoia-alliance-debates-webinar-series/ for the latest information

Page 61: NLP & ML Webinar

@pistoiaalliance www.pistoiaalliance.org