HOW CAN BIOMEDICAL TEXT MINING HELP THE NOTIFY LIBRARY€¦ · Biomedical Text Mining Text mining...
Transcript of HOW CAN BIOMEDICAL TEXT MINING HELP THE NOTIFY LIBRARY€¦ · Biomedical Text Mining Text mining...
><HOW CAN BIOTM HELP THE NOITFY LIBRARY
HOW CAN BIOMEDICAL TEXT MINING HELP THE NOTIFY LIBRARY ?
SEMI-AUTOMATED CURATION WORKFLOWINTERACTIVE AND SEMANTIC INDEXING OF CONTENTS
1
Anália Lourenço Augusto Ramoa Florentino Fdez-Riverola
Gael Pérez Rodríguez Jorge Condeço Martín Pérez Pérez
><HOW CAN BIOTM HELP THE NOITFY LIBRARY
Introduction to Biomedical Text Mining
Main tasks/methods and common applications Means to present results/information to users
01 02 03
Contents
3
Live examples of the workflow
Initial ideas Some experiments Discussing utility and available resources
Final brainstorming
common interests collaboration possibilities
><
01Introduction to Biomedical Text
Mining
Text mining refers to the process of deriving high-quality information from text. Typically, natural language processing (NLP), machine learning (data mining) and statistical analysis approaches are used to devise patterns and trends.
Two main areas: Information Retrieval (IR) supports large-scale processes for many general and domian-focused search engines. Information Extraction (IE) has a long experience in processing common English texts.
Biomedical text mining (BioTM) emerged as a “new” domain specialised in the processing of biomedical documents.
Mainly scientific literature, but also patents and other documents of biomedical relevance. IE is focused on entities important to biomedical domains (e.g. genes, proteins and diseases) and is supported by the numerous biomedical vocabularies (e.g. MESH terms, Gene ontology).
Historical perspective
4
><
Providing domain-tailored tools for experts: automatic screening (retrieval + ranking) relevant documents; semi-automatic annotation of domain-relevant concepts; integration and validation (e.g. database contents).
Enhanced analysis abilities: creation of domain-specific search filters; intuitive and visual attractive presentation of document main contents; interactive and user customised navigation of document contents.
HOW CAN BIOTM HELP THE NOITFY LIBRARY
Two major goals of Biomedical Text Mining
5
><GENERAL VIEW OF OUR BIOTM WORKFLOW 6
1 2
Automatic retrieval of
public literature
3
5
4
Learning from expert’s experience
Automatic document
ranking
Interactive and semantic
searches
Manual expert
curation
><
Any library that enables automatic literature retrieval can be integrated in our workflow.
PubMed is perhaps the most accessed public database of biomedical bibliography. The NCBI (National Center for Biotechnology Information) Entrez Utilities Web services are used to automatically access, search for potentially relevant articles, and download publications.
PubMed searches can be made as specific or relaxed as desired. That is, one may choose to maximise the chance of getting relevant document (true positives) even if it means also retrieving more irrelevant documents (false positives). Document relevance will be further evaluated.
GENERAL VIEW OF OUR BIOTM WORKFLOW
Document retrieval from public sources
7
><GENERAL VIEW OF OUR BIOTM WORKFLOW
Document retrieval from public sources
8
PubMed vs Ovid (2012-2015) PubMed 2016 results
05
101520253035404550
2012 2013 2014 2015 2016
#Docum
ents
Year
PubMed Ovid
><
The documents retrieved from PubMed are automatically processed and the predictive ability of the contents is evaluated. Relevance mining model aims to score the retrieved documents and minimise the time that experts spend discarding non relevant documents.
We look for unigrams, bigrams, and co-occurring words features as well as meaningful mentions of domain-specific concepts. The various textual and entity features are integrated in a variable trigonometric threshold linear classifier, which defines a decision surface, where those feature terms with better predictive ability appear differentiated.
Experts help training the relevance mining model by providing a set of good examples of relevant and irrelevant documents from where the model can be inferred (“learn by example”).
GENERAL VIEW OF OUR BIOTM WORKFLOW
Document triage
9
><
Semantic annotations are important to both triage and document search/indexing. Key domain concepts may help to quickly determine document relevance. The annotation of meaningful domain concepts can be used to:
provide contextualised document summary, implement customised search filters, enable systematic and interactive document navigation.
Named entity recognition methods are used to identify mentions of important domain concepts. Automatic recognition can be accomplished by using dictionaries (and other domain lexica) as well as natural language processing and machine learning entity taggers. We explored public NER tools and the terminology available at Notify site to implement a first example of the NER module. We covered species, organs, blood, cells, and tissues.
GENERAL VIEW OF OUR BIOTM WORKFLOW
Semantic annotation
10
><
Experts establish the key guidelines for automating (any!) text processing. Experts validate the relevance mining model. Experts label a meaningful set of examples of relevant and irrelevant documents from where the mining model can be inferred (“learn by example”). Experts validate the outputs of the model until considered acceptable. Experts revise/make semantic annotations. Define the main contents to be extracted from the documents (“information that is really important”). Ensure that semantic annotations are correct, i.e. the concept is properly marked and classified (“domain classification”).
GENERAL VIEW OF OUR BIOTM WORKFLOW
Expert curation
11
><GENERAL VIEW OF OUR BIOTM WORKFLOW
MARKYT
12
Domain experts define/revise concepts and establish document relevance
Experts make sense of the information scattered in text
Good quality annotations have
many uses
http://sing.ei.uvigo.es/privateMarky/
><
Today, Web-based search interfaces are very popular. Yet, presenting large, growing volumes of information in a systematic, intuitive and interactive way is challenging. We propose a search and visualisation interface that is customised for the domain. This interface allows users to formulate queries at different levels of specificity and using domain concepts. Also, network representation offers an intuitive and visually appealing means to observe and navigate a potentially large number of relationships (contents of documents that meet the search). Network topology provides descriptive statistics about the documents and concepts matching the user query and enables the inference of indirect associations.
GENERAL VIEW OF OUR BIOTM WORKFLOW
Search and visualisation
13
><GENERAL VIEW OF OUR BIOTM WORKFLOW 14
Documents association:
looking for similar evidences
Search and visualisation
><GENERAL VIEW OF OUR BIOTM WORKFLOW 15
Concept-driven visualisation and
navigation
Search and visualisation
><
02Notify Library
Where can BioTM help?
Examples of public online applications supported by our BioTM approach. Domain-specific layout and search functionalities, but common core document processing workflow.
Live examples of the workflow
16
HOW CAN BIOTM HELP THE NOITFY LIBRARY
Live Applications http://sing.ei.uvigo.es/antimicrobialCombination/
HOW CAN BIOTM HELP THE NOITFY LIBRARY
Graph visualisation
HOW CAN BIOTM HELP THE NOITFY LIBRARY
Interrelating different info
><
03Final Brainstorming
How can this cooperation evolve?
21
These are only preliminary experiments on what can be achieved with BioTM. Expert input about needs and useful outputs is critical. Also, it is critical to identify high quality vocabulary suitable for automatic concept annotation.
><HOW CAN BIOTM HELP THE NOITFY LIBRARY
A future in common
22
For the Notify Library: integrate the contents of other public libraries; reduce the workload of experts at triage stage; provide additional search tools and new visualisation abilities (better user experience); Expand the coverage of Notify Library at a reduced (human) cost.
For the SING Research Group: be able to interact with domain experts and understand critical needs of these libraries; implement a BioTM workflow for a real-world and routinely used Medical application; co-publishing with Notify group the lessons learned throughout the process.