Utilizing literature for biological discovery

Using Literature for Biological Discovery

Lars Juhl Jensen, EMBL Heidelberg

Introduction

• Why literature mining should not be used on its own
– Biological discoveries are not made by reading papers
– To make biological discoveries, existing scientific literature generally has to be used in combination with other data sources
– An example of how this can be done is the Genes2Diseases server

• Using NLP for interpreting high-throughput experiments
– Many types of genomics-scale data sets are available today, including data on gene expression and protein-protein interactions
– To make discoveries, these data must be analyzed in the context of what is already known
– NLP can be used for obtaining this information from literature
– EMBL and EML are currently finalizing a method for extracting gene regulatory interactions from Medline abstracts

Genes2Diseases: utilizing Medline for finding disease related genes in the human genome

• Each disease is associated with a phenotypic MeSH term and mapped to a chromosomal region using LocusLink

• Within the region, gene functions are assigned by sequence similarity

• Gene functions are linked to chemical substances via RefSeq entries

• Chemical substances are linked to phenotypes by Medline abstracts

• A score of each gene’s relevance for the disease is calculated

http://www.bork.embl.de/g2d
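The chain of links described on this slide (phenotype to chemical substance to gene function to gene) can be illustrated with a small sketch. The slide does not say how the relevance score is calculated, so the max-product rule used here is only an assumption, and every name and weight in the tables is made up for illustration.

```python
# Hypothetical association strengths in [0, 1]; none of these values come from Genes2Diseases.
PHENOTYPE_TO_CHEMICAL = {
    "phenotype P": {"chemical X": 0.8, "chemical Y": 0.3},
}
CHEMICAL_TO_FUNCTION = {
    "chemical X": {"function F1": 0.5},
    "chemical Y": {"function F1": 0.9, "function F2": 0.6},
}
# Functions assigned to the genes in the candidate chromosomal region (illustrative).
GENE_FUNCTIONS = {
    "gene 1": ["function F1"],
    "gene 2": ["function F2"],
    "gene 3": [],
}

def function_scores(phenotype):
    """Score each function by its strongest phenotype -> chemical -> function path."""
    scores = {}
    for chemical, w1 in PHENOTYPE_TO_CHEMICAL.get(phenotype, {}).items():
        for function, w2 in CHEMICAL_TO_FUNCTION.get(chemical, {}).items():
            # Assumed rule: multiply along a path, keep the best path.
            scores[function] = max(scores.get(function, 0.0), w1 * w2)
    return scores

def rank_genes(phenotype, gene_functions):
    """Rank genes by the best score among their assigned functions."""
    fscores = function_scores(phenotype)
    gene_scores = {
        gene: max((fscores.get(f, 0.0) for f in functions), default=0.0)
        for gene, functions in gene_functions.items()
    }
    return sorted(gene_scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_genes("phenotype P", GENE_FUNCTIONS)
for gene, score in ranking:
    print(f"{gene}\t{score:.2f}")
```

A gene with no function linked to the phenotype gets a score of zero, so it falls to the bottom of the list rather than being dropped.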

“Biologists would rather share their toothbrush than share a gene name”

• Lists of synonymous identifiers and names were compiled from
– SWISS-PROT/TrEMBL
– SGD, WormBase, and FlyBase
– BLAST search against UniGene

• Several types of identifiers
– Various database identifiers and accession numbers
– Gene symbols and gene names

• Lack of standardization
– 8+ identifiers per yeast gene
– Many names refer to unrelated genes in different species

The synonym and ortholog lists can be downloaded from: http://www.bork.embl.de/synonyms
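A minimal sketch of how such a synonym list might be used for lookup. The table layout, identifiers, and names below are hypothetical and are not taken from the downloadable files; the point is only that one name can denote different genes in different species, so the species has to be part of the query.

```python
from collections import defaultdict

# Hypothetical rows: (species, canonical identifier, synonyms). Illustrative only.
SYNONYM_ROWS = [
    ("yeast", "Y0001", ["ABC1", "XYZ2"]),
    ("fly", "F0001", ["abc1"]),
]

def build_index(rows):
    """Map each lower-cased name to the set of (species, identifier) it may denote."""
    index = defaultdict(set)
    for species, identifier, synonyms in rows:
        for name in [identifier, *synonyms]:
            index[name.lower()].add((species, identifier))
    return index

def resolve(index, name, species=None):
    """Return candidate identifiers for a name, optionally restricted to one species."""
    hits = index.get(name.lower(), set())
    if species is not None:
        hits = {hit for hit in hits if hit[0] == species}
    return sorted(hits)

index = build_index(SYNONYM_ROWS)
print(resolve(index, "abc1"))           # two candidates: the name is ambiguous
print(resolve(index, "abc1", "yeast"))  # one candidate once the species is known
```

Matching is case-insensitive here, which is itself a design choice: it finds more mentions but can merge names that differ only by case.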

Retraining TreeTagger for Medline abstracts

• The English parameter file distributed with TreeTagger was trained on the UPenn Treebank

• We retrained TreeTagger on the manually annotated GENIA 3.0 corpus (466,179 tokens) adding gene names to the dictionary

• Performance of the two taggers was evaluated on 55,166 tokens not used during training

• Retraining eliminated more than half of all tagging errors
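The comparison of the two taggers can be expressed as a per-token error rate and a relative error reduction. The sketch below uses a made-up eight-token example; the tags and the resulting numbers are illustrative and are not the results from the evaluation on this slide.

```python
def error_rate(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold-standard tag."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be non-empty and of equal length")
    errors = sum(g != p for g, p in zip(gold_tags, predicted_tags))
    return errors / len(gold_tags)

def relative_error_reduction(baseline_rate, retrained_rate):
    """Share of the baseline errors that the retrained tagger eliminates."""
    return (baseline_rate - retrained_rate) / baseline_rate

# Toy held-out data (hypothetical tags, not from GENIA).
gold      = ["NN", "VBZ", "DT", "NN", "IN", "NN", "NN", "VBZ"]
baseline  = ["NN", "NNS", "DT", "JJ", "IN", "JJ", "NN", "NNS"]  # 4 errors
retrained = ["NN", "VBZ", "DT", "NN", "IN", "JJ", "NN", "VBZ"]  # 1 error

base = error_rate(gold, baseline)
new = error_rate(gold, retrained)
print(base, new, relative_error_reduction(base, new))
```

A relative error reduction above 0.5 corresponds to the statement that more than half of all tagging errors were eliminated.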

Tagging is really easy ... compared to extracting the information you are after

• Many ways to write the same thing
– A activates the transcription of B
– B transcription is induced by A
– A is a transcriptional activator of B
– Overexpression of A increases B mRNA levels
– Transcription is enhanced when A binds to the B promoter
– The B promoter contains an A UAS

• Multiple pieces of information and negations in a sentence
– A is a transcriptional activator of B, C, D, E, and F
– B was not suppressed by A
– The A transcription factor affects B but not C
– C phosphorylation of A leads to increased expression of B
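To make the difficulty concrete, here is a deliberately naive sketch that maps a few of the phrasings above to (regulator, relation, target, negated) tuples with regular expressions. It is not the method described in this talk, which works on tagged and chunked text; the pattern set and the tuple layout are assumptions chosen for illustration.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9]*"
# A single name or a list such as "B, C, D, E, and F".
LIST = rf"{NAME}(?:(?:,\s*(?:and\s+)?|\s+and\s+){NAME})*"

# Named groups: reg = regulator, tgt = target(s), neg = optional negation.
PATTERNS = [
    (re.compile(rf"^(?P<reg>{NAME}) activates the transcription of (?P<tgt>{LIST})$"), "activates"),
    (re.compile(rf"^(?P<tgt>{NAME}) transcription is induced by (?P<reg>{NAME})$"), "activates"),
    (re.compile(rf"^(?P<reg>{NAME}) is a transcriptional activator of (?P<tgt>{LIST})$"), "activates"),
    (re.compile(rf"^(?P<tgt>{NAME}) was (?P<neg>not )?suppressed by (?P<reg>{NAME})$"), "represses"),
]

def extract(sentence):
    """Return one tuple per target, or an empty list if no pattern matches."""
    text = sentence.strip().rstrip(".")
    for pattern, relation in PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        negated = match.groupdict().get("neg") is not None
        targets = [t for t in re.findall(NAME, match.group("tgt")) if t != "and"]
        return [(match.group("reg"), relation, t, negated) for t in targets]
    return []

for s in [
    "A activates the transcription of B",
    "B transcription is induced by A",
    "A is a transcriptional activator of B, C, D, E, and F",
    "B was not suppressed by A",
    "Overexpression of A increases B mRNA levels",  # no pattern covers this one
]:
    print(s, "->", extract(s))
```

Every new phrasing needs its own pattern, and sentences that no pattern anticipates are silently missed, which is why surface patterns alone scale poorly.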

“Biologists tend to ask simple questions: Here’s a frog ... is he happy?”

• It is not always clear what a sentence means
– Many biological terms/concepts are poorly defined
– Words are often coined before a subject is understood
– Ambiguous use of terms makes text mining more difficult

• The complexity of biological systems makes it hard to design simple experiments that lead to clear answers
– “Protein A regulates the expression of gene B”
  • Does this mean that protein A is a transcription factor?
  • Or are more indirect regulatory mechanisms allowed?
– “Protein A is a transcriptional activator of B”
  • Can A activate transcription alone?
  • Or only together with certain other proteins?

A mini-ontology of transcription regulation

Entities (boxes)
• generic (gray)
• regulator (yellow)
• activator (red)
• repressor (green)
• target (blue)

Relations (arrows)
• is-a (black)
• part-of (blue)

Events (arrows)
• creates (green)
• binds (red)

[Figure: diagram of the mini-ontology. Entities shown: Gene, Transcript, Gene product, Stable RNA, Promoter, Upstream regulatory element, Upstream activating sequence, Upstream repressing sequence, mRNA, Protein, Transcription regulator, Transcription activator, Transcription repressor, Coding sequence, rRNA, tRNA.]
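An is-a hierarchy over the entities named on this slide can be held in a simple parent table. The arrows of the original diagram are not recoverable from this transcript, so the edges below are assumptions based on the entity names, and the part-of relation and the creates/binds events are left out.

```python
# Assumed is-a edges (child -> parent); the original diagram may differ.
IS_A = {
    "upstream activating sequence": "upstream regulatory element",
    "upstream repressing sequence": "upstream regulatory element",
    "transcription activator": "transcription regulator",
    "transcription repressor": "transcription regulator",
    "transcription regulator": "protein",
    "protein": "gene product",
    "stable rna": "gene product",
    "rrna": "stable rna",
    "trna": "stable rna",
    "mrna": "transcript",
}

def ancestors(term):
    """Follow is-a edges upward and return all more general categories, nearest first."""
    chain = []
    term = term.lower()
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def is_a(term, category):
    """True if term equals category or is a (transitive) specialization of it."""
    term, category = term.lower(), category.lower()
    return term == category or category in ancestors(term)

print(ancestors("Transcription activator"))
print(is_a("tRNA", "Gene product"))
```

With such a table, a noun chunk recognized as a transcription activator also counts as a regulator and as a protein, so a pattern written for the general category matches the specific ones too.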

Parsing abstracts to identify relationships between genes/proteins

• Sentence and word boundaries are identified using Tokenizer

• Our retrained TreeTagger is used for tagging part-of-speech

• Abstracts are chunked with a custom CASS grammar to identify noun and verb chunks

• Noun chunks are categorized according to a mini-ontology

• Lexico-syntactic patterns are used to identify event chunks

Example parse (token, part-of-speech tag, chunk labels):
SRN1 NNPG NXPGSG EVSUPVA
can MD |
suppress SUPV |
rna2 NNPG NXPGPL |
rna3 NNPG | |
rna4 NNPG | |
rna5 NNPG | |
rna6 NNPG | |
and CC | |
rna8 NNPG | |
singly RB
or CC
in IN
pairs NNS

Using text mining of Medline abstracts to support predicted regulatory interactions

• By applying the scheme just described to all Medline abstracts, a set of regulatory interactions in multiple species is obtained

• We will use it to classify protein associations derived from
– Microarray gene expression
– Chromatin IP data
– Physical protein interaction screens (e.g. Y2H and TAP)
– Cross-species analysis of genomic context (STRING)

• To integrate all of these data sources, the list of synonymous gene names and identifiers is needed again, because different data sets use different identifiers
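A minimal sketch of the integration step, assuming each data set is a list of gene pairs and that a synonym table maps every name to one canonical identifier. All identifiers, names, and pairs are hypothetical. Pairs are treated as undirected associations here, which would discard the direction of a regulatory interaction.

```python
from collections import defaultdict

# Hypothetical synonym table: lower-cased name -> canonical identifier.
TO_CANONICAL = {
    "abc1": "G1", "orf0001": "G1",
    "xyz2": "G2", "orf0002": "G2",
    "def3": "G3",
}

# Hypothetical data sets, each using its own naming scheme.
DATASETS = {
    "text mining": [("ABC1", "XYZ2")],
    "microarray": [("ORF0001", "ORF0002"), ("ORF0002", "DEF3"), ("ORF0002", "UNKNOWN9")],
}

def canonical_pair(a, b, mapping):
    """Translate both names; return a sorted pair, or None if either name is unmapped."""
    ids = (mapping.get(a.lower()), mapping.get(b.lower()))
    if None in ids:
        return None
    return tuple(sorted(ids))

def merge_evidence(datasets, mapping):
    """Collect, for each canonical gene pair, the data sources that support it."""
    support = defaultdict(set)
    for source, pairs in datasets.items():
        for a, b in pairs:
            pair = canonical_pair(a, b, mapping)
            if pair is not None:
                support[pair].add(source)
    return {pair: sorted(sources) for pair, sources in support.items()}

merged = merge_evidence(DATASETS, TO_CANONICAL)
for pair, sources in sorted(merged.items()):
    print(pair, sources)
```

Without the synonym table, the text-mined pair and the microarray pair for G1 and G2 would look like two unrelated observations.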

The next step: mining full text scientific papers

• Full text versions of papers from several journals are available in formats suitable for text mining

• It matters which part of a paper a sentence is from
– The abstract has the highest density of descriptive words
– It is followed by the introduction and then the discussion
– The methods section is qualitatively different
– Interestingly, the results section has the lowest density ...

• Our NLP scheme should work on full text papers too

Summary

• I believe literature mining is a powerful tool for studying biology, but it should never be used alone

• Literature mining is much needed to help interpret the massive amounts of data from genomics-scale studies

• We have developed a method for extracting information on gene regulation from Medline abstracts using NLP

• The same methods should be applicable to full text papers, particularly for the introduction and discussion parts

Acknowledgments

• European Media Laboratory GmbH (EML)
– Jasmin Saric
– Isabel Rojas

• European Molecular Biology Laboratory (EMBL)
– Miguel Andrade
– Carolina Perez-Iratxeta
– Parantu Shah
– Peer Bork

• Publications
– C. Perez-Iratxeta, P. Bork and M. A. Andrade, Nature Genetics, 31:316-319, 2002
– P. Shah, C. Perez-Iratxeta, P. Bork and M. A. Andrade, BMC Bioinformatics, 4:20, 2003

• Web resources
– www.bork.embl.de/g2d
– www.bork.embl.de/synonyms
