BioRAT: Extracting Biological Information from Full-length Papers

David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones

Bioinformatics Unit, Department of Computer Science, University College London, UK (Bioinformatics, Vol. 20, no. 17, p. 3206-3213)

Abstract

BioRAT (Biological Research Assistant for Text mining) is a new information extraction (IE) tool, specifically designed to perform biomedical IE. It is able to locate and analyze both abstracts and full-length papers.

Less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper.

BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.06% recall with 51.25% precision on full-length papers.
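
As a reminder of how such figures are usually computed, here is a minimal sketch of the standard precision and recall definitions. The function name and the counts in the usage line are hypothetical and chosen only for illustration; they are not the counts behind the BioRAT results.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) as fractions between 0 and 1.

    precision = correct extractions / all extractions made
    recall    = correct extractions / all target facts
    """
    extracted = true_positives + false_positives
    targets = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / targets if targets else 0.0
    return precision, recall


# Hypothetical counts, NOT taken from the paper.
p, r = precision_recall(true_positives=40, false_positives=40, false_negatives=60)
print(f"precision={p:.2%} recall={r:.2%}")  # precision=50.00% recall=40.00%
```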

1. Introduction (1/2)

Information retrieval (IR) helps researchers to find papers, but it still leaves a large amount of reading to be done.

IE goes one stage further, and analyzes the papers on behalf of the researcher.

BioRAT is given a query and, autonomously, finds a set of papers, reads them and highlights the most relevant facts in each.

1. Introduction (2/2)

BioRAT uses NLP techniques and domain-specific knowledge to search for patterns in documents, with the aim of identifying interesting facts. These facts can then be extracted to produce a database of information, which has a higher ‘information density’ than a pile of papers.

2. System Outline

The user enters a query into BioRAT, which is then passed on to PubMed.

The user is presented with a list of papers, from which they can choose to download abstracts or full-length papers.

The user can apply some pre-existing templates or create their own. In either case, the templates match patterns in the text that contain ‘useful’ information, which is extracted for display to the user and for possible incorporation into a database.

2.1 Web Spidering

BioRAT automatically locates and acquires full-length papers wherever possible, instead of just using abstracts.

It does this via the Internet, by following a series of hyperlinks to find each target paper.

To ensure that the correct paper has been identified, and that the text conversion process has succeeded, the first part of the plain text file is compared with the corresponding abstract obtained directly from PubMed, using a fuzzy string matching routine.

BioRAT only attempts to locate and download PDF papers.
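
The slides do not say which fuzzy matching routine is used. The sketch below is one possible stand-in, using Python's standard `difflib.SequenceMatcher` to measure how much of the PubMed abstract can be found, in order, near the start of the converted text. The function names, the window size and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def abstract_coverage(converted_text, pubmed_abstract, window_factor=3):
    """Fraction of the abstract's characters matched near the start of the paper.

    window_factor is a hypothetical setting: only the first
    window_factor * len(abstract) characters of the paper are inspected,
    which leaves room for a title and author list before the abstract.
    """
    abstract = normalize(pubmed_abstract)
    if not abstract:
        return 0.0
    head = normalize(converted_text)[: window_factor * len(abstract)]
    # autojunk=False keeps difflib from discarding frequent characters.
    matcher = SequenceMatcher(None, abstract, head, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(abstract)


def looks_like_same_paper(converted_text, pubmed_abstract, threshold=0.8):
    """Accept the download if most of the abstract is found (assumed threshold)."""
    return abstract_coverage(converted_text, pubmed_abstract) >= threshold
```

A character-level comparison like this tolerates small conversion errors, such as a stray space inside a word, while a truncated or wrong download cannot cover most of the abstract.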

2.2 IE Engine

The IE engine is based on the GATE toolbox (General Architecture for Text Engineering, http://gate.ac.uk/).

GATE is a general purpose text engineering system, whose modular and flexible design allows us to use it to create a more specialized biological IE system.

Two components of GATE that must be modified for the domain-specific application are gazetteers and templates.

2.2.1 Gazetteers

One task in IE is ‘named entity recognition’, which aims to identify key items within text.

BioRAT incorporates gazetteers from three sources, namely MeSH, Swiss-Prot and hand-made lists.

Two gazetteers were created by hand. One consisted of 30 words describing the interaction of proteins (e.g. ‘bind’, ‘down-regulate’, ‘interact’).

The other consisted of a few further synonyms of proteins not already covered by the other gazetteers.
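
To make the idea of a gazetteer concrete, here is a minimal sketch of list-based entity tagging. It is not GATE's gazetteer component; the `GAZETTEERS` dictionary, its headings and the `tag_entities` function are hypothetical, and the tiny term lists stand in for the much larger MeSH, Swiss-Prot and hand-made lists.

```python
import re

# Hypothetical miniature gazetteers: heading -> set of lower-cased terms.
GAZETTEERS = {
    "PROTEIN": {"pex7p", "pex13p"},
    "INTERACTION": {"bind", "down-regulate", "interact"},
}


def tag_entities(text):
    """Return (token, heading) pairs for every token found in a gazetteer."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    annotations = []
    for token in tokens:
        for heading, terms in GAZETTEERS.items():
            if token.lower() in terms:
                annotations.append((token, heading))
    return annotations


print(tag_entities("Pex7p and Pex13p interact in vivo."))
# [('Pex7p', 'PROTEIN'), ('Pex13p', 'PROTEIN'), ('interact', 'INTERACTION')]
```

This exact-match lookup would miss inflected forms such as ‘interacts’; a fuller system would also compare word stems.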

2.2.2 Templates

A template is a representation of a text pattern that allows us to extract information automatically. It consists of a number of predefined slots to be filled by the system from information contained in the text.

For example, one template matched the sentence ‘Genetic evidence for the interaction of Pex7p and Pex13p is provided …’ and extracted from it the interaction (Pex7p, Pex13p).

The template was: ‘interaction of’ (PROTEIN_1) ‘and’ (PROTEIN_2)
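
The slot-filling idea can be sketched with an ordinary regular expression. This is a stand-in for illustration, not BioRAT's template language: the `PROTEINS` list plays the role of a protein gazetteer, and the named groups play the role of the PROTEIN_1 and PROTEIN_2 slots. All names in the code are hypothetical.

```python
import re

# Hypothetical protein list standing in for a real gazetteer.
PROTEINS = ["Pex7p", "Pex13p"]
protein = "|".join(re.escape(name) for name in PROTEINS)

# Literal text plus two named slots, mirroring
# 'interaction of' (PROTEIN_1) 'and' (PROTEIN_2).
TEMPLATE = re.compile(
    rf"interaction of (?P<PROTEIN_1>{protein}) and (?P<PROTEIN_2>{protein})"
)


def extract_interactions(text):
    """Return a (PROTEIN_1, PROTEIN_2) pair for every template match."""
    return [
        (m.group("PROTEIN_1"), m.group("PROTEIN_2"))
        for m in TEMPLATE.finditer(text)
    ]


sentence = "Genetic evidence for the interaction of Pex7p and Pex13p is provided"
print(extract_interactions(sentence))  # [('Pex7p', 'Pex13p')]
```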

2.3 Template Design Tool

BioRAT includes a template design tool with a graphical user interface, which allows non-expert users to develop templates without having to learn a complex new language.

The token properties used in templates are: the part-of-speech (POS) tag, gazetteer headings, the word stem, and the word itself.
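
One way to picture a template built from these four properties is as a sequence of per-token constraints. The sketch below assumes this representation; the `Token` class, the constraint dictionaries and the matching functions are hypothetical, and the POS tags and stems are assigned by hand rather than produced by a tagger or stemmer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str                        # the word itself
    stem: str                        # the word stem
    pos: str                         # part-of-speech tag
    headings: frozenset = frozenset()  # gazetteer headings, if any


def element_matches(token, constraint):
    """Check one template element, e.g. {'stem': 'interact'}, against a token."""
    for prop, wanted in constraint.items():
        if prop == "heading":
            if wanted not in token.headings:
                return False
        elif getattr(token, prop) != wanted:
            return False
    return True


def match_template(tokens, template):
    """Return start indices where the template matches consecutive tokens."""
    span = len(template)
    return [
        start
        for start in range(len(tokens) - span + 1)
        if all(element_matches(tokens[start + i], c) for i, c in enumerate(template))
    ]


# Hand-annotated tokens for "interaction of Pex7p and Pex13p".
tokens = [
    Token("interaction", "interact", "NN"),
    Token("of", "of", "IN"),
    Token("Pex7p", "pex7p", "NNP", frozenset({"PROTEIN"})),
    Token("and", "and", "CC"),
    Token("Pex13p", "pex13p", "NNP", frozenset({"PROTEIN"})),
]
template = [
    {"stem": "interact"},
    {"word": "of"},
    {"heading": "PROTEIN"},
    {"pos": "CC"},
    {"heading": "PROTEIN"},
]
print(match_template(tokens, template))  # [0]
```

Constraining on the stem or a gazetteer heading rather than the exact word lets one template cover many surface forms.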

3. Experiments

4. Discussion

The density of ‘interesting’ facts found in the abstract is much higher than the corresponding density in the full text.

5. Conclusion

BioRAT: an information extraction system specially designed to process biological research papers.

Feature: it uses full-length papers, rather than being limited to abstracts as previous studies have been.

Results: about 20% recall on abstracts alone; 43% recall and over 50% precision on full-length papers.