Finding and Accessing Diagrams in Biomedical Publications

Finding and Accessing Diagrams in Biomedical Publications Tobias Kuhn, ThaiBinh Luong, and Michael Krauthammer Krauthammer Lab, Department of Pathology Yale University School of Medicine AMIA 2012 Annual Symposium 6 November 2012 Chicago

(CC Attribution License does not apply to included third-party material on slides 3, 6, 12, and 19; see the paper for the references: http://www.tkuhn.ch/pub/kuhn2012amia.pdf )

Transcript of Finding and Accessing Diagrams in Biomedical Publications

Page 1: Finding and Accessing Diagrams in Biomedical Publications

Finding and Accessing Diagrams in Biomedical Publications

Tobias Kuhn, ThaiBinh Luong, and Michael Krauthammer

Krauthammer Lab, Department of Pathology, Yale University School of Medicine

AMIA 2012 Annual Symposium, 6 November 2012, Chicago

Page 2: Finding and Accessing Diagrams in Biomedical Publications

Introduction

The inclusion of figure images is a recent trend in the area of literature mining.

The growing number of open access publications makes such images available for automated analysis.

Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.

T. Kuhn, T. Luong, and M. Krauthammer, Yale University Finding and Accessing Diagrams in Biomedical Publications 2 / 20

Page 3: Finding and Accessing Diagrams in Biomedical Publications

Answer Queries with Images

Often, a query is best answered by an image.

For example, WolframAlpha for “growth age 6”:

Idea: Use existing diagrams of scientific articles to answer queries.

Page 4: Finding and Accessing Diagrams in Biomedical Publications

Yale Image Finder

http://krauthammerlab.med.yale.edu/imagefinder/

Page 5: Finding and Accessing Diagrams in Biomedical Publications

Detection and Analysis of Specific Image Types

For the next version of the Yale Image Finder, we are working on the detection and analysis of specific image types:

• Axis Diagrams

• Gel Images

• Network Diagrams (work in progress)

Page 6: Finding and Accessing Diagrams in Biomedical Publications

Axis Diagrams: Examples

Page 7: Finding and Accessing Diagrams in Biomedical Publications

Axis Diagrams

Axis diagrams are important for several reasons:

• They are abundant in biomedical literature: about 38% of all subfigures are axis diagrams

• They follow simple common patterns based on axes

• They are complex in the sense that they combine several dimensions

• They summarize data for human readers

Page 8: Finding and Accessing Diagrams in Biomedical Publications

Axis Diagram Detection Steps

Basic Idea: Large segments are detected as center segments of axis diagrams if surrounded by a number of small label segments.

Figure panels: 1. original, 2. segments, 3. center candidates, 4. label candidates, 5. result
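The basic idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Segment` class, the `gap` helper, and all size and distance thresholds are hypothetical, since the slides do not give concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Axis-aligned bounding box of an image segment (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def gap(a: Segment, b: Segment) -> int:
    """Pixel gap between two boxes; 0 if they touch or overlap."""
    dx = max(a.x - (b.x + b.width), b.x - (a.x + a.width), 0)
    dy = max(a.y - (b.y + b.height), b.y - (a.y + a.height), 0)
    return max(dx, dy)


def find_axis_diagram_centers(segments, min_center_area=10_000,
                              max_label_area=1_500, max_gap=30, min_labels=4):
    """Return large segments surrounded by enough small label segments.

    All thresholds are assumed values chosen for this example.
    """
    centers = []
    for candidate in segments:
        if candidate.area < min_center_area:
            continue  # too small to be a center candidate
        labels = [s for s in segments
                  if s is not candidate
                  and s.area <= max_label_area
                  and gap(candidate, s) <= max_gap]
        if len(labels) >= min_labels:
            centers.append(candidate)
    return centers


# Toy figure: one plot area with four nearby labels, and one unlabeled photo.
plot = Segment(100, 50, 300, 200)
photo = Segment(500, 50, 300, 200)
segments = [plot, photo,
            Segment(60, 60, 30, 12), Segment(60, 220, 30, 12),
            Segment(120, 260, 30, 12), Segment(350, 260, 30, 12)]
print(find_axis_diagram_centers(segments))
```

In this toy figure only the plot area is reported, because the photo has no small segments close to it.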

Page 9: Finding and Accessing Diagrams in Biomedical Publications

Additional Classifiers

To compare and improve our approach, we apply SVM classifiers with the following two types of features:

• Image: texture and histogram features of the bitmap image

• Caption: word vector of the tokenized caption text

These classifiers only act on the complete figure and cannot spot the location of axis diagrams.
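To make the caption feature concrete, here is a minimal sketch of turning caption text into a word-count vector. The tokenization rule (lowercase alphanumeric runs) and the function names are assumptions for illustration; the slides do not say how captions were tokenized or which SVM library was used. Vectors of this kind could then be passed to any SVM implementation.

```python
import re
from collections import Counter


def tokenize(caption):
    """Split a caption into lowercase alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocabulary(captions):
    """Map every token seen in the training captions to a vector index."""
    tokens = sorted({tok for caption in captions for tok in tokenize(caption)})
    return {tok: index for index, tok in enumerate(tokens)}


def caption_vector(caption, vocabulary):
    """Count vector of the caption over a fixed vocabulary."""
    counts = Counter(tokenize(caption))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        index = vocabulary.get(tok)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] = n
    return vector


# Made-up captions, only to show the shape of the feature vectors.
training_captions = ["Growth curve of cells over time", "Western blot of cells"]
vocabulary = build_vocabulary(training_captions)
print(caption_vector("Western blot of cells", vocabulary))
```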

Page 10: Finding and Accessing Diagrams in Biomedical Publications

Results

Evaluation on a random sample of 100 articles from PubMed Central with at least one figure. These 404 figures were manually annotated; they contained 508 axis diagrams.

Task                                      Method                       Precision  Recall  F-score
Detection of figures with axis diagrams   segments                     0.87       0.66    0.75
                                          image                        0.66       0.90    0.76
                                          caption                      0.84       0.77    0.80
                                          image + segments             0.80       0.73    0.76
                                          caption + segments           0.90       0.85    0.88
                                          image + caption              0.85       0.84    0.84
                                          image + caption + segments   0.90       0.89    0.89
Extraction of axis diagram locations      segments                     0.85       0.40    0.54
                                          image + segments             0.84       0.39    0.54
                                          caption + segments           0.88       0.39    0.54
                                          image + caption + segments   0.89       0.39    0.55
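For readers who want to relate the three result columns, the sketch below assumes the reported F-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the slides do not state the variant explicitly. Because the reported precision and recall values are already rounded, recomputing from them can differ from the reported F-score in the last digit for some rows.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the "segments" rows of the results above.
print(round(f_score(0.87, 0.66), 2))  # detection of figures with axis diagrams
print(round(f_score(0.85, 0.40), 2))  # extraction of axis diagram locations
```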

Page 11: Finding and Accessing Diagrams in Biomedical Publications

Gel Images

Gel diagrams are another important type of image:

• They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)

• They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expression under different conditions)

• About 15% of all subfigures are gel images

• They are structured according to common regular patterns

Page 12: Finding and Accessing Diagrams in Biomedical Publications

Relations from Gel Images

Condition     Measurement  Result
MDA-MB-231    14-3-3σ      high expression
NHEM          14-3-3σ      no expression
C8161.9       14-3-3σ      high expression
LOX           14-3-3σ      low expression
MDA-MB-231    β-actin      high expression
NHEM          β-actin      high expression
C8161.9       β-actin      high expression
LOX           β-actin      high expression

Condition                        Measurement  Result
IL-1β (–) DEX (–) RU486 (–)      p-p38        low expression
IL-1β (+) DEX (–) RU486 (–)      p-p38        high expression
IL-1β (–) DEX (+) RU486 (–)      p-p38        no expression
IL-1β (+) DEX (+) RU486 (–)      p-p38        low expression
IL-1β (–) DEX (–) RU486 (+)      p-p38        no expression
IL-1β (+) DEX (–) RU486 (+)      p-p38        high expression
IL-1β (–) DEX (+) RU486 (+)      p-p38        low expression
IL-1β (+) DEX (+) RU486 (+)      p-p38        high expression
...                              ...          ...
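One way to hold such relations in a program is as simple (condition, measurement, result) records. The sketch below is an assumption about a convenient representation, not a format defined by the authors; the `GelRelation` type and the `results_for` helper are hypothetical names. The rows are copied from the first example above.

```python
from typing import NamedTuple


class GelRelation(NamedTuple):
    """One relation read off a gel image (hypothetical record type)."""
    condition: str
    measurement: str
    result: str


relations = [
    GelRelation("MDA-MB-231", "14-3-3σ", "high expression"),
    GelRelation("NHEM", "14-3-3σ", "no expression"),
    GelRelation("C8161.9", "14-3-3σ", "high expression"),
    GelRelation("LOX", "14-3-3σ", "low expression"),
    GelRelation("MDA-MB-231", "β-actin", "high expression"),
    GelRelation("NHEM", "β-actin", "high expression"),
    GelRelation("C8161.9", "β-actin", "high expression"),
    GelRelation("LOX", "β-actin", "high expression"),
]


def results_for(measurement, relations):
    """Map each condition to the result observed for one measurement."""
    return {r.condition: r.result for r in relations
            if r.measurement == measurement}


print(results_for("14-3-3σ", relations))
```

Stored this way, a query such as "in which cell lines is 14-3-3σ not expressed?" becomes a simple filter over the records.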

Page 13: Finding and Accessing Diagrams in Biomedical Publications

Procedure

Figure: processing pipeline, from articles to figures, segments, and text (steps 1 to 3), then to gels (step 4), gel panels (step 5), and named entities (step 6), and finally to relations (step 7).

We focus here on steps 4, 5, and 6. Steps 1, 2, and 3 have been addressed in prior work. Step 7 is future work.

Page 14: Finding and Accessing Diagrams in Biomedical Publications

Gel Segment Detection

Step 4: gels

Random forest classifiers on a number of features of image segments (position, size, grayscale histogram, color, texture, and number of recognized characters).

Results on 1000 manually annotated, random figures:

Threshold               Precision  Recall  F-score
0.15 (high recall)      0.439      0.909   0.592
0.30                    0.765      0.739   0.752
0.60 (high precision)   0.926      0.301   0.455

AUC: 0.980
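The trade-off between the high-recall and high-precision settings comes from applying different cut-offs to the same classifier score. The sketch below illustrates this with made-up scores and labels; only the threshold values 0.15 and 0.60 come from the slide. Treating a score equal to the threshold as a positive is an assumption.

```python
HIGH_RECALL_THRESHOLD = 0.15     # from the slide
HIGH_PRECISION_THRESHOLD = 0.60  # from the slide


def precision_recall(scores, truth, threshold):
    """Precision and recall when segments scoring >= threshold count as gels."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented for illustration): classifier scores and true labels.
toy_scores = [0.05, 0.2, 0.4, 0.7, 0.9]
toy_truth = [False, False, True, True, True]

print(precision_recall(toy_scores, toy_truth, HIGH_RECALL_THRESHOLD))
print(precision_recall(toy_scores, toy_truth, HIGH_PRECISION_THRESHOLD))
```

On the toy data, the lower threshold finds every true gel segment at the cost of a false positive, while the higher threshold makes no false positives but misses one gel segment.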

Page 15: Finding and Accessing Diagrams in Biomedical Publications

Gel Panel Detection

Step 5: gel panels

Algorithm:

• Start with a gel segment according to the high-precision classifier

• Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them

• Collect labels in the form of text segments around the detected gel region
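The three steps of the algorithm can be sketched as a simple region-growing loop. This is an illustration under assumptions, not the authors' code: boxes are `(x0, y0, x1, y1)` tuples, "adjacent" is taken to mean a pixel gap of at most `max_gap`, and the gap limits are invented. The two thresholds reuse the high-precision (0.60) and high-recall (0.15) settings from the gel segment detection slide.

```python
def box_gap(a, b):
    """Pixel gap between two (x0, y0, x1, y1) boxes; 0 if touching or overlapping."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def bounding_box(boxes):
    """Smallest box enclosing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def detect_gel_panels(scored_segments, high_precision=0.60,
                      high_recall=0.15, max_gap=10):
    """Grow panels from high-precision seeds using high-recall neighbours.

    scored_segments is a list of (box, gel_score) pairs; max_gap is assumed.
    """
    seeds = [box for box, score in scored_segments if score >= high_precision]
    candidates = [box for box, score in scored_segments if score >= high_recall]
    used, panels = set(), []
    for seed in seeds:
        if seed in used:
            continue  # already merged into an earlier panel
        panel = [seed]
        used.add(seed)
        grew = True
        while grew:  # repeat until no adjacent candidate is left
            grew = False
            for box in candidates:
                if box not in used and any(box_gap(box, m) <= max_gap for m in panel):
                    panel.append(box)
                    used.add(box)
                    grew = True
        panels.append(bounding_box(panel))
    return panels


def collect_labels(panel, text_boxes, max_gap=40):
    """Text segments lying close to the detected gel region (assumed distance)."""
    return [t for t in text_boxes if box_gap(panel, t) <= max_gap]


# Toy input: three stacked gel strips, one distant weak candidate, one non-gel.
scored = [((0, 0, 100, 20), 0.9), ((0, 25, 100, 45), 0.3), ((0, 50, 100, 70), 0.2),
          ((300, 0, 400, 20), 0.2), ((0, 200, 100, 220), 0.05)]
texts = [(105, 5, 150, 15), (105, 55, 150, 65), (500, 500, 550, 510)]
panels = detect_gel_panels(scored)
print(panels, collect_labels(panels[0], texts))
```

Seeding only from high-precision segments is what keeps precision high: the distant weak candidate in the toy input never becomes a panel on its own.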

Results on another set of 500 manually annotated figures:

Precision Recall F-score

0.951 0.379 0.542

Page 16: Finding and Accessing Diagrams in Biomedical Publications

Named Entity Recognition

Step 6: named entities

Detection of gene and protein names in gel labels from a sample of 2000 random figures (tokenization; case-sensitive Entrez Gene lookup; exclude very short and very common words):

                                               Absolute  Relative
Total                                          156       100.0%
Incorrect                                      54        34.6%
– Not mentioned (OCR errors)                   28        17.9%
– Not references to genes or proteins          26        16.7%
Correct                                        102       65.3%
– Partially correct (could be more specific)   14        9.0%
– Fully correct                                88        56.4%
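The lookup procedure in parentheses can be sketched as follows. Everything concrete here is an assumption for illustration: the tiny `GENE_LEXICON` set stands in for Entrez Gene, and the common-word list, the minimum token length, and the tokenization rule are invented, since the slides do not specify them.

```python
import re

# Hypothetical stand-in for an Entrez Gene symbol list.
GENE_LEXICON = {"TP53", "AKT1", "ACTB", "MAPK14", "WAS", "CAT"}
# Assumed list of very common words (compared in lowercase).
COMMON_WORDS = {"and", "was", "cat", "for", "the"}
MIN_LENGTH = 3  # assumed cut-off for "very short" tokens


def tokenize(label):
    """Split a gel label into alphanumeric tokens, keeping hyphens."""
    return re.findall(r"[A-Za-z0-9-]+", label)


def find_gene_tokens(label, lexicon=GENE_LEXICON,
                     common_words=COMMON_WORDS, min_length=MIN_LENGTH):
    """Return tokens that pass the filters and match the lexicon exactly."""
    hits = []
    for token in tokenize(label):
        if len(token) < min_length:
            continue  # exclude very short tokens
        if token.lower() in common_words:
            continue  # exclude very common words, even if they are gene symbols
        if token in lexicon:  # case-sensitive lookup
            hits.append(token)
    return hits


print(find_gene_tokens("TP53 and AKT1 in WAS lysate"))
```

The example shows why the common-word filter matters: symbols such as WAS also occur as ordinary English words, so matching them blindly would add false positives.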

Page 17: Finding and Accessing Diagrams in Biomedical Publications

Overall Results on PubMed Central

We ran our pipeline on the whole open access subset of PubMed Central:

Total articles                          410 950
Processed articles                      386 428
Total figures from processed articles   1 110 643
Processed figures                       884 152
Detected gel panels                     85 942
Detected gel panels per figure          0.097

Detected gene tokens                    1 854 609
Detected gene tokens in gel labels      75 610
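The derived figure of 0.097 gel panels per figure follows from the counts above, taking processed figures as the denominator. The short sketch below recomputes it and a few other ratios from the same counts; the dictionary keys are names chosen for this example, and the extra ratios are simple derivations rather than numbers reported on the slide.

```python
# Counts copied from the slide.
stats = {
    "total_articles": 410_950,
    "processed_articles": 386_428,
    "total_figures": 1_110_643,
    "processed_figures": 884_152,
    "gel_panels": 85_942,
    "gene_tokens": 1_854_609,
    "gene_tokens_in_gel_labels": 75_610,
}

panels_per_figure = stats["gel_panels"] / stats["processed_figures"]
article_coverage = stats["processed_articles"] / stats["total_articles"]
figure_coverage = stats["processed_figures"] / stats["total_figures"]
gel_label_share = stats["gene_tokens_in_gel_labels"] / stats["gene_tokens"]

print(f"gel panels per processed figure: {panels_per_figure:.3f}")
print(f"share of articles processed:     {article_coverage:.3f}")
print(f"share of figures processed:      {figure_coverage:.3f}")
print(f"gene tokens found in gel labels: {gel_label_share:.3f}")
```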

Page 18: Finding and Accessing Diagrams in Biomedical Publications

Conclusions and Future Work

Conclusions

• The location of certain diagram types, such as axis and gel diagrams, can be extracted at a high precision of about 90%, with an F-score of around 55%

Future Work

• Relation extraction

• Include other image types like network diagrams

• Combination with classical text mining techniques

• Detection of other named entity types: cell lines, drugs, ...

• Sophisticated diagram search interface

• Standard for biomedical diagrams?

Page 19: Finding and Accessing Diagrams in Biomedical Publications

Discussion: Standardized Biomedical Diagrams?

It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.

Do we need a standard for biomedical diagrams? A Unified Modeling Language (UML) for biology and medicine?

Page 20: Finding and Accessing Diagrams in Biomedical Publications

Thank you for your Attention!

Questions?
