Image Mining from Gel Diagrams in Biomedical Publications

Tobias Kuhn and Michael Krauthammer

Krauthammer Lab, Department of Pathology, Yale University School of Medicine

5th International Symposium on Semantic Mining in Biomedicine (SMBM)

3 September 2012, Zurich, Switzerland

(CC Attribution License does not apply to included third-party material on slides 5 and 17; see the paper for the references: http://www.tkuhn.ch/pub/kuhn2012smbm.pdf )

Transcript of Image Mining from Gel Diagrams in Biomedical Publications

Page 1: Image Mining from Gel Diagrams in Biomedical Publications

Image Mining from Gel Diagrams in Biomedical Publications

Tobias Kuhn and Michael Krauthammer

Krauthammer Lab, Department of Pathology, Yale University School of Medicine

5th International Symposium on Semantic Mining in Biomedicine (SMBM)

3 September 2012, Zurich, Switzerland

Page 2: Image Mining from Gel Diagrams in Biomedical Publications

Introduction

The inclusion of figure images is a recent trend in the area of literature mining.

The increasing amount of open access publications makes such images available for automated analysis.

Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches.

T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 2 / 19

Page 3: Image Mining from Gel Diagrams in Biomedical Publications

Yale Image Finder

http://krauthammerlab.med.yale.edu/imagefinder/


Page 4: Image Mining from Gel Diagrams in Biomedical Publications

Gel Images

Our approach focuses on gel images:

• They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)

• They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expressions under different conditions)

• About 15% of all subfigures are gel images

• They are structured according to common regular patterns


Page 5: Image Mining from Gel Diagrams in Biomedical Publications

Relations from Gel Images

Condition     Measurement   Result
MDA-MB-231    14-3-3σ       high expression
NHEM          14-3-3σ       no expression
C8161.9       14-3-3σ       high expression
LOX           14-3-3σ       low expression
MDA-MB-231    β-actin       high expression
NHEM          β-actin       high expression
C8161.9       β-actin       high expression
LOX           β-actin       high expression

Condition                       Measurement   Result
IL-1β (–) DEX (–) RU486 (–)     p-p38         low expression
IL-1β (+) DEX (–) RU486 (–)     p-p38         high expression
IL-1β (–) DEX (+) RU486 (–)     p-p38         no expression
IL-1β (+) DEX (+) RU486 (–)     p-p38         low expression
IL-1β (–) DEX (–) RU486 (+)     p-p38         no expression
IL-1β (+) DEX (–) RU486 (+)     p-p38         high expression
IL-1β (–) DEX (+) RU486 (+)     p-p38         low expression
IL-1β (+) DEX (+) RU486 (+)     p-p38         high expression
...                             ...           ...
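The relation tables above follow the grid layout of a gel: one column per condition and one row per measurement. A minimal Python sketch of that mapping is given here for illustration only. The names Relation and relations_from_panel are hypothetical, and the input values are copied from the first table rather than extracted from an image.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One (condition, measurement, result) triple, as in the tables above."""
    condition: str
    measurement: str
    result: str


def relations_from_panel(conditions, rows):
    """Expand a gel grid into relations.

    conditions: one label per lane (column).
    rows: mapping from measurement label to one result per lane.
    """
    relations = []
    for measurement, results in rows.items():
        if len(results) != len(conditions):
            raise ValueError(f"{measurement}: expected {len(conditions)} results")
        for condition, result in zip(conditions, results):
            relations.append(Relation(condition, measurement, result))
    return relations


# Values taken from the first table on this slide.
conditions = ["MDA-MB-231", "NHEM", "C8161.9", "LOX"]
rows = {
    "14-3-3σ": ["high expression", "no expression",
                "high expression", "low expression"],
    "β-actin": ["high expression"] * 4,
}
relations = relations_from_panel(conditions, rows)
for relation in relations:
    print(relation.condition, relation.measurement, relation.result, sep="\t")
```

Four lanes and two measurement rows give the eight relations listed in the first table.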


Page 6: Image Mining from Gel Diagrams in Biomedical Publications

Image Mining Processes

In principle, image mining involves the same processes as classical literature mining [1] (with some subtle but important differences):

• Document categorization (image categorization has to deal with the two-dimensional space of pixels, instead of text)

• Named entity tagging (pinpointing the mention of an entity is more difficult with images; OCR errors have to be considered)

• Fact extraction (analysis of graphical elements instead of parsing complete sentences)

• Collection-wide analysis

[1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics, 67(1-3):7–18.


Page 7: Image Mining from Gel Diagrams in Biomedical Publications

Procedure

[Pipeline diagram: articles → figures → segments → text → gels → gel panels → named entities → relations, connected by steps 1 to 7]

1 Figure Extraction

2 Segmentation

3 Text Recognition

4 Gel Segment Detection

5 Gel Panel Detection

6 Named Entity Recognition

7 Relation Extraction


Page 8: Image Mining from Gel Diagrams in Biomedical Publications

Figure Extraction

[Diagram: step 1, articles → figures]

We use structured XML files of the open access subset of PubMed Central.

(Figure extraction from PDF files or even bitmaps of scanned articles would be more difficult, but definitely feasible.)
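As an illustration of figure extraction from structured XML, the following Python sketch reads fig elements from a small inline sample. It assumes a JATS-style layout (fig elements with label, caption and a graphic element carrying an xlink:href attribute). The sample document, the file names in it and the function name extract_figures are made up for this example; this is not the pipeline's actual code.

```python
import xml.etree.ElementTree as ET

# Attribute name of xlink:href in ElementTree's namespace notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample; real articles are far larger and more varied.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of protein expression.</p></caption>
      <graphic xlink:href="example-1"/>
    </fig>
    <fig id="F2">
      <label>Figure 2</label>
      <caption><p>Growth curves.</p></caption>
      <graphic xlink:href="example-2"/>
    </fig>
  </body>
</article>"""


def extract_figures(xml_text):
    """Return one dict per <fig> element with id, label, caption and image."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        caption = fig.find("caption")
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": ("".join(caption.itertext()).strip()
                        if caption is not None else ""),
            "image": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


figs = extract_figures(SAMPLE)
for fig in figs:
    print(fig["id"], fig["image"], fig["caption"])
```

Because the XML already links each figure to its image file and caption, no layout analysis is needed at this step, which is why PDF or scanned input would be harder.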


Page 9: Image Mining from Gel Diagrams in Biomedical Publications

Segmentation and Text Recognition

[Diagram: steps 2 and 3, figures → segments → text]

For segmentation and text recognition we rely on our previous work [2].

This includes:

• Detection of layout elements

• Text region detection

• OCR (using the Microsoft Document Imaging package of MS Office)

[2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for biomedical images. J. of Biomedical Informatics, 43(6):924–931, December.
Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.


Page 10: Image Mining from Gel Diagrams in Biomedical Publications

Gel Segment Detection

[Diagram: step 4, segments → gels]

Random forest classifiers (based on 75 random trees) on the following features of image segments:

• coordinates of the relative position within the image

• relative and absolute width and height

• 16 grayscale histogram features

• color features: red, green and blue

• 13 texture features

• number of recognized characters
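To make the feature list more concrete, here is a small Python sketch that builds part of such a feature vector for one segment: relative position, relative and absolute size, a 16-bin grayscale histogram and the character count. The exact feature definitions are not given on the slide, so the normalization, the bin boundaries and the function name segment_features are assumptions. Color and texture features are left out, and the resulting vectors would still have to be passed to a random forest implementation.

```python
def segment_features(pixels, x, y, image_width, image_height,
                     n_chars=0, bins=16):
    """Build a partial feature vector for one image segment.

    pixels: rows of grayscale values (0..255) inside the segment.
    x, y: top-left corner of the segment within the full image.
    n_chars: number of characters OCR recognized in the segment.
    """
    height = len(pixels)
    width = len(pixels[0])
    total = width * height

    # Normalized grayscale histogram with equally wide bins (an assumption).
    histogram = [0] * bins
    for row in pixels:
        for value in row:
            histogram[min(value * bins // 256, bins - 1)] += 1
    histogram = [count / total for count in histogram]

    return [
        x / image_width, y / image_height,           # relative position
        width / image_width, height / image_height,  # relative size
        width, height,                               # absolute size
        *histogram,                                  # 16 histogram features
        n_chars,                                     # recognized characters
    ]


# Tiny 2x2 segment placed at (10, 20) in a 100x200 image.
features = segment_features([[0, 255], [128, 128]], 10, 20, 100, 200, n_chars=3)
print(len(features), features[:6])
```

The sketch yields 23 values per segment; the full feature set listed above would be longer.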


Page 11: Image Mining from Gel Diagrams in Biomedical Publications

Gel Segment Detection Results

Manually annotated training and testing sets of 500 random figures each.

Results for three different thresholds:

                  Threshold   Precision   Recall   F-score
high recall       0.15        0.439       0.909    0.592
                  0.30        0.765       0.739    0.752
high precision    0.60        0.926       0.301    0.455

Accuracy (area under ROC curve): 98.0%

Unbalanced set: 3% gel segments vs. 97% non-gel segments
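The F-score column is the harmonic mean of precision and recall, F = 2PR / (P + R). The short Python check below recomputes it from the rounded precision and recall values on this slide; the function name f_score is chosen for this example. Since the inputs are already rounded, the recomputed value for the 0.60 threshold can differ from the reported one in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) as reported on this slide.
reported = [
    (0.15, 0.439, 0.909),
    (0.30, 0.765, 0.739),
    (0.60, 0.926, 0.301),
]
for threshold, precision, recall in reported:
    print(f"threshold {threshold:.2f}: F = {f_score(precision, recall):.3f}")
```

Raising the threshold trades recall for precision, which is why the low threshold is labeled high recall and the high threshold high precision.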


Page 12: Image Mining from Gel Diagrams in Biomedical Publications

Gel Panel Detection

[Diagram: step 5, gels → gel panels]

Algorithm:

• Start with a gel segment according to the high-precision classifier

• Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them

• Collect labels in the form of text segments around the detected gel region
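The first two steps of the algorithm can be sketched as seeded region growing. The Python example below is an illustration under stated assumptions: segments are axis-aligned boxes with a classifier score, the thresholds 0.60 and 0.15 are taken from the high-precision and high-recall rows of the segment detection results, and adjacency is defined as boxes lying within a hypothetical gap of 5 pixels. Label collection is omitted.

```python
from collections import deque

HIGH_PRECISION = 0.60  # seed threshold (from the segment detection results)
HIGH_RECALL = 0.15     # growth threshold (from the segment detection results)


def adjacent(a, b, gap=5):
    """True if boxes (x0, y0, x1, y1) overlap or lie within `gap` pixels."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return (ax0 - gap <= bx1 and bx0 - gap <= ax1 and
            ay0 - gap <= by1 and by0 - gap <= ay1)


def detect_panels(segments, gap=5):
    """segments: list of (box, score). Returns panels as lists of indices."""
    used = set()
    panels = []
    for seed, (_, score) in enumerate(segments):
        if seed in used or score < HIGH_PRECISION:
            continue
        panel = [seed]
        used.add(seed)
        queue = deque([seed])
        while queue:  # grow the panel until no adjacent candidate is left
            current_box = segments[queue.popleft()][0]
            for other, (other_box, other_score) in enumerate(segments):
                if other in used or other_score < HIGH_RECALL:
                    continue
                if adjacent(current_box, other_box, gap):
                    used.add(other)
                    panel.append(other)
                    queue.append(other)
        panels.append(sorted(panel))
    return panels


segments = [
    ((0, 0, 100, 20), 0.9),       # seed
    ((0, 22, 100, 40), 0.2),      # adjacent to the seed, passes high recall
    ((0, 42, 100, 60), 0.1),      # adjacent, but below the high-recall threshold
    ((300, 300, 400, 320), 0.7),  # separate seed
    ((0, 200, 100, 220), 0.3),    # passes high recall, but touches no panel
]
panels = detect_panels(segments)
print(panels)
```

Requiring a high-precision seed keeps isolated false positives out, at the cost of missing panels that have no confidently classified segment.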

Results on another set of 500 manually annotated figures:

Precision Recall F-score

0.951 0.379 0.542


Page 13: Image Mining from Gel Diagrams in Biomedical Publications

Named Entity Recognition

[Diagram: step 6, gel panels → named entities]

Detection of gene and protein names in gel labels:

• Tokenization of gel label texts

• Lookup in Entrez Gene database

• Case-sensitive matching

• Exclude tokens:
  • Less than 3 characters
  • Arabic or Latin numbers
  • Common short words (from a list of the 100 most frequent words in biomedical articles)
  • 22 general words frequently used in gel diagrams (e.g. min, hrs, line, type, protein, DNA)
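The lookup and exclusion rules above can be sketched in Python as follows. All word lists here are tiny stand-ins: GENE_LEXICON replaces the Entrez Gene lookup, COMMON_WORDS replaces the list of the 100 most frequent words, and GEL_WORDS contains only the six examples named on the slide. The tokenization rule and the reading of "Latin numbers" as Roman numerals are assumptions as well.

```python
import re

GENE_LEXICON = {"TP53", "EGFR", "MAPK14"}        # stand-in for Entrez Gene
COMMON_WORDS = {"the", "and", "with", "for"}     # stand-in for the top-100 list
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}
ROMAN = re.compile(r"[IVXLCDM]+")                # assumed meaning of "Latin numbers"


def gene_tokens(label):
    """Return the tokens of a gel label that pass all filters."""
    found = []
    # Split on whitespace and a few punctuation marks (an assumption).
    for token in re.findall(r"[^\s,;:()]+", label):
        if len(token) < 3:
            continue
        if token.isdigit() or ROMAN.fullmatch(token):
            continue
        if token.lower() in COMMON_WORDS or token.lower() in GEL_WORDS:
            continue
        if token in GENE_LEXICON:  # case-sensitive match
            found.append(token)
    return found


print(gene_tokens("EGFR and TP53 30 min III egfr DNA"))
```

Case-sensitive matching means the lowercase token egfr in the example is not accepted, which reduces false matches against ordinary words.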


Page 14: Image Mining from Gel Diagrams in Biomedical Publications

Named Entity Recognition Results

Recognized gene/protein tokens in 2000 random figures:

                                               absolute   relative
Total                                          156        100.0%
Incorrect                                      54         34.6%
– Not mentioned (OCR errors)                   28         17.9%
– Not references to genes or proteins          26         16.7%
Correct                                        102        65.4%
– Partially correct (could be more specific)   14         9.0%
– Fully correct                                88         56.4%


Page 15: Image Mining from Gel Diagrams in Biomedical Publications

Relation Extraction

[Diagram: step 7, named entities → relations]

Relation extraction is future work and we do not have concrete results at this point.

It would involve the following steps:

• Gene/protein name disambiguation

• Identify semantic roles (condition, measurement, ...)

• Quantify degree of expression

Combination with classical text mining techniques seems promising.


Page 16: Image Mining from Gel Diagrams in Biomedical Publications

Overall Results on PubMed Central

We ran our pipeline on the whole open access subset of PubMed Central:

Total articles                          410 950
Processed articles                      386 428
Total figures from processed articles   1 110 643
Processed figures                       884 152
Detected gel panels                     85 942
Detected gel panels per figure          0.097
Detected gel labels                     309 340
Detected gel labels per panel           3.599

Detected gene tokens                    1 854 609
Detected gene tokens in gel labels      75 610
Gene token ratio                        0.033
Gene token ratio in gel labels          0.068
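The two "per" rows follow directly from the counts in the table, as the short Python check below shows. The variable names are chosen for this example. The two coverage fractions are not reported on the slide and are derived here from the article and figure counts. The gene token ratios are not recomputed, because the total token counts they are relative to are not listed.

```python
total_articles = 410_950
processed_articles = 386_428
total_figures = 1_110_643
processed_figures = 884_152
gel_panels = 85_942
gel_labels = 309_340

# Derived rows of the table.
panels_per_figure = gel_panels / processed_figures
labels_per_panel = gel_labels / gel_panels

# Coverage fractions (not on the slide, derived from the counts above).
article_coverage = processed_articles / total_articles
figure_coverage = processed_figures / total_figures

print(f"gel panels per figure: {panels_per_figure:.3f}")
print(f"gel labels per panel:  {labels_per_panel:.3f}")
print(f"articles processed:    {article_coverage:.1%}")
print(f"figures processed:     {figure_coverage:.1%}")
```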


Page 17: Image Mining from Gel Diagrams in Biomedical Publications

Discussion: Standardized Biomedical Diagrams?

It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.

Shouldn’t we standardize biomedical diagrams? A Unified Modeling Language (UML) for biomedicine?


Page 18: Image Mining from Gel Diagrams in Biomedical Publications

Conclusions and Future Work

Conclusions:

• Gel segments can be detected with high accuracy

• Detection of gel panels at high precision

• Gene/protein name recognition in gel labels at satisfactory precision

→ Image mining from gel diagrams is feasible

Future Work:

• Relation extraction

• Combination with classical text mining techniques

• Other named entity types: cell lines, drugs, ...

• Standard for biomedical diagrams?


Page 19: Image Mining from Gel Diagrams in Biomedical Publications

Thank you for your Attention!

Questions?
