Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri

An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems. Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri, presented by Thiago Pardo. USP NLP Group and UFSCar Database Group, São Carlos, BR.

Page 1: Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri

An Environment for Data Analysis in Biomedical Domain:

Information Extraction for Decision Support Systems

Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri

presented by Thiago Pardo

USP NLP Group and UFSCar Database Group, São Carlos, BR

Page 2: Context and Motivation

An Environment for Data Analysis - IEA-AIE2010

http://gbd.dc.ufscar.br

Many electronic documents report experiments, describing:

the treatment adopted
patients with some kind of disease
the number of patients enrolled in the treatment
symptoms and risk factors
positive and negative effects

These reports appear in several transactions and journals, e.g., American Journal of Hematology, Blood, and Haematologica.

06/02/10

Page 3: Context and Motivation

Nowadays, researchers and doctors are not able to process this huge number of documents.

Page 4: Context and Motivation

These documents are in an unstructured format, i.e., plain text, especially in PDF.

It is necessary to transform these data from an unstructured to a structured format in order to submit them to an automatic knowledge discovery process.

Page 5: Goal

Development of an environment called IEDSS-Bio for analyzing data in the biomedical domain, specifically on Sickle Cell Anemia.

The environment supports the expert in making decisions by:

Extracting relevant information from biomedical documents
Storing the information in a data warehouse (DW)
Mining interesting knowledge from the DW

Page 6: Contributions

Theoretical:
Domain knowledge
Methodology of information extraction

Practical:
Resources: collection of documents, dictionary, and rules
Tools: Converter, Information Extraction, Data Warehouse, and Data Mining systems

Page 7: The Environment for Data Analysis

How many patients had clinical improvement and were treated with the hydroxyurea drug?

A significant number of patients under treatment with the hydroxyurea drug tend to have marrow depression.
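A question such as "How many patients had clinical improvement and were treated with the hydroxyurea drug?" is the kind of analytical query a data warehouse is meant to answer. The following is only a sketch: the table `treatment_fact`, its columns, and the three rows are made-up placeholders, not the IEDSS-Bio schema or data, which the slides do not show.

```python
import sqlite3

# Hypothetical, simplified fact table; the real data warehouse schema is not
# given in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE treatment_fact ("
    " paper_id INTEGER, drug TEXT, effect TEXT, n_patients INTEGER)"
)

# Placeholder rows for illustration only (not extracted from any paper).
conn.executemany(
    "INSERT INTO treatment_fact VALUES (?, ?, ?, ?)",
    [
        (1, "hydroxyurea", "clinical improvement", 10),
        (2, "hydroxyurea", "clinical improvement", 5),
        (2, "hydroxyurea", "marrow depression", 7),
    ],
)


def count_patients(connection, drug, effect):
    """Sum the patients reported for a given drug and effect (0 if none)."""
    row = connection.execute(
        "SELECT COALESCE(SUM(n_patients), 0) FROM treatment_fact"
        " WHERE drug = ? AND effect = ?",
        (drug, effect),
    ).fetchone()
    return row[0]


improved = count_patients(conn, "hydroxyurea", "clinical improvement")
print(improved)
```

Parameterized queries (`?` placeholders) keep the drug and effect names out of the SQL string, which matters once the values come from extracted text.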

Page 8: Converter Module

Page 9: Converter Module

Page 10: Information Extraction Module

Processed Sections:

Abstract, Results and Discussion (class of positive and negative effects)
All Sections (class of patient)

Page 11: Sentence Classification

[Figure: sentence classification process. Training: several files of complication sentences, benefit sentences, and other sentences feed the ML techniques, with classes Positive Effect, Negative Effect, and Others. Test: a new text (TXT). Output: set of sentences classified into classes.]

Page 12: Identification of Relevant Information

[Figure labels: Dictionary, Biomedical Database]

Page 13: Identification of Relevant Information

[Figure: Identification of Information Pipeline. Labels: Example of Sentences, Rules, Relevant Information]
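The slides name a dictionary and rules as the resources for identifying relevant information, but the figures showing them did not survive in the transcript. The sketch below illustrates one possible combination of the two. Everything in it is an assumption: the two dictionary entries, the single regular-expression rule, and the example sentence are hypothetical and are not taken from the authors' resources.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical dictionary entries (term -> class); illustrative only.
DICTIONARY: Dict[str, str] = {
    "marrow depression": "negative effect",
    "clinical improvement": "positive effect",
}

# Hypothetical rule: a number immediately followed by the word "patients".
PATIENT_COUNT_RULE = re.compile(r"\b(\d+)\s+patients\b", re.IGNORECASE)


def identify_information(sentence: str) -> Tuple[List[Tuple[str, str]], Optional[int]]:
    """Return the dictionary terms found in the sentence and a patient count."""
    lowered = sentence.lower()
    # Dictionary-based step: plain substring lookup of each known term.
    terms = [(term, label) for term, label in DICTIONARY.items() if term in lowered]
    # Rule-based step: first match of the patient-count pattern, if any.
    match = PATIENT_COUNT_RULE.search(sentence)
    patients = int(match.group(1)) if match else None
    return terms, patients


# Made-up example sentence, not a quotation from any paper.
terms, patients = identify_information("Marrow depression was observed in 12 patients.")
print(terms, patients)
```

Substring lookup is the simplest possible matching strategy. A real dictionary would likely need tokenization and handling of spelling variants.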

Page 14: Experiments: Sentence Classification

1. How do human beings manually perform the sentence classification?

2. Is it feasible to automate the sentence classification task?

3. What kind of classification algorithm performs better in this task?

Page 15: Manual Classification by Humans? (Question 1)

Annotation agreement on 50 sentences, measured with the kappa statistic of Fleiss (1971):

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$
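In the kappa statistic of Fleiss (1971), K = (P(A) - P(E)) / (1 - P(E)), P(A) is the observed agreement among annotators and P(E) is the agreement expected by chance. The sketch below computes it from a table of rating counts. The function name and the toy ratings are illustrative assumptions, not the annotation data from the experiment.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa.

    counts[i][j] is the number of annotators who assigned item i to class j.
    Every item must be rated by the same number of annotators (at least 2).
    Raises ZeroDivisionError when P(E) == 1, where kappa is undefined.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_classes = len(counts[0])

    # P(A): per-item agreement, averaged over all items.
    p_a = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # P(E): chance agreement, from the overall proportion of each class.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_classes)
    )
    return (p_a - p_e) / (1 - p_e)


# Toy example: 4 sentences, 3 annotators, 3 classes (made-up counts).
toy_ratings = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 0, 3]]
print(round(fleiss_kappa(toy_ratings), 3))
```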

Page 16: Is it feasible to automate this task? (Question 2)

Annotator                   All the classes
3 experts                   0.63
3 naïve subjects            0.71
experts + naïve subjects    0.65

Agreement scale of Landis and Koch (1977):

Agreement         Kappa
Poor              under 0
Slight            0 to 0.20
Fair              0.21 to 0.40
Moderate          0.41 to 0.60
Substantial       0.61 to 0.80
Almost Perfect    0.81 to 1

All three agreement values fall in the Substantial range.
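The Landis and Koch (1977) scale can be applied mechanically to a kappa value. The helper below is a small sketch with an assumed function name. Because the scale leaves gaps between ranges (e.g., between 0.20 and 0.21), the code treats each upper bound as inclusive, which is an assumption.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to the Landis and Koch (1977) agreement label."""
    if kappa < 0:
        return "Poor"
    # Upper bounds are treated as inclusive (assumption; the scale has gaps).
    for upper_bound, label in (
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
    ):
        if kappa <= upper_bound:
            return label
    return "Almost Perfect"


# Agreement values reported on the slide.
labels = [landis_koch_label(k) for k in (0.63, 0.71, 0.65)]
print(labels)
```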

Page 17: What kind of classification algorithm performs better in this task? (Question 3)

[Figure: distribution of classes for each sample]

Page 18: Sentence Classification Process: training and testing phase (Question 3)

Bag-of-words model

AVM configuration:

Minimum frequency = 2
Attributes: 1- to 3-grams
Attribute value: 1 if the n-gram occurs in the sentence (present); 0 otherwise (absent)

Not applied: stopword removal and stemming
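The configuration above (binary attributes over 1- to 3-grams, minimum frequency 2, no stopword removal, no stemming) can be sketched in a few lines. Two details are assumptions, since the slides do not specify them: tokens are obtained by lowercasing and splitting on whitespace, and the minimum frequency is counted as total occurrences across the training sentences. The function names and the example sentences are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

NGram = Tuple[str, ...]


def tokenize(sentence: str) -> List[str]:
    # Assumption: lowercase + whitespace split; no stopword removal, no stemming.
    return sentence.lower().split()


def extract_ngrams(tokens: Sequence[str], max_n: int = 3) -> List[NGram]:
    """All contiguous n-grams for n = 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(
    sentences: Iterable[str], min_freq: int = 2, max_n: int = 3
) -> List[NGram]:
    """Keep n-grams that occur at least min_freq times in the training set."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(extract_ngrams(tokenize(sentence), max_n))
    return sorted(gram for gram, freq in counts.items() if freq >= min_freq)


def to_binary_vector(
    sentence: str, vocabulary: Sequence[NGram], max_n: int = 3
) -> List[int]:
    """1 if the n-gram occurs in the sentence (present); 0 otherwise (absent)."""
    present = set(extract_ngrams(tokenize(sentence), max_n))
    return [1 if gram in present else 0 for gram in vocabulary]


# Made-up training sentences for illustration.
training = ["patients improved", "patients improved quickly", "no effect"]
vocabulary = build_vocabulary(training)
vector = to_binary_vector("patients improved", vocabulary)
print(vocabulary, vector)
```

Each sentence becomes one row of 0/1 values over the retained n-grams, which is the input a classifier would receive.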

Page 19: Evaluation (Question 3)

Partitioning method: 10-fold cross-validation
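In 10-fold cross-validation the data are split into ten parts; each part serves once as the test set while the other nine are used for training, and the results are averaged. The sketch below shows only the partitioning step. It is a plain, unstratified split with an assumed function name and seed; the slides do not say whether the authors stratified by class.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(
    n_items: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if k < 2 or k > n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: 25 items split into 10 folds.
splits = list(k_fold_splits(25, k=10))
print(len(splits), [len(test) for _, test in splits])
```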

Page 20: Conclusions

The proposed environment, Information Extraction and Decision Support System in Biomedical domain, aims at being a general environment for mining relevant information in the biomedical domain.

First experiments on sentence classification, one step of the whole process, gave very good results (95.9% accuracy) for papers about Sickle Cell Anemia (SCA).

The task of sentence classification in the SCA domain is well defined and can be automated.

Page 21: Future Work

Investigate the identification of treatment and symptom information in scientific papers

Extract the relevant sentence pieces for populating our databases using IE approaches, e.g., rule-based and dictionary-based

Investigate the use of parallel processing to optimize the more time-consuming tasks, e.g., the application of data mining algorithms and analytical query processing

Other biomedical areas may also benefit from our text mining approach

Page 22: Questions?

Page 23: References

ANTHONY, L.; LASHKIA, G. V. Mover: a machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, v. 46, n. 3, p. 185-193, 2003.

FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, v. 76, n. 5, p. 378-382, 1971.

LANDIS, J. R.; KOCH, G. G. The measurement of observer agreement for categorical data. Biometrics, v. 33, n. 1, p. 159-174, 1977.

PINTO, A. C. S. et al. Technical Report "Sickle Cell Anemia". São Carlos: Department of Computer Science, Federal University of São Carlos, 2009. p. 16. Available at: <http://sca.dc.ufscar.br/download/files/report.sca.pdf>.
