Biomedical Text Mining: A survey - Harsh...

Biomedical Text Mining: A survey Harsh Thakkar Ph.D. –I 201321008 [email protected]


Page 1: Biomedical Text Mining: A survey - Harsh Thakkarharshthakkar.in/wp-content/uploads/2015/07/Biomedical...tasks. • 0.85 for i2b2 tasks concept extraction tasks 2011 (O. Uzuner, 2011)

Biomedical Text Mining: A survey

Harsh Thakkar

Ph.D. –I

201321008

[email protected]


À la carte

• What?

• Why?

• How?

• Resources & Tools

• Bibliography


What?

• Definition(s)

:O


Why?

>10 km

Source: Lars Jensen


Why?

Analysis

Interpretation

Source: Prof. Prasenjit Majumder

• And it's growing at a pace we cannot cope with! Every minute, day, year

• It's exponential!!


How?

Source: Prof. Prasenjit Majumder


Let's see!


Major Data Mining Tasks

• In biomedical data mining (BDM)

1. Discovery of new facts

2. Document summarization

3. Question Answering

• I.R. techniques are dominant

– Biomedical domain specific techniques (W. Hersh, 2005)

#W. Hersh. Information Retrieval: A Health and Biomedical Perspective. Health Informatics. Springer, third edition, 2005.


Major Data Mining Tasks

• Clustering Techniques

[Diagram: Retrieval and reduction; Classify/Mine; Knowledge]


Document Summarization

• Contextual abstraction of information from multiple texts, also known as a text reduction problem

• Information Extraction (IE)

– NER

• Most commonly used and effective technique

• In the biomedical context, entities include genes and protein interactions, diseases & treatments, drug names & dosages (U. Leser et al., 2005)

#U. Leser and J. Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369, 2005


• Why NER?

– The field is ever growing and research never stops -> huge amounts of data -> new synonyms (A. Yeh et al., 2005)

– E.g. heart attack = myocardial infarction

– Extensive synonymy makes it difficult to integrate knowledge from multiple sources (hence resources such as the UMLS Metathesaurus or the Gene Ontology)

#A. Yeh, A. Morgan, M. Colosimo, and L. Hirschman. BioCreAtIvE task 1A: Gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1):S2, 2005.


• Extensive use of domain-specific abbreviations

– E.g. R.A.

– “right atrium”, “rheumatoid arthritis”, “renal artery”, “refractory anemia”, etc. (S. Pakhomov, 2002)

#S. Pakhomov. Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 160–167, 2002.
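A minimal sketch of why context matters when expanding an ambiguous abbreviation such as R.A. It scores each candidate expansion by counting cue words in the surrounding sentence. This is a simple keyword-overlap heuristic chosen for illustration, not the maximum entropy approach of the cited paper, and the cue-word lists and the function name `expand_ra` are assumptions made up for this example.

```python
import re

# Hypothetical cue words for each expansion of "RA"; a real system would
# learn such evidence from annotated or automatically harvested clinical text.
SENSE_CUES = {
    "right atrium": {"heart", "ventricle", "echocardiogram", "cardiac"},
    "rheumatoid arthritis": {"joint", "joints", "swelling", "methotrexate"},
    "renal artery": {"kidney", "stenosis", "renal", "angiography"},
    "refractory anemia": {"hemoglobin", "marrow", "transfusion", "blood"},
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand_ra(sentence):
    """Return the expansion whose cue words overlap most with the sentence.

    Returns None when no cue word at all is present.
    """
    words = tokenize(sentence)
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(expand_ra("Patient with RA reports joint swelling, started methotrexate."))
print(expand_ra("Echocardiogram shows dilated RA and right ventricle."))
```

The heuristic fails as soon as a sentence has no cue word or mixes cues from several senses, which is one reason learned classifiers are preferred in practice.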


Named Entity Recognition (NER)

• Generally discussed as a single task, NER is a 3-step process:

1. Determine the entity substring and boundaries

2. Assign the entity to a defined class

3. Entity mapping, i.e. selecting a preferred unique ID for the selected entity (aka entity normalization)

BioCreAtIvE (L. Hirschman et al., 2005)
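The three steps on this slide (boundary detection, class assignment, entity mapping) can be sketched as three small functions. This is a toy illustration under stated assumptions: the lexicon, the class names and the `EX:` identifiers are hypothetical placeholders, not entries from UMLS or any real database, and real systems use far more robust boundary detection than a lexicon scan.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface form -> (class, preferred name, identifier).
# The identifiers are placeholders, not real database IDs.
LEXICON = {
    "heart attack": ("Disease", "myocardial infarction", "EX:0001"),
    "myocardial infarction": ("Disease", "myocardial infarction", "EX:0001"),
    "aspirin": ("Drug", "aspirin", "EX:0002"),
}


@dataclass
class Entity:
    start: int
    end: int
    text: str
    label: str = ""
    preferred: str = ""
    uid: str = ""


def find_boundaries(text):
    """Step 1: locate entity substrings and their character boundaries."""
    entities = []
    for surface in LEXICON:
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            entities.append(Entity(m.start(), m.end(), m.group()))
    return sorted(entities, key=lambda e: e.start)


def classify(entity):
    """Step 2: assign the entity to a defined class."""
    entity.label = LEXICON[entity.text.lower()][0]
    return entity


def normalize(entity):
    """Step 3: map the entity to a preferred name and unique identifier."""
    _, entity.preferred, entity.uid = LEXICON[entity.text.lower()]
    return entity


def recognize(text):
    """Run the three steps in order."""
    return [normalize(classify(e)) for e in find_boundaries(text)]


for ent in recognize("Aspirin was given after the heart attack."):
    print(ent.text, ent.label, ent.preferred, ent.uid)
```

Note how "heart attack" and "myocardial infarction" share one identifier: that shared ID is what step 3 contributes beyond plain recognition.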


• System performance of IE (NER-based systems)

• Takes into account F-score, precision, recall

• Task-based evaluations

– i2b2: provides clinical data for research purposes. Current projects: autoimmune diseases, diabetes, obesity, etc. [driving biology projects, DBPs]

– BioNLP: conducts shared tasks globally, targeted towards the following tasks


• [GE] Genia Event Extraction for NFkB knowledge base construction

• [CG] Cancer Genetics

• [PC] Pathway Curation

• [GRO] Corpus Annotation with Gene Regulation Ontology

• [GRN] Gene Regulation Network in Bacteria

• [BB] Bacteria Biotopes (semantic annotation by an ontology)

• NER systems reach F-scores of 0.83 and 0.87 on the first (A. Yeh et al., 2005) and second (L. Smith et al., 2008) BioCreAtIvE gene mention tasks.

• 0.85 for the 2010 i2b2/VA concept extraction task (O. Uzuner et al., 2011)

• 0.73 for the JNLPBA 2004 bio-entity recognition task (J. Kim et al., 2004)

• 0.57 for the BioNLP 2013 shared task on bacteria biotopes (IRISA) [Institute for Research in Computer Science and Random Systems]
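The F-scores quoted above combine precision and recall into a single number. A minimal sketch of how these are typically computed for NER follows, assuming exact-match scoring, where a predicted entity counts only if its boundaries and class both agree with the gold annotation. Individual shared tasks define their own matching rules, so the spans and the function name here are illustrative assumptions only.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match evaluation over (start, end, label) entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: character offsets plus an entity class.
gold = [(0, 7, "Drug"), (28, 40, "Disease"), (50, 55, "Gene")]
pred = [(0, 7, "Drug"), (28, 40, "Disease"), (60, 65, "Gene"), (70, 72, "Gene")]

p, r, f = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Here two of four predictions are correct (precision 0.50) and two of three gold entities are found (recall 0.67), giving an F1 of about 0.57.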


• Other approaches

– Dictionary based

• Issues: spelling mistakes, morphological variants, homonymy (M. Krauthammer et al., 2004).

• Overcome by: string matching techniques, either exact or partial (Y. Tsuruoka and J. Tsujii, 2003)

– Rule-based

• Define pattern rules (as in DNA sequence)

• E.g. EMPathIE and PASTA (K. Humphreys et al., 2000), (R. Gaizauskas et al., 2003)
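A minimal sketch of the dictionary-based approach with exact and partial matching, as mentioned on this slide. It uses `difflib.get_close_matches` from the Python standard library as a stand-in for the approximate string matching described in the cited work, so it shows the idea rather than those authors' algorithms. The dictionary contents and the 0.85 similarity cutoff are assumptions for illustration.

```python
import difflib

# Hypothetical dictionary of protein and gene names (lower-cased).
DICTIONARY = ["interleukin-2", "tumor necrosis factor", "insulin", "hemoglobin"]


def lookup(term, cutoff=0.85):
    """Return (dictionary entry, match type), or (None, None) if nothing fits.

    An exact match is tried first; otherwise the closest entry above the
    similarity cutoff is accepted as a partial match.
    """
    term = term.lower()
    if term in DICTIONARY:
        return term, "exact"
    close = difflib.get_close_matches(term, DICTIONARY, n=1, cutoff=cutoff)
    if close:
        return close[0], "partial"
    return None, None


for candidate in ["Insulin", "interleukin 2", "haemoglobin", "aspirin"]:
    print(candidate, "->", lookup(candidate))
```

Partial matching recovers spelling and punctuation variants such as "haemoglobin" or "interleukin 2", but lowering the cutoff too far starts to admit unrelated names, which is the precision and recall trade-off of this approach.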


• Classification-based

– Naïve Bayes (K. Takeuchi et al., 2005)

– SVMs (J. Kazama et al., 2002), (T. Mitsumori et al., 2005), (C. Nobata et al., 1999), (K. Yamamoto et al., 2003)

– BIO tagging scheme; individual tokens are tagged

• B- beginning of entity

• I- inside entity

• O- outside entity

• Issues: arise when entity boundaries overlap
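The BIO scheme above can be sketched as a pair of conversions between token-level entity spans and per-token tags. The example sentence, the `Protein` label and the function names are assumptions for illustration. The decoder is lenient: an `I-` tag with no preceding `B-` is treated as the start of an entity.

```python
def to_bio(tokens, entities):
    """Encode non-overlapping (start, end_exclusive, label) spans as BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def from_bio(tags):
    """Recover (start, end_exclusive, label) spans from a BIO tag sequence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    return spans


tokens = ["The", "tumor", "necrosis", "factor", "level", "rose"]
tags = to_bio(tokens, [(1, 4, "Protein")])
print(list(zip(tokens, tags)))
print(from_bio(tags))
```

Because each token carries exactly one tag, two entities that share a token cannot both be encoded: calling `to_bio` with overlapping spans simply overwrites the earlier tags. That is the overlap issue noted on the slide.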


• Question Answering:

• Unlike general Q&A:

– Domain-specific querying resulting in crisp and precise answers

• Different from other systems as:

– Limited scope of questions

– Crisp knowledge

• Currently drawing the attention of researchers (H. Yu et al., 2005) (S. Athenikos et al., 2010)


Resources

• MEDLINE

– One of the most important resources for mining in the biomedical domain

– Collection of bibliographic material on biomedicine from 1946 to 2013 and onwards

– Searchable through PubMed

– www.ncbi.nlm.nih.gov/pubmed


• Source: http://www.ncbi.nlm.nih.gov/pubmed/?term=lactobacilus
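For text mining, a search like the one shown here is usually issued programmatically rather than through the web page. The sketch below only builds a query URL for the NCBI E-utilities `esearch` endpoint, which the slides do not mention and which is brought in here as an assumption about how one might automate the search. No request is sent, and the helper name `pubmed_search_url` is made up for this example.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption: not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an esearch URL that asks PubMed for up to `retmax` matching IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


print(pubmed_search_url("lactobacillus", retmax=5))
```

Fetching the URL returns a list of PubMed identifiers, which can then be used to download the abstracts that feed an extraction pipeline.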


• OHSUMED

– Consists of 348,566 references (out of a total of over 7 million), covering all references from 270 medical journals over a five-year period (1987-1991)#

– Plus more data from the TREC Genomics track, covering 1994-2003##

– A clinically oriented subset of MEDLINE, cross-referenced with PubMed

#TREC-9 filtering track collections. http://trec.nist.gov/data/t9_filtering.html

##TREC genomics track data. http://ir.ohsu.edu/genomics/data.html.


Tools

• BioCreAtIvE

– L. Hirschman, M. Colosimo, A. Morgan, and A. Yeh. Overview of BioCreAtIvE task 1B: Normalized gene lists. BMC Bioinformatics, 6(Suppl 1):S11, 2005.

• MetaMap (MMTx) – NLM

– http://mmtx.nlm.nih.gov/

• NegEx, ConText – University of Pittsburgh – BLULab

– http://www.dbmi.pitt.edu/blulab/index.html

• cTAKES – Mayo Clinic

– https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/OHNLP_Documentation_and_Downloads


Bibliography

• U. Leser and J. Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369, 2005.

• L. Smith, L. Tanabe, R. Johnson nee Ando, C.-J. Kuo, I.-F. Chung, C.-N. Hsu, Y.-S. Lin, R. Klinger, C. Friedrich, K. Ganchev, M. Torii, H. Liu, B. Haddow, C. Struble, R. Povinelli, A. Vlachos, W. Baumgartner, L. Hunter, B. Carpenter, R. Tzong-Han Tsai, H.-J. Dai, F. Liu, Y. Chen, C. Sun, S. Katrenko, P. Adriaans, C. Blaschke, R. Torres, M. Neves, P. Nakov, A. Divoli, M. Maña-López, J. Mata, and W. Wilbur. Overview of BioCreAtIvE II: Gene mention recognition. Genome Biology, 9(Suppl 2):S2, 2008.

• O. Uzuner, B. R. South, S. Shen, and S. L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.

• J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 70–75, 2004.


• M. Krauthammer and G. Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37(6):512–526, 2004.

• Y. Tsuruoka and J. Tsujii. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine - Volume 13, pages 41–48, 2003.

• Y. Tsuruoka and J. Tsujii. Probabilistic term variant generator for biomedical terms. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 167–173, 2003.

• K. Humphreys, G. Demetriou, and R. Gaizauskas. Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures. In Pacific Symposium on Biocomputing, pages 502–513, 2000.

• R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett. Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics, 19(1):135–143, 2003.


• J. Kazama, T. Makino, Y. Ohta, and J. Tsujii. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain - Volume 3, pages 1–8, 2002.

• K. Yamamoto, T. Kudo, A. Konagaya, and Y. Matsumoto. Protein name tagging for biomedical annotation in text. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine - Volume 13, pages 65–72, 2003.

• H. Yu, C. Sable, and H. Zhu. Classifying medical questions based on an evidence taxonomy. In Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains, 2005.


Thank You