Post on 25-Feb-2021

Biomedical Text Mining: A survey

Harsh Thakkar

Ph.D. –I

201321008

harshbionlp@gmail.com

À la carte

• What?

• Why?

• How?

• Resources & Tools

• Bibliography

What?

• Definition(s)!

:O

Why?

>10 km

Source: Lars Jensen

Why?

Analysis

Interpretation

Source: Prof. Prasenjit Majumder

• And it's growing at a pace we cannot cope with! Every minute, day, year

• It's exponential!

How?

Source: Prof. Prasenjit Majumder

Let's see!

Major Data Mining Tasks

• In BDM

1. Discovery of new facts

2. Document summarization

3. Question Answering

• I.R. techniques are dominant

– Biomedical domain specific techniques (W. Hersh, 2005)

#W. Hersh. Information Retrieval: A Health and Biomedical Perspective. Health Informatics. Springer, third edition, 2005.

• Clustering Techniques

Major Data Mining Tasks

Retrieval and reduction Classify/Mine Knowledge

Document Summarization

• Contextual abstraction of information from multiple texts, also known as a text reduction problem

• Information Extraction (IE)

– NER

• Most commonly used and effective technique

• In the biomedical context, entities such as genes and protein interactions, diseases & treatments, drug names & dosages (U. Leser and J. Hakenberg, 2005)

#U. Leser and J. Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369, 2005.

• Why NER?

– The field keeps growing and research never stops -> ever larger amounts of data -> new synonyms (A. Yeh et al., 2005)

– E.g. heart attack = myocardial infarction

– Widespread synonymy makes it difficult to integrate knowledge from multiple sources (hence resources such as the UMLS Metathesaurus or the Gene Ontology)

A. Yeh, A. Morgan, M. Colosimo, and L. Hirschman. BioCreAtIvE task 1A: Gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1):S2, 2005.

• Extensive use of domain-specific abbreviations

– E.g. R.A.

– “right atrium”, “rheumatoid arthritis”, “renal artery”, “refractory anemia”, etc. (S. Pakhomov, 2002)

#S. Pakhomov. Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical texts. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 160–167, 2002.

Named Entity Recognition (NER)

• Generally discussed as a single task, NER is a three-step process:

1. Determining the entity substring and its boundaries

2. Assigning the entity to a defined class

3. Entity mapping, i.e. selecting a preferred unique ID for the selected entity (aka entity normalization)
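The three steps can be sketched with a toy dictionary-based tagger. This is only a minimal illustration, not the method of any system cited in these slides: the lexicon entries, the class labels and the identifiers (`TOY:0001`, `TOY:0002`) are made up for the example and are not real UMLS or Gene Ontology IDs.

```python
import re

# Hypothetical toy lexicon: surface form -> (entity class, preferred ID).
# The IDs are placeholders, not real UMLS or Gene Ontology identifiers.
LEXICON = {
    "heart attack": ("Disease", "TOY:0001"),
    "myocardial infarction": ("Disease", "TOY:0001"),
    "aspirin": ("Drug", "TOY:0002"),
}

def recognize(text):
    """Return sorted (start, end, surface, entity_class, preferred_id) tuples."""
    results = []
    lowered = text.lower()
    for surface, (entity_class, preferred_id) in LEXICON.items():
        # Step 1: determine the entity substring and its boundaries.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Step 2: assign the class. Step 3: map to a preferred ID.
            results.append((match.start(), match.end(),
                            text[match.start():match.end()],
                            entity_class, preferred_id))
    return sorted(results)

mentions = recognize("Aspirin after a heart attack; prior myocardial infarction.")
for mention in mentions:
    print(mention)
```

Both synonyms map to the same placeholder ID, which is the point of the normalization step: different surface forms end up as one concept.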

BioCreAtIvE ® (L. Hirschman et al., 2005)

• System performance of IE (NER based systems)

• Takes into account F-score, precision, recall

• Task-based evaluations

– i2b2: task providing clinical data for research purposes. Current projects: autoimmune diseases, diabetes, obesity, etc. [Driving Biology Projects, DBPs]

– BioNLP: conducts shared tasks globally, targeting the following:

• [GE] Genia Event Extraction for NFkB knowledge base construction

• [CG] Cancer Genetics

• [PC] Pathway Curation

• [GRO] Corpus Annotation with Gene Regulation Ontology

• [GRN] Gene Regulation Network in Bacteria

• [BB] Bacteria Biotopes (semantic annotation by an ontology)

• NER systems reached F-scores of 0.83 and 0.87 in the first (A. Yeh et al., 2005) and second (L. Smith et al., 2008) BioCreAtIvE gene mention tasks.

• 0.85 for the 2010 i2b2/VA concept extraction task (O. Uzuner et al., 2011)

• 0.73 for the JNLPBA bio-entity recognition task 2004 (J. Kim et al., 2004)

• 0.57 for the BioNLP 2013 shared task Bacteria Biotopes (IRISA) [Institute for Research in Computer Science and Random Systems]
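For reference, the F-score is the harmonic mean of precision P and recall R, F = 2PR / (P + R). The sketch below computes all three under an exact-match criterion. That criterion and the example mentions are assumptions for illustration; each shared task defines its own matching rules.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity mentions against gold mentions (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: mentions as (start, end, class) triples.
gold = [(0, 4, "Gene"), (10, 15, "Gene"), (20, 27, "Protein"), (30, 35, "Gene")]
predicted = [(0, 4, "Gene"), (10, 15, "Gene"), (20, 26, "Protein")]

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Note that the third prediction misses the gold boundary by one character and therefore counts as both a false positive and a false negative under exact matching.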

• Other approaches

– Dictionary based

• Issues: spelling mistakes, morphological variants, homonymy (M. Krauthammer et al., 2004).

• Overcome by: string matching techniques, either exact or partial (Y. Tsuruoka and J. Tsujii, 2003)

– Rule-based

• Define pattern rules (as in DNA sequence)

• E.g. EMPathIE and PASTA (K. Humphreys et al., 2000), (R. Gaizauskas et al., 2003)
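Exact versus partial dictionary matching can be illustrated with Python's standard `difflib` module. This is a minimal sketch, not the technique of the cited papers: the mini-dictionary and the similarity cutoff of 0.85 are arbitrary assumptions.

```python
import difflib

# Hypothetical mini-dictionary of canonical terms.
DICTIONARY = ["lactobacillus", "protein kinase", "rheumatoid arthritis"]

def lookup(term, cutoff=0.85):
    """Return the closest dictionary entry, or None if nothing is similar enough."""
    candidate = term.lower().strip()
    if candidate in DICTIONARY:
        # Exact match.
        return candidate
    # Partial (approximate) match, tolerant of small spelling mistakes.
    close = difflib.get_close_matches(candidate, DICTIONARY, n=1, cutoff=cutoff)
    return close[0] if close else None

exact = lookup("Protein kinase")
misspelled = lookup("lactobacilus")
unknown = lookup("hemoglobin")
print(exact, misspelled, unknown)
```

A lower cutoff catches more spelling variants but also raises the risk of matching unrelated terms, which is where the homonymy issue shows up.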

• Classification based

– Naïve Bayes (K. Takeuchi et al., 2005)

– SVMs (J. Kazama et al., 2002), (T. Mitsumori et al., 2005), (C. Nobata et al., 1999), (K. Yamamoto et al., 2003)

– BIO tagging scheme; individual tokens are tagged

• B- beginning of entity

• I- inside entity

• O- outside entity

• Issue: entities whose boundaries overlap
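A short sketch of the BIO scheme, assuming entities are given as token-index spans. The example tokens, the spans and the class names are illustrative only. Because each token carries exactly one tag, overlapping entities cannot be encoded, so the function rejects them.

```python
def to_bio(tokens, entities):
    """Tag tokens with B-/I-/O labels.

    entities: list of (first_token_index, end_token_index_exclusive, class).
    Overlapping entities cannot be represented with one tag per token,
    so this sketch raises an error for them.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_class in entities:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("overlapping entity boundaries")
        tags[start] = "B-" + entity_class
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_class
    return tags

tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
tags = to_bio(tokens, [(0, 2, "DNA"), (4, 6, "Protein")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A classifier such as an SVM is then trained to predict one of these tags for every token.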

• Question Answering:

• Unlike general Q&A,

– Domain-specific querying resulting in crisp and precise answers

• Different from other systems as

– Limited scope of questions

– Crisp knowledge

• Currently drawing the attention of researchers (H. Yu et al., 2005) (S. Athenikos et al., 2010)

Resources

• MEDLINE

– One of the most important resources in biomedical domain for mining

– Collection of bibliographic material on biomedicine from 1946 to 2013 and onwards

– Searched through PubMed

– www.ncbi.nlm.nih.gov/pubmed

• Source: http://www.ncbi.nlm.nih.gov/pubmed/?term=lactobacilus

• OHSUMED

– 348,566 references (out of a total of over 7 million), covering all references from 270 medical journals over a five-year period (1987-1991)#

– Plus more data from the TREC Genomics track, 1994-2003##

– Cross-references PubMed; a clinically oriented subset of MEDLINE

#TREC-9 filtering track collections. http://trec.nist.gov/data/t9_filtering.html

##TREC genomics track data. http://ir.ohsu.edu/genomics/data.html

Tools

• BioCreAtIvE ® – L. Hirschman, M. Colosimo, A. Morgan, and A. Yeh. Overview of BioCreAtIvE task 1B: Normalized gene lists. BMC Bioinformatics, 6(Suppl 1):S11, 2005.

• MetaMap (MMTx) - NLM

– http://mmtx.nlm.nih.gov/

• NegEx, ConText – University of Pittsburgh – BluLab

– http://www.dbmi.pitt.edu/blulab/index.html

• cTAKES – Mayo Clinic

– https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/OHNLP_Documentation_and_Downloads

Bibliography

• U. Leser and J. Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369, 2005.

• L. Smith, L. Tanabe, R. Johnson nee Ando, C.-J. Kuo, I.-F. Chung, C.-N. Hsu, Y.-S. Lin, R. Klinger, C. Friedrich, K. Ganchev, M. Torii, H. Liu, B. Haddow, C. Struble, R. Povinelli, A. Vlachos, W. Baumgartner, L. Hunter, B. Carpenter, R. Tzong-Han Tsai, H.-J. Dai, F. Liu, Y. Chen, C. Sun, S. Katrenko, P. Adriaans, C. Blaschke, R. Torres, M. Neves, P. Nakov, A. Divoli, M. Maña-López, J. Mata, and W. Wilbur. Overview of BioCreAtIvE II: Gene mention recognition. Genome Biology, 9(Suppl 2):S2, 2008.

• O. Uzuner, B. R. South, S. Shen, and S. L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.

• J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 70–75, 2004.

• M. Krauthammer and G. Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37(6):512–526, 2004.

• Y. Tsuruoka and J. Tsujii. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine - Volume 13, pages 41–48, 2003.

• Y. Tsuruoka and J. Tsujii. Probabilistic term variant generator for biomedical terms. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 167–173, 2003.

• K. Humphreys, G. Demetriou, and R. Gaizauskas. Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures. In Pacific Symposium on Biocomputing, pages 502–513, 2000.

• R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett. Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics, 19(1):135–143, 2003.

• J. Kazama, T. Makino, Y. Ohta, and J. Tsujii. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain - Volume 3, pages 1–8, 2002.

• K. Yamamoto, T. Kudo, A. Konagaya, and Y. Matsumoto. Protein name tagging for biomedical annotation in text. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine - Volume 13, pages 65–72, 2003.

• H. Yu, C. Sable, and H. Zhu. Classifying medical questions based on an evidence taxonomy. In Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains, 2005.

Thank You