The Fudan-UIUC participation in the BioASQ Challenge Task2a: The Antinomyra System Ke Liu 1, Junqiu...

The Fudan-UIUC participation in the BioASQ Challenge Task2a: The Antinomyra System Ke Liu 1 , Junqiu Wu 2 , Shengwen Peng 1 ,Chengxiang Zhai 3 , Shanfeng Zhu 1 1 Fudan University 2 Central South University, 3 University of Illinois at Urbana- Champaign [email protected]


Page 1:

The Fudan-UIUC participation in the BioASQ Challenge Task2a: The Antinomyra System

Ke Liu 1, Junqiu Wu 2, Shengwen Peng 1, Chengxiang Zhai 3, Shanfeng Zhu 1

1 Fudan University, 2 Central South University,

3 University of Illinois at Urbana-Champaign

[email protected]

Page 2:

Outline

Introduction

Related Work

Our Methods

Experimental Results

Conclusion

Page 3:

Introduction

MeSH Terms

Each year, around 0.8 million biomedical documents are added to MEDLINE.

Page 4:

MeSH is Important

Indexing all documents in MEDLINE

Indexing many books and collections at NLM

Improving retrieval performance through query expansion using MeSH

Improving clustering performance by integrating MeSH information [Zhu et al. 2009 IP&M] [Zhu et al. 2009 Bioinformatics] [Gu et al. 2013 IEEE TSMCB]

Improving biomedical text mining performance

Page 5:

Automatic MeSH Annotation is a challenging problem

More than 26,000 MeSH headings organized in hierarchical structure

Quickly approaching 1,000,000 articles indexed per year

~$9.40 to index an article

Page 6:

The number of distinct MeSH headings is large (almost 27,000)

MeSH heading frequencies in MEDLINE vary widely

The number of MeSH terms per document varies widely

Page 7:

BioASQ (Large Scale Biomedical Semantic Indexing Competition)

Batch 3, week 1: 4342 docs

Batch 3, week 2: 8840 docs

Batch 3, week 3: 3702 docs

Batch 3, week 4: 4726 docs

Batch 3, week 5: 4533 docs

Page 8:

Label Based Micro F1-measure (MiF)

L represents the label set and |L| the number of labels. Because the counts are pooled over all labels, frequent labels are weighted more heavily in the evaluation.
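The formula on this slide is not preserved in the transcript. As a reminder of what the measure computes, here is a minimal sketch of the standard micro-averaged precision, recall, and F1, assuming gold and predicted labels are given as one set per document. The function name `micro_prf` and the toy labels are illustrative, not taken from the slides.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]],
              predicted: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Return (MiP, MiR, MiF) with counts pooled over all labels and documents."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # predicted labels that are in the gold set
        fp += len(p - g)  # predicted labels that are not in the gold set
        fn += len(g - p)  # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (hypothetical annotations for two documents).
gold = [{"Humans", "Neoplasms"}, {"Humans", "Mice"}]
pred = [{"Humans", "Neoplasms", "Animals"}, {"Humans"}]
mip, mir, mif = micro_prf(gold, pred)
print(mip, mir, mif)
```

Because true positives, false positives, and false negatives are summed before dividing, a label that occurs in many documents contributes many counts, which is why frequent labels dominate the score.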

Page 9:

Batch 3, week 5: 4533 docs

Fudan University

NLM Current Solution

We achieved around a 10% improvement over the current NLM MTI solution (results of June 2014)

Page 10:

Outline

Introduction

Related Work

Our Methods

Experimental Results

Conclusion

Page 11:

NLM approach: MTI

Two sources:

MetaMap Indexing

Maps UMLS concepts, restricting them to MeSH

PubMed Related Citations

• Reference: http://ii.nlm.nih.gov/MTI/history.shtml

Advanced machine learning algorithms are not utilized

Page 12:

MetaLabeler (Tsoumakas et al. 2013)

First, for each MeSH heading, a binary classification model was trained using a linear SVM.

Second, a regression model was trained to predict the number of MeSH headings for each citation.

Finally, given a target citation, the MeSH headings were ranked by the SVM prediction score of each classifier, and the top K were returned as the suggested MeSH headings, where K is the number of headings predicted by the regression model.

Problem: Only global information is used.

The scores from different classifiers are not comparable.
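The final MetaLabeler step can be sketched as follows. This is a minimal illustration that assumes the per-heading classifier scores and the regression output are already available. The function name `metalabeler_select`, the rounding of K, and the toy scores are assumptions for illustration, not details from Tsoumakas et al.

```python
from typing import Dict, List


def metalabeler_select(scores: Dict[str, float], predicted_k: float) -> List[str]:
    """Rank headings by classifier score and keep the top K.

    scores: raw prediction score per MeSH heading (one binary classifier each).
    predicted_k: output of the regression model for this citation.
    """
    k = max(1, round(predicted_k))  # assumption: round to an integer, keep at least one
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical scores for one citation.
scores = {"Humans": 1.8, "Mice": -0.2, "Neoplasms": 0.9, "Animals": 0.1}
selected = metalabeler_select(scores, predicted_k=2.6)
print(selected)
```

The sketch also makes the stated problem visible: the ranking compares raw scores produced by independently trained classifiers, so nothing guarantees that 0.9 from one model means the same as 0.9 from another.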

Page 13:

NCBI’s learning to rank (LTR) (Huang et al., 2011; Mao et al., 2013)

Each citation was treated as a query and each MeSH heading as a document.

An LTR method was used to rank candidate MeSH headings with respect to the target citation.

The candidate MeSH headings came from similar citations (nearest neighbors).

Problem: Only local information is used.

Similar citations might be rare.

Page 14:

Outline

Introduction

Related Work

Our Methods

Experimental Results

Conclusion

Page 15:

Our solution: Learning to Rank (LTR) Framework

[Figure: pipeline from the target document to a ranked list of main headings (MH). Components shown: PRA, Logistic Regression, initial list, features, LambdaMART ranking model, ranked list, evaluation.]

* Retrieve similar documents

* Obtain an initial list of main headings

* Generate features of main headings

* Rank the main headings

Page 16:

Main idea: integrate various kinds of information (features) in the Learning to Rank (LTR) framework

Given a target document, for each candidate MeSH, we get prediction scores from several sources:

(1) Logistic Regression (global information)

(2) KNN (local information)

(3) Pattern matching

(4) MTI result (KNN + pattern + rule)
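As a rough illustration of how the four sources can be combined, the sketch below assembles one feature vector per candidate MeSH heading. The function name `build_features`, the feature order, and the default of 0.0 for missing scores are assumptions for illustration. The slides do not specify the exact feature layout, and the ranking model itself (LambdaMART) is not shown here.

```python
from typing import Dict, List, Set


def build_features(candidates: List[str],
                   lr_scores: Dict[str, float],
                   knn_scores: Dict[str, float],
                   pattern_hits: Set[str],
                   mti_headings: Set[str]) -> Dict[str, List[float]]:
    """Return one feature vector per candidate heading for a single target document."""
    features: Dict[str, List[float]] = {}
    for mh in candidates:
        features[mh] = [
            lr_scores.get(mh, 0.0),              # (1) global: logistic regression score
            knn_scores.get(mh, 0.0),             # (2) local: score from similar citations
            1.0 if mh in pattern_hits else 0.0,  # (3) heading or entry term found in text
            1.0 if mh in mti_headings else 0.0,  # (4) heading suggested by MTI
        ]
    return features


# Hypothetical inputs for one target document.
feats = build_features(
    candidates=["Humans", "Mice"],
    lr_scores={"Humans": 0.92, "Mice": 0.31},
    knn_scores={"Humans": 0.75},
    pattern_hits={"Mice"},
    mti_headings={"Humans"},
)
print(feats)
```

A learning-to-rank model would then be trained on such vectors, with the gold annotations of each citation serving as relevance labels.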

Page 17:

Logistic Regression

Train a binary logistic regression model for each label, giving more than 25,000 binary models in total.

Question: The prediction scores come from different classifiers. How can these scores be made comparable?

Page 18:

Key idea: The whole of MEDLINE gives us a huge validation set. Use the precision at prediction score K as the normalized score.

[Liu et al., In preparation]
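The slides give only the key idea, so the following is one possible reading rather than the authors' implementation: for each label's classifier, a raw score is replaced by the precision observed on validation documents that scored at least as high. The function names, the tie handling, and the fallback for scores above every validation score are assumptions.

```python
from bisect import bisect_left
from typing import List, Tuple


def fit_precision_curve(validation: List[Tuple[float, bool]]) -> Tuple[List[float], List[float]]:
    """validation: (raw score, is the label correct) pairs for ONE label's classifier.

    Returns ascending thresholds and, for each, the precision among
    validation items whose raw score is >= that threshold.
    """
    ordered = sorted(validation, key=lambda pair: pair[0], reverse=True)
    thresholds: List[float] = []
    precisions: List[float] = []
    positives = 0
    for rank, (score, relevant) in enumerate(ordered, start=1):
        positives += relevant
        # Record a point only at the last item of a group of tied scores.
        if rank == len(ordered) or ordered[rank][0] != score:
            thresholds.append(score)
            precisions.append(positives / rank)
    thresholds.reverse()
    precisions.reverse()
    return thresholds, precisions


def normalize_score(score: float, thresholds: List[float], precisions: List[float]) -> float:
    """Map a raw score to the validation precision at that score."""
    if not thresholds:
        return 0.0
    idx = bisect_left(thresholds, score)  # first threshold >= score
    idx = min(idx, len(thresholds) - 1)   # above all validation scores: use the top point
    return precisions[idx]


# Hypothetical validation data for a single label.
validation = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.1, False)]
thresholds, precisions = fit_precision_curve(validation)
print(normalize_score(0.5, thresholds, precisions))
```

Since every label's scores are mapped onto the same scale (an observed precision between 0 and 1), the normalized values can be compared across classifiers.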

Page 19:

Performance comparison for LR between default prediction scores and our normalized scores.

Method              MiP     MiR     MiF
Default scores      0.5576  0.5614  0.5595
Normalized scores   0.5734  0.5774  0.5754

[Liu et al., In preparation]

Page 20:

KNN

Given a target citation, we used NCBI efetch to find its similar (neighbor) citations.

For a candidate MeSH, we compute a score from the neighbors to represent its confidence.

Specifically, the score is computed from the top 25 documents most similar to the target citation, using a formula in which Si is the score of a document in the top 25 and Sk is the score of a document that is in the top 25 and is also annotated with the candidate MeSH.
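The formula itself is not preserved in the transcript. A natural reading of the Si and Sk definitions is the ratio of the summed scores of neighbors annotated with the candidate to the summed scores of all top-25 neighbors. The sketch below implements that reading as an assumption, not as the confirmed formula. The function name `knn_score` and the toy neighbors are illustrative.

```python
from typing import List, Set, Tuple


def knn_score(neighbors: List[Tuple[float, Set[str]]],
              candidate: str,
              top_n: int = 25) -> float:
    """neighbors: (similarity score, MeSH headings) for each similar citation.

    Returns sum(Sk) / sum(Si) over the top_n most similar citations
    (assumed form of the formula).
    """
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:top_n]
    total = sum(score for score, _ in top)  # sum of Si
    if total == 0:
        return 0.0
    matched = sum(score for score, headings in top if candidate in headings)  # sum of Sk
    return matched / total


# Hypothetical neighbors of one target citation.
neighbors = [
    (0.9, {"Humans", "Mice"}),
    (0.6, {"Humans"}),
    (0.5, {"Rats"}),
]
print(knn_score(neighbors, "Humans"))
```

Under this reading, a heading carried by the closest neighbors scores higher than one carried only by distant neighbors, even when both appear in the same number of documents.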

Page 21:

Pattern matching

Use direct string pattern matching to find each MeSH term, its synonyms, and its entry terms in the text

MTI

Whether the candidate MeSH appears in the default results of MTI
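The direct string matching feature described on this slide could look roughly like the sketch below. It assumes a lexicon that maps each heading to its synonyms and entry terms. The case-insensitive, word-boundary matching and the tiny lexicon are illustrative choices, not details given in the slides.

```python
import re
from typing import Dict, List, Set


def match_headings(text: str, lexicon: Dict[str, List[str]]) -> Set[str]:
    """Return headings whose name, synonym, or entry term occurs in the text."""
    found: Set[str] = set()
    lowered = text.lower()
    for heading, variants in lexicon.items():
        for term in [heading, *variants]:
            # Whole-word, case-insensitive match (an assumption of this sketch).
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            if re.search(pattern, lowered):
                found.add(heading)
                break  # one hit is enough for this heading
    return found


# Illustrative lexicon; a real one would be built from the MeSH vocabulary files.
lexicon = {
    "Neoplasms": ["tumors", "cancer"],
    "Mice": ["mouse"],
    "Liver": [],
}
hits = match_headings("Cancer progression in a mouse model.", lexicon)
print(sorted(hits))
```

Each hit can then be exposed to the ranking model as a binary feature, in the same way as the MTI membership flag.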

Page 22:

The number of MeSH Labels

A Support Vector Regression model predicts the number of labels from a number of features, such as:

Journal information

The number of labels in the nearest neighbors

The number of labels predicted by MTI

The number of labels predicted by MetaLabeler

Page 23:

Outline

Introduction

Related Work

Our Methods

Experimental Results

Conclusion

Page 24:

Evaluation & Experiment

Server

◦ 4 × Intel Xeon E5-4650 2.7 GHz CPUs

◦ 128 GB RAM

Training the LR classifiers took 5 days.

All other training tasks took 1 day.

Annotating 10,000 citations took 2 hours.

Page 25:

Evaluation & Experiment

Page 26:

Outline

Introduction

Related Work

Our Methods

Experimental Results

Conclusion

Page 27:

Conclusion & Future Work

The superior performance of our method comes from integrating all kinds of information in the LTR framework: MTI, KNN, LR, and direct matching.

The large volume of MEDLINE data makes prediction score normalization possible, which improves performance (for LR alone, MiF rises from 0.5595 to 0.5754).

More information could be used, such as full text and indexing rules.

How can the gap between a good competition system and real applications be minimized?

Page 28:

Acknowledgement

Dr. Hongning Wang, UIUC

Mr. Mingjie Qian, UIUC

Mr. Jieyao Deng, Fudan

Mr. Tianyi Peng, Tsinghua