
An Adaptive Early Stopping Strategy for Query-based Passage Re-ranking

Chengxuan Ying

[email protected]
Dalian University of Technology

Dalian, China

Chen Huo
[email protected]

WeiXin Group, Tencent Inc.
Guangzhou, China

Abstract

Recent research on the semantic retrieval task shows that pretrained language models like BERT deliver impressive re-ranking performance [4]. In the re-ranking process, the fine-tuned language model is fed with (query, document) pairs, and the overall time complexity is directly proportional to both the number of queries and the recall set size. In this paper, we describe a simple yet effective early stopping strategy based on the confidence score. In our experiments, this strategy avoids up to 30% of unnecessary inference computation without sacrificing much ranking precision. The code and documentation are available at https://github.com/chengsyuan/WSDM-Adhoc-Document-Retrieval. Our team dlutycx ranked first on the unleak track.

Keywords: Passage Re-ranking, Semantic Retrieval, WSDM Cup 2020

ACM Reference Format:
Chengxuan Ying and Chen Huo. 2020. An Adaptive Early Stopping Strategy for Query-based Passage Re-ranking. In Proceedings of ACM Conference (Conference’17). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/1122445.1122456

1 Introduction

Modern search engines retrieve web documents mainly by matching keywords in documents with those in search queries. Though such matching is fast, it can be inaccurate in the presence of academic sentence rewriting (some researchers rewrite sentences to reduce the duplicate-checking rate). In this situation, a concept can be expressed in different vocabularies, and hence the lexical similarity can be fairly low.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference’17, July 2017, Washington, DC, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/1122445.1122456

To achieve better search results, the semantic information of the query and the document is considered in the re-ranking process. The DSSM model [1] was the first to use a multi-layer neural network to map high-dimensional text features into low-dimensional dense features. This method significantly outperforms traditional lexical ranking models, e.g., TF-IDF and BM25. To further capture semantic information, a convolutional neural network architecture was introduced to represent the query and the document [5]; their model demonstrates strong performance gains on the Twitter re-ranking task. More recently, transformer-based models such as BERT have achieved even more impressive results on the passage re-ranking task MS MARCO [4].

Though these models show impressive re-ranking performance gains, in production or in a time-limited competition it is only practical to re-rank the documents at the top-k positions of the recall set. In this paper, we propose a simple yet effective early stopping strategy for the re-ranking process.

2 Pipeline

Our solution for the ad-hoc document retrieval task consists of three main stages:

• Cleaning: documents with missing data are removed, and text unrelated to the task is deleted.

• Recalling: a recall set for a given query is retrieved from the whole candidate document database in an unsupervised manner, such as by BM25 or document embedding similarity.

• Re-ranking: each of these documents is scored and re-ranked by a more computationally intensive method.

2.1 Cleaning

In the cleaning step, we first remove the missing data. Then we clean text that is not directly related to the topic. Concretely speaking, we remove each sentence that does not contain a citation marker ("**##**").
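Under one reading of this rule, the sentence filter can be sketched as follows; the helper name and the sample sentences are illustrative, not the authors' code:

```python
def clean_document(sentences, marker="**##**"):
    """Keep only sentences carrying the citation marker (illustrative rule)."""
    return [s for s in sentences if marker in s]

kept = clean_document(["We use BERT **##** for ranking.", "The weather is nice."])
# → ["We use BERT **##** for ranking."]
```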

2.2 Recalling

In the recall step, we use Okapi BM25 [2] to measure the lexical similarity between the query and each document. The formula is as follows:


\[
RSV_d = \sum_{t \in q} \log\!\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\left((1 - b) + b \times (L_d / L_{ave})\right) + \mathrm{tf}_{td}}
\tag{1}
\]

where k1 controls term-frequency saturation and b controls the length normalization that suppresses long documents. After several experiments on the validation set, we set k1 = 2 and b = 0.75.
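The scoring in Equation (1) can be sketched in Python as follows; the toy corpus statistics (`df`, `N`, `avg_len`) and the tokenized inputs are illustrative assumptions, not the competition data:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avg_len, k1=2.0, b=0.75):
    """Score one document against a query with Okapi BM25 (Eq. 1)."""
    tf = Counter(doc_terms)
    L_d = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in df:          # unseen terms contribute nothing
            continue
        idf = math.log(N / df[t])
        norm = k1 * ((1 - b) + b * (L_d / avg_len)) + tf[t]
        score += idf * (k1 + 1) * tf[t] / norm
    return score

# Toy document frequencies over a hypothetical collection of N = 1000 docs.
df = {"bert": 50, "ranking": 200}
s = bm25_score(["bert", "ranking"], ["bert", "model", "ranking", "bert"],
               df, N=1000, avg_len=4.0)
```

With a document of average length, the normalization term reduces to k1 + tf, so the two query terms contribute log(20)·3·2/4 and log(5)·3·1/3 respectively.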

2.3 Re-ranking

In the re-ranking step, we use the pretrained BioBERT [3] to get the similarity score [4]. The model details are illustrated in Figure 1.

Figure 1. We truncate the query to at most 64 tokens. We also truncate the passage text such that the concatenation of the query, passage, and separator tokens has a maximum length of 512 tokens. Then we use the [CLS] vector as input to a single-layer neural network to obtain the probability of the passage being relevant.
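The truncation scheme described in Figure 1 can be sketched as follows; `build_bert_input` and the plain token-list representation are hypothetical simplifications (a real pipeline would use a WordPiece tokenizer):

```python
def build_bert_input(query_tokens, passage_tokens,
                     max_query_len=64, max_seq_len=512):
    """Assemble [CLS] query [SEP] passage [SEP], truncating as in Figure 1."""
    q = query_tokens[:max_query_len]
    # Reserve room for [CLS] and the two [SEP] separators.
    budget = max_seq_len - len(q) - 3
    p = passage_tokens[:budget]
    return ["[CLS]"] + q + ["[SEP]"] + p + ["[SEP]"]

tokens = build_bert_input(["q"] * 100, ["p"] * 600)
# len(tokens) == 512; the query is cut to 64 tokens.
```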

Then, a binary cross-entropy loss is adopted to fine-tune the BioBERT model:

\[
L = -\sum_{j \in J_{\mathrm{pos}}} \log(s_j) - \sum_{j \in J_{\mathrm{neg}}} \log(1 - s_j)
\tag{2}
\]

where 𝐽pos is the set of indexes of the relevant passages and 𝐽neg is the set of indexes of the non-relevant passages among the top-20 documents retrieved with BM25. To balance the positive-negative ratio, we over-sample the positive documents 19x.
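A minimal sketch of the loss in Equation (2) with the 19x over-sampling; the scores below are made-up sigmoid outputs, not model predictions:

```python
import math

def bce_loss(pos_scores, neg_scores, eps=1e-12):
    """Binary cross-entropy over relevant and non-relevant passages (Eq. 2)."""
    loss = -sum(math.log(s + eps) for s in pos_scores)
    loss -= sum(math.log(1 - s + eps) for s in neg_scores)
    return loss

# One relevant passage over-sampled 19x to balance the 19 BM25 negatives.
pos = [0.8] * 19   # score of the positive pair, repeated by over-sampling
neg = [0.3] * 19   # scores of the non-relevant top-20 documents
loss = bce_loss(pos, neg)
```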

After fine-tuning the BioBERT model, we use it as a fixed scorer f(q, d) during re-ranking inference. In the following algorithm, we describe the regular re-ranking strategy, which is widely used:

As shown in Algorithm 1, the regular re-ranking strategy simply iterates over every document in the recall set. As we can observe in Figure 2, the truth documents are not uniformly distributed; they aggregate at the top positions. To solve this problem, we design an early stopping strategy: as shown in Algorithm 2, when the re-ranker (the fine-tuned BERT model) shows high confidence, we can assume the current document is the most relevant one.

Algorithm 1 Regular Re-ranking Strategy
Input:
    A scorer f(q, d) to measure the similarity;
    A query q;
    The recall set D(q) of the query;
Output:
    The re-ranked document list R(q)
1: score = emptyList();
2: for each d ∈ D(q) do
3:     score.append(f(q, d));
4: end for
5: Sort and return R(q);
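Algorithm 1 translates to Python roughly as follows; the toy similarity function stands in for the fine-tuned BioBERT scorer:

```python
def rerank_regular(f, q, D):
    """Algorithm 1: score every document in the recall set, then sort."""
    scores = [(f(q, d), d) for d in D]
    scores.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scores]

# Toy scorer: count of matching characters by position (illustrative only).
f = lambda q, d: sum(a == b for a, b in zip(q, d))
ranked = rerank_regular(f, "bert", ["bird", "best", "bert"])
# → ['bert', 'best', 'bird']
```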

Figure 2. The heatmap shows the top-200 recall set; each pixel represents a document of a given query in a row. White pixels indicate relevant documents.

As shown in Figure 3, the distribution of the maximum scores is not the same as that in Figure 2. If Algorithm 2 is adopted in the re-ranking process, we may mis-retrieve an irrelevant document as a positive one whenever the score of a false-positive document exceeds the threshold t.
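A minimal Python sketch of the plain early stopping strategy (Algorithm 2), using a lookup-table scorer in place of BioBERT; the document scores are hypothetical:

```python
def rerank_early_stop(f, q, D, t):
    """Algorithm 2: stop inference once one document's score exceeds t."""
    scored = []
    for d in D:
        s = f(q, d)
        scored.append((s, d))
        if s > t:              # high confidence: assume the answer is found
            break
    scored.sort(key=lambda x: x[0], reverse=True)
    return [d for _, d in scored]

# Hypothetical scores for three recalled documents.
scores = {"d1": 1.0, "d2": 5.0, "d3": 2.0}
out = rerank_early_stop(lambda q, d: scores[d], "q", ["d1", "d2", "d3"], t=4.0)
# "d3" is never scored: the loop breaks after "d2" clears the threshold.
```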

To mitigate this problem, we propose an adaptive early stopping re-ranking strategy, shown in Algorithm 3. We believe that the experience-based batch size b can reduce the number of false-positive documents and yield a MAP@3 gain (mean average precision at 3, the leaderboard metric).
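The adaptive variant (Algorithm 3) can be sketched likewise; the stopping test now fires only at batch boundaries, and the scores are again hypothetical:

```python
def rerank_adaptive(f, q, D, t, b):
    """Algorithm 3: consider stopping only when len(score) == k*b, k >= 1."""
    scores, docs = [], []
    for d in D:
        scores.append(f(q, d))
        docs.append(d)
        # Stop only at a batch boundary AND when some score clears t, so a
        # single early false positive cannot end inference on its own.
        if max(scores) > t and len(scores) % b == 0:
            break
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order]

# With b=2 the stopping test runs after documents 2, 4, ...
table = {"d1": 5.0, "d2": 1.0, "d3": 1.0, "d4": 1.0, "d5": 9.0}
out = rerank_adaptive(lambda q, d: table[d], "q", list(table), t=4.0, b=2)
```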

Figure 3. The heatmap shows the relevance score of each (query, document) pair, scored by the previously fine-tuned BioBERT model.

Algorithm 2 Early Stopping Re-ranking Strategy
Input:
    A scorer f(q, d) to measure the similarity;
    A query q;
    The recall set D(q) of the query;
    A threshold t set by experience;
Output:
    The re-ranked document list R(q)
1: score = emptyList();
2: for each d ∈ D(q) do
3:     score.append(f(q, d));
4:     if f(q, d) > t then
5:         break;
6:     end if
7: end for
8: Sort and return R(q);

3 Experiments

3.1 Dataset

The competition provides a large paper dataset, which contains roughly 800K papers, along with paragraphs or sentences that describe the research papers. These pieces of description are mainly drawn from paper text that introduces citations.

For example:

Description: An efficient implementation based on BERT [1] and graph neural network (GNN) [2] is introduced.

Related Papers:
[1] BERT: Pre-training of deep bidirectional transformers for language understanding.
[2] Relational inductive biases, deep learning, and graph networks.

Our task is to retrieve the related paper from the candidateset (800K papers), given the description.

Algorithm 3 Adaptive Early Stopping Re-ranking Strategy
Input:
    A scorer f(q, d) to measure the similarity;
    A query q;
    The recall set D(q) of the query;
    A threshold t set by experience;
    A batch size b set by experience;
Output:
    The re-ranked document list R(q)
1: score = emptyList();
2: for each d ∈ D(q) do
3:     score.append(f(q, d));
4:     if max(score) > t and score.length == k · b (k ≥ 1) then
5:         break;
6:     end if
7: end for
8: Sort and return R(q);

3.2 Validation of Our Proposed Method

To validate our proposed method, we conduct several experiments on a tiny validation set with 100 samples. As shown in Table 1, the regular re-ranking strategy can be considered the performance upper-bound, with the highest time cost. In contrast, our proposed early stopping strategy shows the lowest time cost but also the highest performance loss. A possible explanation is the existence of false-positive documents; a higher empirical threshold t or the adaptive re-ranking strategy would help. Finally, experiments show that the adaptive re-ranking strategy, a middle ground between the two, saves about 40% of the re-ranking time while sacrificing the least performance.

Table 1. Comparison of three algorithms, tested on 100 samples. The empirical threshold t is set to 3.0.

Method                    K    MAP@3   Time Cost
Regular Strategy          5    0.4217  1x
Regular Strategy          10   0.4633  2x
Regular Strategy          50   0.5350  10x
Regular Strategy          100  0.5217  20x
Early Stopping            5    0.4217  0.5x
Early Stopping            10   0.4533  1x
Early Stopping            50   0.5200  5x
Early Stopping            100  0.5067  10x
Adaptive Early Stopping   5    0.4214  0.6x
Adaptive Early Stopping   10   0.4667  1.4x
Adaptive Early Stopping   50   0.5350  6.6x
Adaptive Early Stopping   100  0.5217  13.0x

3.3 Test Set

We select several commits on the test leaderboard. As shown in Table 2, the regular strategy shows the lowest MAP@3. With increasing k, MAP@3 increases.


Table 2. Performance on the Test Set.

Method                              MAP@3
Regular Strategy (k=50)             0.3776
+Adaptive Early Stopping (k=100)    0.3890
+Adaptive Early Stopping (k=200)    0.3974
+Adaptive Early Stopping (k=400)    0.4022
+Adaptive Early Stopping (k=600)    0.4029

4 Conclusion

In this competition, we implemented a simple but effective system for the WSDM - DiggSci 2020 competition, and our end-to-end pipeline is clean and highly extensible, without the help of feature engineering. Our solution ranked 2nd on the validation set and 4th on the test set; without the data leak, our solution ranked 1st on the test set.

Our solution can be summarized as follows.

1. We performed data cleaning on the dataset according to self-designed saliency-based rules and removed the redundant data.

2. Before re-ranking, we used the BM25 metric. For faster calculation, we adopted CuPy to accelerate the computation: compared with the traditional algorithm, the matrix multiplication with CuPy can be done in 15 minutes on a single GPU card.

3. We used the fine-tuned BioBERT model to score every (query, document) pair and designed a novel early stopping strategy for re-ranking based on the confidence score, avoiding up to 40% of the unnecessary BERT inference time.
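The GPU-accelerated BM25 scoring of step 2 can be sketched with NumPy as below; since CuPy is largely NumPy-compatible, `import cupy as np` would move the same computation to the GPU. The matrix shapes and precomputed term weights are illustrative assumptions:

```python
import numpy as np  # with CuPy installed, `import cupy as np` is a near drop-in swap

# Illustrative shapes: 3 queries, 5 documents, vocabulary of 4 terms.
# Q holds query term indicators; W holds precomputed per-document BM25
# term weights (IDF and length normalization already folded in).
rng = np.random.default_rng(0)
Q = rng.random((3, 4))
W = rng.random((4, 5))

scores = Q @ W                             # (3, 5): recall score per pair
top = np.argsort(-scores, axis=1)[:, :2]   # top-2 recalled docs per query
```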

5 Future Work

Though the adaptive early stopping strategy is effective, two hyper-parameters still have to be set by human experience. It would be worthwhile to develop a method that makes our proposed approach fully automated.

6 Acknowledgments

We greatly appreciate the help of Yanming Shen, who provided an 8-GPU server for 4 days. Special thanks also go to Haibo Wang for his proofreading effort.

References

[1] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.

[2] K. Sparck Jones, Steve Walker, and Stephen Robertson. 2000. A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management 36, 6 (2000), 779–808.

[3] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746 (2019).

[4] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).

[5] Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. (2015), 373–382.