Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding


Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding. Ariel Schwartz, Anna Divoli, and Marti Hearst, University of California, Berkeley. Supported in part by NSF DBI 0317510.


  • Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding
    Ariel Schwartz, Anna Divoli, and Marti Hearst
    University of California, Berkeley
    Supported in part by NSF DBI 0317510

  • Bioscience literature
    Rich, complex and fast growing.
    Online full text enables new forms of automatic document analysis, including caption search and citation sentence analysis.
    Citances:
      Nearly every statement in a bioscience journal article is backed up by a citation.
      It is common for papers to be cited 30-100 times.
      The text around the citation tends to state biological facts from the target paper. We term these citation sentences, or citances.
      Different citances state similar facts in different ways.

  • Papers are cited for specific facts, to the point that many important facts in the field can be found in citation sentences alone!

  • Using citances
    Potential applications of citances: creation of training and testing data for semantic analysis, synonym set creation, database curation, document summarization, and information retrieval generally.

    Nakov, Schwartz and Hearst. Citances: Citation Sentences for Semantic Analysis of Bioscience Text. SIGIR'04 Workshop on Search and Discovery in Bioinformatics.
    All these applications require citance word alignments.
    Align together concepts that are semantically related in the context of the target paper.
    Related concepts can be expressed in several different ways in the citances.
    We focus here on the multiple citance alignment (MCA) problem.

  • Example of unaligned citances
    In response to genotoxic stress, Chk1 and Chk2 phosphorylate Cdc25A on N-terminal sites and target it rapidly for ubiquitin-dependent degradation (Mailand et al, 2000, 2002; Molinari et al, 2000; Falck et al, 2001; Shimuta et al, 2002; Busino et al, 2003), which is thought to be central to the S and G2 cell cycle checkpoints (Bartek and Lukas, 2003; Donzelli and Draetta, 2003).
    Given that Chk1 promotes Cdc25A turnover in response to DNA damage in vivo (Falck et al. 2001; Sorensen et al. 2003) and that Chk1 is required for Cdc25A ubiquitination by SCF-β-TRCP in vitro, we explored the role of Cdc25A phosphorylation in the ubiquitination process.
    Since activated phosphorylated Chk2-T68 is involved in phosphorylation and degradation of Cdc25A (Falck et al., 2001, Falck et al., 2002; Bartek and Lukas, 2003), we also examined the levels of Cdc25A in 2fTGH and U3A cells exposed to γ-IR.

  • Goal: Align similar concepts
    response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints
    Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process
    activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR

  • Multiple citance alignment (MCA)
    Goal: Partition the citance words/phrases into equivalence classes based on semantic homology.
    Orthographic similarity is important but does not always entail semantic homology. Example pairs:
      phosphorylate / phosphorylation
      cell cycle / U3A cells
      genotoxic stress / DNA damage
    Related problems:
      Multiple sequence alignment (MSA) in genomics.
      Pairwise word alignment in statistical machine translation (SMT).

  • Formal definition of MCA
    A pairwise citance alignment of citances $C_i$ and $C_j$ is an equivalence relation $\sim_{ij}$. $c_i^k \sim_{ij} c_j^l$ means that the $k$th word in the $i$th citance is aligned to the $l$th word in the $j$th citance.

    Multiple citance alignment (MCA) is an equivalence relation $\sim$, defined as the transitive closure of the union of all pairwise citance alignments: $\sim \;=\; \bigl(\bigcup_{i,j} \sim_{ij}\bigr)^{+}$
    The transitive closure ensures that the equivalence classes (colors) are consistent across all pairwise citance alignments.
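
    The transitive closure of a union of pairwise alignments can be computed with a union-find structure. The sketch below is only an illustration of that definition, not the authors' implementation; the function name `multiple_alignment` and the `(citance, word)` index pairs are assumptions made for the example.

```python
from collections import defaultdict


def multiple_alignment(pairwise_links):
    """Group word positions into equivalence classes.

    Each link is ((i, k), (j, l)): word k of citance i is aligned to
    word l of citance j (hypothetical encoding chosen for this sketch).
    """
    parent = {}

    def find(x):
        # Register unseen positions as their own class representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merging the two classes of every link yields the transitive closure.
    for a, b in pairwise_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = defaultdict(set)
    for x in list(parent):
        classes[find(x)].add(x)
    return sorted(sorted(c) for c in classes.values())


links = [
    ((1, 6), (2, 2)),  # Cdc25A in C1 ~ Cdc25A in C2
    ((2, 2), (3, 5)),  # Cdc25A in C2 ~ Cdc25A in C3
    ((1, 5), (3, 3)),  # phosphorylate in C1 ~ phosphorylation in C3
]
classes = multiple_alignment(links)
print(classes)
```

    Here positions (1, 6) and (3, 5) end up in the same class even though no direct C1-C3 link between them was given, which is the consistency property the transitive closure provides.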

  • Algorithm outline
    We developed an MCA algorithm based on:
      An extension of our posterior decoding algorithm for MSA (AMAP, Schwartz and Pachter, ECCB 2006).
      A modified version of the SMT pairwise word alignment model of Blunsom & Cohn (ACL 2006), used to calculate posterior probabilities.


  • Utility function for MCA
    Requirements for a good utility function:
      Correlated with the accuracy measure used for evaluation.
      Easily decomposable, for direct optimization using posterior decoding.
      Metric-based (optional):
        Captures an intuitive notion of distance.
        Triangle inequality provides bounds on the search space.
    AER and F-measure do not satisfy these criteria.

  • Alignment Metric Accuracy (AMA)
    We extend AMA (Schwartz et al 2006), a utility function for one-to-one MSA, to many-to-many MCA.

    Intuitively, $U_{AMA}$ measures the average word-level agreement between the predicted and reference MCAs.

    $U_{set\_agreement}$ is a score assigned to each word position based on the overlap between the sets of word positions the two alignments align to it.
      Can use Dice, Jaccard, or Hamming, for example.
      We use the Braun-Blanquet coefficient.
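
    A minimal sketch of the word-level agreement idea, assuming each alignment is given as a mapping from a word position to the set of positions aligned to it. The Braun-Blanquet coefficient is taken in its standard form, |A ∩ B| / max(|A|, |B|); treating two empty sets as full agreement is an assumption of this sketch, and the function names are illustrative rather than taken from the authors' code.

```python
def braun_blanquet(a, b):
    """|A ∩ B| / max(|A|, |B|); two empty sets count as agreement (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / max(len(a), len(b))


def dice(a, b):
    """2|A ∩ B| / (|A| + |B|), with the same empty-set convention."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, with the same empty-set convention."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ama(predicted, reference, agreement=braun_blanquet):
    """Average the per-word agreement scores over all word positions."""
    positions = sorted(predicted.keys() | reference.keys())
    if not positions:
        raise ValueError("no word positions to score")
    scores = [
        agreement(predicted.get(p, set()), reference.get(p, set()))
        for p in positions
    ]
    return sum(scores) / len(scores)


predicted = {"a": {"x"}, "b": set(), "c": {"x", "y"}}
reference = {"a": {"x"}, "b": {"y"}, "c": {"x"}}
score = ama(predicted, reference)
print(score)
```

    In this toy input, word "a" agrees fully, word "b" not at all, and word "c" gets partial credit, so every word contributes a score between 0 and 1 to the average.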

  • Example of AMA for MCA
    Every word gets a score between 0 and 1 based on its level of agreement with the reference alignment.
    AMA is the average word score.
    In this example AMA = 13.83 / 20 = 0.692.
    Sum of pairs is used for multiple alignments.

  • Controlling the recall/precision tradeoff
    In addition, two free parameters (match-factor and gap-factor) are added to provide control of the recall/precision tradeoff.
    The result is the following utility function:


  • Motivation for using a CRF model
    Small annotated sets for training, development, and testing.
    Main challenge is to perform well on unseen words.
    Requires a discriminative model that:
      can use different overlapping features,
      can incorporate contextual information,
      allows for computation of posterior probabilities.

  • CRF-based SMT word alignment
    Blunsom and Cohn (ACL 2006) developed a CRF-based pairwise word alignment model for SMT.
      Directional model: every source word can be mapped to zero or one target words.
      Uses Viterbi decoding.
      Features are functions of the implied source-target word pairs.
    We modified the program to support MCA.
      Compute the directional marginal posterior probabilities using the forward-backward algorithm:

      Modified features.
      Implementation of a posterior-decoding algorithm for MCA instead of the Viterbi decoding used for pairwise SMT word alignment.
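
    For readers unfamiliar with the forward-backward algorithm, the sketch below computes posterior marginals for a generic linear-chain model in plain Python. It is not the modified Blunsom and Cohn program: the `unary` and `trans` score tables are assumed inputs, and reading chain positions as source words and states as candidate target words (plus a null state) is one possible interpretation, stated here as an assumption.

```python
def forward_backward(unary, trans):
    """Posterior marginals P(state at position t = s) for a linear chain.

    unary[t][s]: non-negative score of state s at position t.
    trans[r][s]: non-negative score of moving from state r to state s.
    Scores are unnormalized; a real implementation would typically work
    in log space or rescale to avoid underflow on long sequences.
    """
    n_pos, n_states = len(unary), len(unary[0])
    fwd = [[0.0] * n_states for _ in range(n_pos)]
    bwd = [[1.0] * n_states for _ in range(n_pos)]

    # Forward pass: total score of all prefixes ending in state s at t.
    fwd[0] = list(unary[0])
    for t in range(1, n_pos):
        for s in range(n_states):
            incoming = sum(fwd[t - 1][r] * trans[r][s] for r in range(n_states))
            fwd[t][s] = unary[t][s] * incoming

    # Backward pass: total score of all suffixes following state r at t.
    for t in range(n_pos - 2, -1, -1):
        for r in range(n_states):
            bwd[t][r] = sum(
                trans[r][s] * unary[t + 1][s] * bwd[t + 1][s]
                for s in range(n_states)
            )

    z = sum(fwd[n_pos - 1])  # partition function
    return [
        [fwd[t][s] * bwd[t][s] / z for s in range(n_states)]
        for t in range(n_pos)
    ]


unary = [[1.0, 2.0], [3.0, 1.0], [1.0, 1.0]]
trans = [[2.0, 1.0], [1.0, 3.0]]
posteriors = forward_backward(unary, trans)
for row in posteriors:
    print([round(p, 3) for p in row])
```

    Each row of the result is a probability distribution over states for one position, which is the kind of per-word posterior that a posterior-decoding step can consume.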


  • Posterior decoding algorithm for MCA
    For every pair of citances, compute the directional posterior probabilities using a CRF.
    For every target word w, compute the combination of source words that maximizes the expected utility of w.
    The (undirected) multiple word alignment is produced by taking the transitive closure of the union of the individual word optimal alignments:
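
    The per-word step can be illustrated with a brute-force sketch. It makes several simplifying assumptions that are not from the slides: source words are treated as aligning to the target word independently, the utility is the plain Braun-Blanquet set agreement without the match-factor and gap-factor terms, and all subsets are enumerated, which is only feasible for a handful of candidate words. The names `best_source_set` and `set_agreement` are hypothetical.

```python
from itertools import combinations


def set_agreement(a, b):
    """Braun-Blanquet overlap; two empty sets count as agreement (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / max(len(a), len(b))


def subsets(items):
    """Yield every subset of items as a frozenset, smallest first."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def best_source_set(posteriors):
    """Pick the set of source words with maximal expected utility.

    posteriors maps each candidate source word to the posterior
    probability that it aligns to the target word under consideration.
    """

    def probability(true_set):
        # Independence across source words is an assumption of this sketch.
        p = 1.0
        for word, q in posteriors.items():
            p *= q if word in true_set else 1.0 - q
        return p

    def expected_utility(predicted):
        return sum(
            probability(true_set) * set_agreement(predicted, true_set)
            for true_set in subsets(posteriors)
        )

    return max(subsets(posteriors), key=expected_utility)


choice = best_source_set({"Chk1": 0.9, "promotes": 0.1})
print(sorted(choice))
```

    With these toy posteriors the confident candidate is kept and the unlikely one is dropped; when every candidate is unlikely, the empty set (leaving the target word unaligned) has the highest expected utility.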

  • Decoding Example
    C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
    C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1
    C3: Chk2 T68 involved phosphorylation degradation Cdc25A
    Target / Source:
      C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
      C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1
      C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
      C3: Chk2 T68 involved phosphorylation degradation Cdc25A
    Later on in the decoding process:
      C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1
      C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A
      C3: Chk2 T68 involved phosphorylation degradation Cdc25A
    Target / Source:
      C3: Chk2 T68 involved phosphorylation degradation Cdc25A
      C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A

  • Data sets
    3 sets of citances annotated by a PhD with biological training:
      Training set - 4 groups, 10 citances each (180 pairs).
      Development set - 51 citances (1275 pairs).
      Test set - 45 citances (990 pairs).
    Feature engineering using the training and development sets.
    Final results based on a model trained on the training and development sets combined, and tested on the test set.
    Baseline using only normalized edit distance with a simple cutoff.
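
    The baseline can be sketched as follows. The slides do not say how the edit distance is normalized or which cutoff was used, so dividing the Levenshtein distance by the longer word's length and the cutoff value of 0.3 are both assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


def baseline_aligned(a, b, cutoff=0.3):
    """Align two words when their normalized distance is within the cutoff.

    The cutoff of 0.3 is a hypothetical value, not one from the slides.
    """
    return normalized_edit_distance(a, b) <= cutoff


for pair in [("phosphorylate", "phosphorylation"),
             ("Chk1", "Chk2"),
             ("genotoxic", "DNA")]:
    print(pair, normalized_edit_distance(*pair), baseline_aligned(*pair))
```

    Under these assumed settings a purely orthographic rule aligns Chk1 with Chk2 and never aligns "genotoxic" with "DNA", which shows why string similarity alone cannot capture semantic homology.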

  • Features for MCA
    Orthographic features:
      exact string match,
      normalized edit distance,
      prefix, suffix match,
      word lengths,
      capitalization.
    Local contextual features:
      distance between target words of adjacent source words,
      word-specific tendency to align like the previous/next word,
      transition to, from, and between (un)aligned words.
    Biological ontology based features:
      Medical Subject Headings (MeSH),
      gene synonyms (Entrez Gene, UniProt, OMIM).
    Lexical features:
      WordNet similarity (Lin, 1998).
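
    As a rough illustration of what the orthographic feature group might look like for one source-target word pair, here is a small sketch. The feature names and exact definitions are hypothetical; the slides only name the feature types, and normalized edit distance is left out here to keep the example short.

```python
def common_prefix_len(a, b):
    """Number of leading characters the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def orthographic_features(source, target):
    """Build a feature dictionary for a source-target word pair.

    All feature names are illustrative, not taken from the authors' system.
    """
    return {
        "exact_match": source == target,
        "prefix_len": common_prefix_len(source, target),
        # A shared suffix is a shared prefix of the reversed strings.
        "suffix_len": common_prefix_len(source[::-1], target[::-1]),
        "source_len": len(source),
        "target_len": len(target),
        "both_capitalized": source[:1].isupper() and target[:1].isupper(),
    }


features = orthographic_features("phosphorylate", "phosphorylation")
print(features)
```

    Because features like these overlap (an exact match also implies a full-length prefix and suffix match), they suit a discriminative model that does not require independent features.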

  • Results on pairwise alignments
    Unlike Viterbi decoding, posterior decoding (PD) enables refined control of the recall/precision tradeoff.
    Viterbi_Union (0.531 recall at 0.913 precision) is comparable to PD with match-factor and gap-factor both set to 1 (0.540 recall at 0.909 precision).
    However, PD allows recall to be increased significantly by increasing match-factor and decreasing gap-factor (0.636 recall at 0.517 precision for match-factor = 1.2 and gap-factor = 0.1, or 0.742 recall at 0.198 precision for match-factor = 1.5 and gap-factor = 0.05).

  • Results on MCA
    The two curves overlap in the range between 0.52 and 0.55 recall (0.84 and 0.9 precision). Orthographic similarity is the dominant feature in this range.
    Unlike the baseline, the CRF+PD system keeps improving recall without a sharp drop in precision up to 0.636 recall at 0.748 precision. This is due to the incorporation of multiple overlapping features.
    The CRF+PD system also achieves better precision than the baseline (0.982 precision at 0.381 recall vs. 0.937 precision at 0.346 recall).

  • Error analysis
    Performed error analysis on the MCA with the best F-measure (0.690).
    Out of 1400 unique errors, 1194 (85.3%) are false negatives and 206 (14.7%) are false positives.
    Most errors are due to misalignment of:
      subtypes (cdc, cdc6, cdc25A),
      opposites (phosphorylated and unphosphorylated),
      and complex entities (cell cycle and cell line).
    Many FN errors come from failing to align entities in just 4 equivalence classes (e.g., 97 FN in the class of motif, site and domain).
    Other types of errors:
      not aligning plural and singular forms of the same entities,
      aligning only part of multi-word entities,
      and incorrectly aligning orthographically similar entities.

  • Contributions
    Defined the MCA problem.
    Developed a posterior-decoding algorithm for MCA.
    Advantages of posterior decoding over Viterbi:
      Directly optimizes the expected (metric-based) utility.
      Control of the recall/precision tradeoff.
    Developed AMA for MCA:
      A metric-based accuracy measure for MCA.
      Balances recall and precision in one measure.
      The expected AMA can be optimized directly with posterior decoding (unlike AER or F-measure).
      Can also be used for SMT alignments.