Text Entailment
Darsh Shah
Pratyaksh Sharma
Introduction to Textual Entailment
Textual Entailment can be defined as the phenomenon of inferring one text from another.
A text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true.
Definition Continued
This definition is very strict: it requires the truthfulness of h in all instances where t is true.
Example
T: Sachin received an award for batsmanship from the ICC.
H: The God of Cricket received an award.
Definition Continued
T entails H only when Sachin is Sachin Tendulkar. This is the more likely situation, but it is not always true.
So a modified definition is required.
Applied Definition: A text t entails a hypothesis h if a human reading t would infer that h is most likely true.
Mathematical Definition
Hypothesis h is entailed by text t if P(h is true | t) > P(h is true),
where P(h is true | t) is the Entailment Confidence and can be considered a measure of the certainty of entailment.
Entailment Triggers
Semantic phenomena significant to Textual Entailment:
T: Sachin achieved the milestone of 100 centuries in his career.
H: Sachin attained the milestone of 100 centuries in his career.
The two words, achieved and attained, are synonyms.
Generalizations or specializations of concepts in Text or Hypothesis can affect entailment
Example
T: Sachin Tendulkar is a cricketer.
H: Sachin Tendulkar is a sportsman.
Here sportsman is a generalization of cricketer.
Other triggers include verb entailment and entailment through a change of quantifiers.
Polarity, factivity, implicative verbs, and iteratives can also lead to entailment.
Applications of Textual Entailment
Textual entailment is useful in many natural language processing applications, such as Question Answering (QA), Information Extraction (IE), (multi-document) summarization, and Machine Translation (MT) evaluation.
Information Retrieval
Textual entailment impacts IR in at least two ways
The notion of relevance bears a strong similarity to that of entailment.
Textual entailment can be used to find affinities between words, which can be used to compute an extended similarity between documents and queries.
Question Answering
A given text T is retrieved for a question Q.
All entities in the text T are substituted as potential answers to obtain candidate hypotheses H1, H2, ..., Hn.
We then pick the best-entailed Hi for the given text T as the answer to the question Q.
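The substitute-and-entail loop above can be sketched in a few lines of Python. This is a toy illustration: `make_hypothesis` and `entailment_score` are hypothetical stand-ins for a real hypothesis-template generator and entailment engine.

```python
def answer_question(text, entities, make_hypothesis, entailment_score):
    """Entailment-based QA sketch: substitute each candidate entity into
    the hypothesis template and return the best-entailed candidate."""
    candidates = [(entailment_score(text, make_hypothesis(e)), e)
                  for e in entities]
    # Pick the entity whose hypothesis is most strongly entailed by T.
    return max(candidates)[0:2][1]
```

In practice the entailment score would come from one of the systems described later in these slides; any of the lexical scores below can be plugged in.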
Machine Translation Evaluation
Machine Translation evaluation involves comparing the machine-translated sentence with a reference output.
Textual entailment helps in this case, as it gives a measure of the similarity of the information conveyed by the reference and the machine output.
Miscellaneous
Equivalence between two texts can be checked by applying textual entailment in both directions. This is useful for novelty detection, copy detection, etc.
Text simplification: substituting complex phrases with simpler phrases, producing sentences that are grammatically correct and convey the meaning in a simpler way.
Some basic approaches implemented
Plain word matching
1. Calculate the matching words between the text and the hypothesis.
2. Score = (# matching_words) / (# words in hypothesis)
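The scoring rule above fits in a few lines of Python. This is a toy sketch (real systems would tokenize more carefully); the 0.55 threshold is the one tuned on the development set in the results that follow.

```python
def word_match_score(text, hypothesis):
    """Score = fraction of hypothesis words that also appear in the text."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    matches = sum(1 for w in hyp_words if w in text_words)
    return matches / len(hyp_words)

def entails(text, hypothesis, threshold=0.55):
    """Declare entailment when the score exceeds a tuned threshold."""
    return word_match_score(text, hypothesis) > threshold
```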
Results
We need to calculate an entailment threshold above which we'll declare entailment. We find the threshold giving the best accuracy on the training set.
With a threshold of 0.55, we get an accuracy of 0.6138 on the RTE2 development set.
Test Cases
T: The Rolling Stones kicked off their latest tour on Sunday with a concert at Boston's Fenway Park.
H: The Rolling Stones have begun their latest tour with a concert in Boston.
YES. Correctly identified.
Test Cases
T: Craig Conway, fired as PeopleSoft's chief executive officer before the company was bought by Oracle, was in England last week.
H: Craig Conway works for Oracle.
NO. Fails to identify; calls it a yes.
Conclusion: too inaccurate a method.
It can't differentiate between a sentence and its negation in the simplest form; it might say that the entailment holds.
Some basic approaches
Plain lemma matching
1. Lemmatize the text and the hypothesis.
2. Calculate the matching lemmas between the two.
3. Score = (# matching_lemmas) / (# lemmas in hypothesis)
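The lemma-matching variant can be sketched as below. The suffix-stripping `toy_lemmatize` is a hypothetical stand-in for a real lemmatizer (e.g. WordNet-based), included only to keep the example self-contained.

```python
def toy_lemmatize(word):
    """Stand-in for a real lemmatizer: strips a few common English
    suffixes. Purely illustrative, not linguistically complete."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemma_match_score(text, hypothesis):
    """Score = fraction of hypothesis lemmas found among the text lemmas."""
    text_lemmas = {toy_lemmatize(w) for w in text.lower().split()}
    hyp_lemmas = [toy_lemmatize(w) for w in hypothesis.lower().split()]
    return sum(1 for l in hyp_lemmas if l in text_lemmas) / len(hyp_lemmas)
```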
Results
We need to calculate an entailment threshold above which we'll declare entailment.
With a threshold of 0.63, we get an accuracy of 0.625 on the RTE2 development set.
Test Case
T: Sunday's earthquake was felt in the southern Indian city of Madras on the mainland, as well as other parts of south India. The Naval meteorological office in Port Blair said it was the second biggest aftershock after the Dec. 26 earthquake.
H: The city of Madras is located in Southern India.
YES. Entails correctly.
Test Case
T: ECB spokeswoman, Regina Schueller, declined to comment on a report in Italy's La Repubblica newspaper that the ECB council will discuss Mr. Fazio's role in the takeover fight at its Sept. 15 meeting.
H: Regina Schueller works for Italy's La Repubblica newspaper.
NO. Entails incorrectly.
Observations: again, not dependable for even moderately complicated sentences.
Some basic approaches
Lemma + POS matching
1. Lemmatize the text and the hypothesis.
2. Label with POS tags.
3. Calculate the number of matching (lemma, POS_tag) pairs between the two.
4. Score = (# matches) / (# lemmas in hypothesis)
Results
We need to calculate an entailment threshold above which we'll declare entailment.
With a threshold of 0.63, we get an accuracy of 0.6225 on the RTE2 development set.
Test Case
T: It is also an acronym that stands for Islamic Resistance Movement, a militant Islamist Palestinian organization that opposes the existence of the state of Israel and favors the creation of an Islamic state in Palestine.
H: The Islamic Resistance Movement is also known as the Militant Islamic Palestinian Organization.
NO. Fails to classify correctly.
Some basic approaches
Using the BLEU algorithm
Basically, the algorithm looks for n-gram coincidences between a candidate text and a reference.
It can be used as a basic lexical-level benchmark for other textual entailment methods.
BLEU algorithm
For several values of N (typically from 1 to 4), calculate the percentage of n-grams from the hypothesis that appear in the text.
Combine the scores obtained for each value of N as a weighted linear average.
BLEU algorithm
Apply a brevity factor to penalise short texts (which may have n-grams in common with the references, but may be incomplete).
The higher the BLEU score, the higher the entailment. Learn a threshold for the BLEU score from the development set.
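A minimal sketch of this BLEU-style score is below. It uses the weighted linear average described above with uniform weights and omits the brevity factor, so it is an illustrative simplification rather than full BLEU.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_style_score(text, hypothesis, max_n=4):
    """Average over n = 1..max_n of the fraction of hypothesis n-grams
    that also occur in the text (counts clipped to text occurrences)."""
    t, h = text.lower().split(), hypothesis.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        h_ngrams = Counter(ngrams(h, n))
        if not h_ngrams:
            continue  # hypothesis shorter than n
        t_ngrams = Counter(ngrams(t, n))
        hits = sum(min(c, t_ngrams[g]) for g, c in h_ngrams.items())
        precisions.append(hits / sum(h_ngrams.values()))
    return sum(precisions) / len(precisions)
```

Entailment would then be declared when this score exceeds a learned threshold (0.0585 in the results that follow).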
Results from BLEU
Learned threshold = 0.0585, which means only 5.85% of n-grams need to match for us to declare entailment!
Still, the accuracy on the RTE2 development set with this parameter is 0.6050.
Test Cases
T: Patricia Amy Messier and Eugene W. Weaver were married May 28 at St. Clare Roman Catholic Church in North Palm Beach.
H: Eugene W. Weaver is the husband of Patricia Amy.
YES. Entails correctly, possibly because of the very low threshold used. Other systems fail to predict this.
Conclusion
It fails to capture deep semantic relations in sentence pairs like the previous ones.
It can be used as a baseline technique, and it is quick to evaluate.
A Discourse Commitment-Based Framework for Recognizing Textual Entailment
A new framework for recognizing Textual Entailment that depends on the set of publicly held beliefs, known as discourse commitments, that can be ascribed to the author of a text or a hypothesis.
Inspiration for the approach
Shallow approaches had been moderately successful in the previous two RTE challenges.
These approaches fail as sentences become longer and more syntactically complex.
Formal Definition of the Problem
Given a commitment set {ct}, consisting of the set of discourse commitments inferable from a text t, and a hypothesis h, define the task of RTE as a search for the commitment c ∈ {ct} which maximizes the likelihood that t textually entails h.
System Architecture
Extracting Discourse Commitments
After preprocessing, heuristics are used to extract discourse commitments:
Sentence Segmentation, Syntactic Decomposition, Supplementary Expressions, Relation Extraction, Coreference Resolution.
Commitment Selection
Following commitment extraction, a word alignment technique first introduced in (Taskar et al., 2005b) is used to select the commitment extracted from t (henceforth, ct) which represents the best alignment for each of the commitments extracted from h (henceforth, ch).
The alignment of two discourse commitments can be cast as a maximum weighted matching problem in which each pair of words (ti, hj) in a commitment pair (ct, ch) is assigned a score sij(t, h) corresponding to the likelihood that ti is aligned to hj.
The model is trained to compute a set of parameters w which maximize the number of correct alignment predictions (y) on a given training set (x).
Features used in the model
string features (including Levenshtein edit distance, string equality, and stemmed string equality)
lexico-semantic features (including WordNet similarity and named entity similarity/equality)
word association features
Following alignment, the method uses the sum of the edge scores and searches for the ct that represents the reciprocal best hit.
That is, it selects a commitment pair (ct, ch) where ct was the top-scoring alignment candidate for ch and ch was the top-scoring alignment candidate for ct.
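The reciprocal-best-hit selection can be sketched as follows. The learned alignment score (sum of edge scores) is stubbed here with a hypothetical word-overlap `align_score`, just to make the example self-contained.

```python
def align_score(c_t, c_h):
    """Hypothetical stand-in for the learned alignment score
    (sum of edge scores): here, simple word overlap."""
    return len(set(c_t.split()) & set(c_h.split()))

def reciprocal_best_hits(t_commitments, h_commitments):
    """Return (ct, ch) pairs where each commitment is the other's
    top-scoring alignment candidate."""
    best_for_h = {ch: max(t_commitments, key=lambda ct: align_score(ct, ch))
                  for ch in h_commitments}
    best_for_t = {ct: max(h_commitments, key=lambda ch: align_score(ct, ch))
                  for ct in t_commitments}
    return [(ct, ch) for ch, ct in best_for_h.items()
            if best_for_t[ct] == ch]
```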
Entailment and Results
Textual entailment selection is done based on the decision tree shown in the system architecture.
The following shows the results on the RTE-3 test dataset.
IKOMA
One of the best-performing submissions in RTE-7 (Text Analysis Conference 2011)
Title: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features
Had the highest F-measure (48.00) on the dataset; the next best was 45.13.
Approach
First, calculate an entailment score based on lexical-level matching.
Combine it with machine-learning-based filtering using features obtained from lexical-level, chunk-level, and predicate-argument-structure-level information.
Approach
Role of filtering: to discard T-H pairs that have a high entailment score but are not actually entailed, using features of a higher level than the lexical one.
SENNA is used for analyzing POS of words, word chunks, named entities (NER), and predicate-argument structures.
Knowledge resources used
Acronyms extracted from the corpus: created for organizational names with more than three words.
WordNet
CatVar: contains categorical variations of English lexemes.
Lexical Entailment Score
R: the set of knowledge resources. Tt and Ht: the sets of words in T and H respectively.
freq(t) is the frequency of t in the corpus.
Lexical Entailment Score
match(t, Tt, R) takes the value 1 if the word t corresponds to a word in Tt (also considering synonyms and derived words from R); otherwise match() takes the value 0.
Lexical Entailment Score
The Lexical Entailment Score is calculated for all H-T pairs in the development set, and the threshold that gives the highest micro-average F-measure is chosen. Experiments are also done to find the optimum value of the weighting parameter in equation (1); by testing, 1.8 is found to be optimal.
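Since equation (1) is not reproduced in these slides, the sketch below is only a plausible reading of the score: an inverse-frequency-weighted coverage of hypothesis words, with the weighting exponent defaulting to the 1.8 found above. The exact formula is in the IKOMA paper.

```python
def lexical_entailment_score(t_words, h_words, freq, alpha=1.8,
                             match=lambda w, tw: w in tw):
    """Illustrative LES sketch: rarer hypothesis words count for more.
    `freq` maps word -> corpus frequency; `match` may be replaced by a
    resource-aware matcher (synonyms, derived words).
    This is an assumption about equation (1), not the exact formula."""
    def weight(w):
        return (1.0 / freq.get(w, 1)) ** alpha
    total = sum(weight(w) for w in h_words)
    covered = sum(weight(w) for w in h_words if match(w, t_words))
    return covered / total if total else 0.0
```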
Filtering stage
We train a model that classifies T-H pairs having a high LES as false-positive or true-positive. If the model predicts a T-H pair as a false-positive, we discard that pair from the entailment T-H pair candidates.
Features for classifier
The libsvm package is used, with features such as:
Lexical level:
Entailment score (ent_sc)
Cosine similarity
Entailment score comparing only words with the same POS tag
Features for classifier
Chunk level:
Matching ratios for each chunk type (e.g. NP and VP) in all corresponding chunk pairs
PAS level, for all corresponding PAS pairs:
Matching ratio for each argument type (A0, A1)
Number of negation mismatches
Number of modal verb mismatches
Semantic relation of the two predicates, which can be: same-expression, synonym, antonym, entailment, or no-relation
Computing the features
To acquire the above features at the chunk and PAS levels, we need to detect corresponding pairs that should be checked for entailment.
We also need to detect whether such corresponding pairs are in an entailment relation.
For the first problem
1. Transform all words contained in a PAS into a word vector using a bag-of-words representation.
2. Calculate the cosine similarity for all PAS pairs generated by combining the PASs from T and H.
3. Regard the most similar PAS from T, for each PAS from H, as a corresponding pair.
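The three steps above can be sketched directly (bag-of-words vectors as Counters, purely illustrative):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def corresponding_pairs(t_pas_list, h_pas_list):
    """For each PAS from H (given as a word list), pick the most
    cosine-similar PAS from T as its corresponding pair."""
    pairs = []
    for h_pas in h_pas_list:
        h_vec = Counter(h_pas)
        best = max(t_pas_list,
                   key=lambda t_pas: cosine(Counter(t_pas), h_vec))
        pairs.append((best, h_pas))
    return pairs
```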
For the latter problem
1. For each corresponding pair, we calculate the lexical entailment score between the words of each argument type of the PAS from H (as H in equation 1) and the words of the same argument type of the PAS from T (as T in equation 1).
2. Apply a pre-defined threshold to identify entailment.
Results
Three solvers were submitted:
1. IKOMA1: lexical entailment score + filtering, with the threshold set empirically
2. IKOMA2: same as IKOMA1 with threshold 0
3. IKOMA3: lexical entailment score only
Results
MaxSim: an automatic metric for Machine Translation Evaluation based on maximum similarity.
The metric calculates a similarity score between a pair of English system-reference sentences by comparing information items such as n-grams across the sentence pair.
Unlike most metrics, MaxSim computes a similarity score between pairs of items.
It then finds a maximum-weight matching between the items, such that each item in one sentence is mapped to at most one item in the other sentence.
Evaluation on the WMT07, WMT08, and MT06 datasets shows that MaxSim achieves good correlation with human judgments.
Given a pair of English sentences to be compared, MaxSim performs tokenization, lemmatization using WordNet, and Part-of-Speech (POS) tagging.
Next, all non-alphanumeric tokens are removed.
A set of WordNet synonyms is gathered for each word, which is used for computing similarity.
To calculate a similarity score for a system-reference translation sentence pair, MaxSim extracts and compares n-gram information.
Based on these comparisons or matches across the sentence pair, MaxSim computes precision and recall.
Matching Using N-gram Information
Phases of n-gram matching
To match n-grams, MaxSim goes through a sequence of three phases: lemma and POS matching, lemma matching, and bipartite graph matching.
We illustrate the matching process using unigrams, then describe the extension to bigrams and trigrams.
Lemma and POS-tag matching: an exact match on the n-gram's lemmas and POS tags is applied.
In all n-gram matching, each n-gram in the system translation can match at most one n-gram in the reference translation.
Lemma matching: for the remaining unmatched n-grams, a relaxed condition of matching only the lemma is used.
Bipartite graph matching: for the remaining unmatched unigrams, matches are made by constructing a weighted complete bipartite graph.
The remaining unigrams form the nodes of the graph.
The weights are the sum of the WordNet similarity between two word nodes and the identity function on whether or not they have the same POS tag.
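The three-phase unigram matching can be sketched as follows. Items are (lemma, POS) pairs; `sim` is a stand-in for WordNet similarity, and the final phase is a greedy approximation of the exact maximum-weight bipartite matching that MaxSim solves.

```python
def match_unigrams(sys_items, ref_items, sim):
    """Three-phase unigram matching sketch: exact (lemma, POS), then
    lemma-only, then weighted matching on what remains. Returns a list
    of (sys_item, ref_item, weight) matches."""
    sys_left, ref_left = list(sys_items), list(ref_items)
    matches = []

    def phase(weight_fn):
        # Pair off items greedily by descending weight; each item used once.
        edges = [(weight_fn(s, r), i, j)
                 for i, s in enumerate(sys_left)
                 for j, r in enumerate(ref_left)]
        used_i, used_j = set(), set()
        for w, i, j in sorted(edges, reverse=True):
            if w > 0 and i not in used_i and j not in used_j:
                matches.append((sys_left[i], ref_left[j], w))
                used_i.add(i)
                used_j.add(j)
        for i in sorted(used_i, reverse=True):
            del sys_left[i]
        for j in sorted(used_j, reverse=True):
            del ref_left[j]

    phase(lambda s, r: 1.0 if s == r else 0.0)        # 1: lemma + POS
    phase(lambda s, r: 1.0 if s[0] == r[0] else 0.0)  # 2: lemma only
    phase(lambda s, r: (sim(s[0], r[0]) + (s[1] == r[1])) / 2)  # 3: graph
    return matches
```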
Calculation of F-score
Scoring a Sentence Pair and the Whole Corpus
For a sentence pair s, the MaxSim score is calculated from Fs,n, the F-score defined previously for each n-gram order n.
For the entire corpus, the similarity score is simply the arithmetic mean over all the individual sentence-pair scores.
Evaluation and Results
An alpha of 0.9 is used for these evaluations
References
1. Diana Perez and Enrique Alfonseca, Application of the BLEU algorithm for recognising textual entailments
2. Dan Roth, Recognizing Textual Entailment
3. Yee Seng Chan and Hwee Tou Ng, MaxSim: An Automatic Metric for Machine Translation Evaluation Based on Maximum Similarity
4. Masaaki Tsuchida and Kai Ishikawa, IKOMA at TAC2011: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features
5. Andrew Hickl and Jeremy Bensley, A Discourse Commitment-Based Framework for Recognizing Textual Entailment