Text Entailment
Darsh Shah
Pratyaksh Sharma
Introduction to Textual Entailment
Textual Entailment can be defined as the phenomenon of inferring one text from another.
A text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true.
Definition Continued
This definition is very strict: it requires the truthfulness of h in all instances where t is true.
Example
T: Sachin received an award for batsmanship from the ICC.
H: The God of Cricket received an award.
Definition Continued
T entails H only when Sachin is Sachin Tendulkar. This is the more likely situation, but it is not always true.
So a modified definition is required.
Applied Definition: A text t entails a hypothesis h if a human reading t would infer that h is most likely true.
Mathematical Definition
Hypothesis h is entailed by text t if P(h is true | t) > P(h is true),
where P(h is true | t) is the Entailment Confidence and can be considered a measure of the certainty of entailment.
Entailment Triggers
Semantic phenomena significant to Textual Entailment:
T: Sachin achieved the milestone of 100 centuries in his career.
H: Sachin attained the milestone of 100 centuries in his career.
The two words, achieved and attained, are synonyms.
Generalizations or specializations of concepts in Text or Hypothesis can affect entailment
Example
T: Sachin Tendulkar is a cricketer.
H: Sachin Tendulkar is a sportsman.
Here sportsman is a generalization of cricketer.
Other triggers include verb entailment and entailment through a change of quantifiers.
Polarity, factivity, implicative verbs, and iteratives can also lead to entailment.
Applications of Textual Entailment
Textual entailment is useful in many natural language processing applications, such as Question Answering (QA), Information Extraction (IE), (multi-document) summarization, and Machine Translation (MT) evaluation.
Information Retrieval
Textual entailment impacts IR in at least two ways
The notion of relevance bears a strong similarity to that of entailment.
Textual entailment can be used to find affinities between words, which can be used to compute an extended similarity between documents and queries.
Question Answering
A given text T is retrieved for a question Q.
All entities in the text T are substituted as potential answers to obtain candidate hypotheses H1, H2, ..., Hn.
We then pick the best-entailed Hi for the given text T as the answer to the question Q.
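The substitute-and-entail loop above can be sketched in a few lines of Python. This is a toy illustration: `make_hypothesis` and `entailment_score` are hypothetical stand-ins for a real hypothesis-template generator and entailment engine.

```python
def answer_question(text, entities, make_hypothesis, entailment_score):
    """Entailment-based QA sketch: substitute each candidate entity into
    the hypothesis template and return the best-entailed candidate."""
    candidates = [(entailment_score(text, make_hypothesis(e)), e)
                  for e in entities]
    # Pick the entity whose hypothesis is most strongly entailed by T.
    return max(candidates)[0:2][1]
```

In practice the entailment score would come from one of the systems described later in these slides; any of the lexical scores below can be plugged in.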
Machine Translation Evaluation
Machine Translation evaluation involves comparing the machine-translated sentence with a reference output.
Textual entailment helps in this case, as it gives a measure of the similarity of the information conveyed by the reference and the machine output.
Miscellaneous
Equivalence between two texts can be checked by applying textual entailment in both directions. This is useful for novelty detection, copy detection, etc.
Text simplification: substituting complex phrases with simpler phrases, producing sentences that are grammatically correct and convey the meaning in a simpler way.
Some basic approaches implemented
Plain word matching
1. Calculate the matching words between the text and the hypothesis.
2. Score = (# matching_words) / (# words in hypothesis)
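The scoring rule above fits in a few lines of Python. This is a toy sketch (real systems would tokenize more carefully); the 0.55 threshold is the one tuned on the development set in the results that follow.

```python
def word_match_score(text, hypothesis):
    """Score = fraction of hypothesis words that also appear in the text."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    matches = sum(1 for w in hyp_words if w in text_words)
    return matches / len(hyp_words)

def entails(text, hypothesis, threshold=0.55):
    """Declare entailment when the score exceeds a tuned threshold."""
    return word_match_score(text, hypothesis) > threshold
```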
Results
We need to calculate an entailment threshold above which we'll declare entailment. We find the threshold giving the best accuracy on the training set.
With a threshold of 0.55, we get an accuracy of 0.6138 on the RTE2 development set.
Test Cases
T: The Rolling Stones kicked off their latest tour on Sunday with a concert at Boston's Fenway Park.
H: The Rolling Stones have begun their latest tour with a concert in Boston.
YES. Correctly identified.
Test Cases
T: Craig Conway, fired as PeopleSoft's chief executive officer before the company was bought by Oracle, was in England last week.
H: Craig Conway works for Oracle.
NO. Fails to identify; calls it a yes.
Conclusion: too inaccurate a method.
It can't differentiate between a sentence and its negation in the simplest form; it might say that the entailment holds.
Some basic approaches
Plain lemma matching
1. Lemmatize the text and the hypothesis.
2. Calculate the matching lemmas between the two.
3. Score = (# matching_lemmas) / (# lemmas in hypothesis)
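The lemma-matching variant can be sketched as below. The suffix-stripping `toy_lemmatize` is a hypothetical stand-in for a real lemmatizer (e.g. WordNet-based), included only to keep the example self-contained.

```python
def toy_lemmatize(word):
    """Stand-in for a real lemmatizer: strips a few common English
    suffixes. Purely illustrative, not linguistically complete."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemma_match_score(text, hypothesis):
    """Score = fraction of hypothesis lemmas found among the text lemmas."""
    text_lemmas = {toy_lemmatize(w) for w in text.lower().split()}
    hyp_lemmas = [toy_lemmatize(w) for w in hypothesis.lower().split()]
    return sum(1 for l in hyp_lemmas if l in text_lemmas) / len(hyp_lemmas)
```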
Results
We need to calculate an entailment threshold above which we'll declare entailment.
With a threshold of 0.63, we get an accuracy of 0.625 on the RTE2 development set.
Test Case
T: Sunday's earthquake was felt in the southern Indian city of Madras on the mainland, as well as other parts of south India. The Naval meteorological office in Port Blair said it was the second biggest aftershock after the Dec. 26 earthquake.
H: The city of Madras is located in Southern India.
YES. Entails correctly.
Test Case
T: ECB spokeswoman, Regina Schueller, declined to comment on a report in Italy's La Repubblica newspaper that the ECB council will discuss Mr. Fazio's role in the takeover fight at its Sept. 15 meeting.
H: Regina Schueller works for Italy's La Repubblica newspaper.
NO. Entails incorrectly.
Observations: again, not dependable for even moderately complicated sentences.
Some basic approaches
Lemma + POS matching
1. Lemmatize the text and the hypothesis.
2. Label with POS tags.
3. Calculate the number of matching (lemma, POS_tag) pairs between the two.
4. Score = (# matches) / (# lemmas in hypothesis)
Results
We need to calculate an entailment threshold above which we'll declare entailment.
With a threshold of 0.63, we get an accuracy of 0.6225 on the RTE2 development set.
Test Case
T: It is also an acronym that stands for Islamic Resistance Movement, a militant Islamist Palestinian organization that opposes the existence of the state of Israel and favors the creation of an Islamic state in Palestine.
H: The Islamic Resistance Movement is also known as the Militant Islamic Palestinian Organization.
NO. Fails to classify correctly.
Some basic approaches
Using the BLEU algorithm
Basically, the algorithm looks for n-gram coincidences between a candidate text and a reference.
It can be used as a basic lexical-level benchmark for other textual entailment methods.
BLEU algorithm
For several values of N (typically from 1 to 4), calculate the percentage of n-grams from the hypothesis that appear in the text.
Combine the scores obtained for each value of N as a weighted linear average.
BLEU algorithm
Apply a brevity factor to penalise short texts (which may have n-grams in common with the references, but may be incomplete).
The higher the BLEU score, the higher the entailment. Learn a threshold for the BLEU score from the development set.
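A minimal sketch of this BLEU-style score is below. It uses the weighted linear average described above with uniform weights and omits the brevity factor, so it is an illustrative simplification rather than full BLEU.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_style_score(text, hypothesis, max_n=4):
    """Average over n = 1..max_n of the fraction of hypothesis n-grams
    that also occur in the text (counts clipped to text occurrences)."""
    t, h = text.lower().split(), hypothesis.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        h_ngrams = Counter(ngrams(h, n))
        if not h_ngrams:
            continue  # hypothesis shorter than n
        t_ngrams = Counter(ngrams(t, n))
        hits = sum(min(c, t_ngrams[g]) for g, c in h_ngrams.items())
        precisions.append(hits / sum(h_ngrams.values()))
    return sum(precisions) / len(precisions)
```

Entailment would then be declared when this score exceeds a learned threshold (0.0585 in the results that follow).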
Results from BLEU
Learned threshold = 0.0585, which means only 5.85% of n-grams need to match for us to declare entailment!
Still, the accuracy on the RTE2 development set with this parameter is 0.6050.
Test Cases
T: Patricia Amy Messier and Eugene W. Weaver were married May 28 at St. Clare Roman Catholic Church in North Palm Beach.
H: Eugene W. Weaver is the husband of Patricia Amy.
YES. Entails correctly, possibly because of the very low threshold used. Other systems fail to predict this.
Conclusion
It fails to capture deep semantic relations in sentence pairs like the previous ones.
It can be used as a baseline technique, and it is quick to evaluate.
A Discourse Commitment-Based Framework for Recognizing Textual Entailment
A new framework for recognizing Textual Entailment that depends on the set of publicly held beliefs, known as discourse commitments, that can be ascribed to the author of a text or a hypothesis.
Inspiration for the approach
Shallow approaches had been moderately successful in the previous two RTE challenges.
These approaches fail as sentences become longer and more syntactically complex.
Formal Definition of the Problem
Given a commitment set {ct}, consisting of the set of discourse commitments inferable from a text t, and a hypothesis h, define the task of RTE as a search for the commitment c ∈ {ct} which maximizes the likelihood that t textually entails h.
System Architecture
Extracting Discourse Commitments
After preprocessing, heuristics are used to extract discourse commitments:
Sentence Segmentation, Syntactic Decomposition, Supplementary Expressions, Relation Extraction, Coreference Resolution.
Commitment Selection
Following commitment extraction, a word alignment technique first introduced in (Taskar et al., 2005b) is used to select the commitment extracted from t (henceforth, ct) which represents the best alignment for each of the commitments extracted from h (henceforth, ch).
The alignment of two discourse commitments can be cast as a maximum weighted matching problem in which each pair of words (ti, hj) in a commitment pair (ct, ch) is assigned a score sij(t, h) corresponding to the likelihood that ti is aligned to hj.
The model is trained to compute a set of parameters w which maximize the number of correct alignment predictions (y) on a given training set (x).
Features used in the model
string features (including Levenshtein edit distance, string equality, and stemmed string equality)
lexico-semantic features (including WordNet similarity and named entity similarity/equality)
word association features
Following alignment, the method uses the sum of the edge scores and searches for the ct that represents the reciprocal best hit.
That is, it selects a commitment pair (ct, ch) where ct was the top-scoring alignment candidate for ch and ch was the top-scoring alignment candidate for ct.
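The reciprocal-best-hit selection can be sketched as follows. The learned alignment score (sum of edge scores) is stubbed here with a hypothetical word-overlap `align_score`, just to make the example self-contained.

```python
def align_score(c_t, c_h):
    """Hypothetical stand-in for the learned alignment score
    (sum of edge scores): here, simple word overlap."""
    return len(set(c_t.split()) & set(c_h.split()))

def reciprocal_best_hits(t_commitments, h_commitments):
    """Return (ct, ch) pairs where each commitment is the other's
    top-scoring alignment candidate."""
    best_for_h = {ch: max(t_commitments, key=lambda ct: align_score(ct, ch))
                  for ch in h_commitments}
    best_for_t = {ct: max(h_commitments, key=lambda ch: align_score(ct, ch))
                  for ct in t_commitments}
    return [(ct, ch) for ch, ct in best_for_h.items()
            if best_for_t[ct] == ch]
```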
Entailment and Results
Textual entailment selection is done based on the decision tree shown in the system architecture.
The following shows the results on the RTE-3 test dataset.
IKOMA
One of the best-performing submissions in RTE-7 (Text Analysis Conference 2011)
Title: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features
Had the highest F-measure (48.00) on the dataset; the next best was 45.13.
Approach
First, calculate an entailment score based on lexical-level matching.
Combine it with machine-learning-based filtering using features obtained from lexical-level, chunk-level, and predicate-argument-structure-level information.
Approach
Role of filtering: to discard T-H pairs that have a high entailment score but are not actually entailed, using features of a higher level than the lexical one.
SENNA is used for analyzing POS of words, word chunks, named entities (NER), and predicate-argument structures.
Knowledge resources used
Acronyms extracted from the corpus: created for organizational names with more than three words.
WordNet
CatVar: contains categorical variations of English lexemes.
Lexical Entailment Score
R: the set of knowledge resources. Tt and Ht: the sets of words in T and H respectively.
freq(t) is the frequency of t in the corpus.
Lexical Entailment Score
match(t, Tt, R) takes the value 1 if the word t corresponds to a word in Tt (also considering synonyms and derived words from R); otherwise match() takes the value 0.
Lexical Entailment Score
The Lexical Entailment Score is calculated for all H-T pairs in the development set, and the threshold that gives the highest micro-average F-measure is chosen. Experiments are also done to find the optimum value of the weighting parameter in equation (1); by testing, 1.8 is found to be optimal.
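Since equation (1) is not reproduced in these slides, the sketch below is only a plausible reading of the score: an inverse-frequency-weighted coverage of hypothesis words, with the weighting exponent defaulting to the 1.8 found above. The exact formula is in the IKOMA paper.

```python
def lexical_entailment_score(t_words, h_words, freq, alpha=1.8,
                             match=lambda w, tw: w in tw):
    """Illustrative LES sketch: rarer hypothesis words count for more.
    `freq` maps word -> corpus frequency; `match` may be replaced by a
    resource-aware matcher (synonyms, derived words).
    This is an assumption about equation (1), not the exact formula."""
    def weight(w):
        return (1.0 / freq.get(w, 1)) ** alpha
    total = sum(weight(w) for w in h_words)
    covered = sum(weight(w) for w in h_words if match(w, t_words))
    return covered / total if total else 0.0
```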
Filtering stage
We train a model that classifies T-H pairs having a high LES as false-positive or true-positive. If the model predicts a T-H pair as a false-positive, we discard that pair from the entailment T-H pair candidates.
Features for classifier
The libsvm package is used, with features such as:
Lexical level:
Entailment score (ent_sc)
Cosine similarity
Entailment score comparing only words with the same POS tag
Features for classifier
Chunk level:
Matching ratios for each chunk type (e.g. NP and VP) in all corresponding chunk pairs
PAS level, for all corresponding PAS pairs:
Matching ratio for each argument type (A0, A1)
Number of negation mismatches
Number of modal verb mismatches
Semantic relation of the two predicates, which can be: same-expression, synonym, antonym, entailment, or no-relation
Computing the features
To acquire the above features at the chunk and PAS levels, we need to detect corresponding pairs that should be checked for entailment.
We also need to detect whether such corresponding pairs are in an entailment relation.
For the first problem
1. Transform all words contained in a PAS into a word vector using a bag-of-words representation.
2. Calculate the cosine similarity for all PAS pairs generated by combining the PASs from T and H.
3. Regard the most similar PAS from T, for each PAS from H, as a corresponding pair.
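The three steps above can be sketched directly (bag-of-words vectors as Counters, purely illustrative):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def corresponding_pairs(t_pas_list, h_pas_list):
    """For each PAS from H (given as a word list), pick the most
    cosine-similar PAS from T as its corresponding pair."""
    pairs = []
    for h_pas in h_pas_list:
        h_vec = Counter(h_pas)
        best = max(t_pas_list,
                   key=lambda t_pas: cosine(Counter(t_pas), h_vec))
        pairs.append((best, h_pas))
    return pairs
```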
For the latter problem
1. For each corresponding pair, we calculate the lexical entailment score between the words of each argument type of the PAS from H (as H in equation 1) and the words of the same argument type of the PAS from T (as T in equation 1).
2. Apply a pre-defined threshold to identify entailment.
Results
Three solvers were submitted:
1. IKOMA1: lexical entailment score + filtering, with the threshold set empirically
2. IKOMA2: same as IKOMA1 with threshold 0
3. IKOMA3: lexical entailment score only
Results
MaxSim: an automatic metric for Machine Translation Evaluation based on maximum similarity.
The metric calculates a similarity score between a pair of English system-reference sentences by comparing information items such as n-grams across the sentence pair.
Unlike most metrics, MaxSim computes a similarity score between pairs of items.
It then finds a maximum-weight matching between the items, such that each item in one sentence is mapped to at most one item in the other sentence.
Evaluation on the WMT07, WMT08, and MT06 datasets shows that MaxSim achieves good correlation with human judgments.
Given a pair of English sentences to be compared, MaxSim performs tokenization, lemmatization using WordNet, and Part-of-Speech (POS) tagging.
Next, all non-alphanumeric tokens are removed.
A set of WordNet synonyms is gathered for each word, which is used for computing similarity.
To calculate a similarity score for a system-reference translation sentence pair, MaxSim extracts and compares n-gram information.
Based on these comparisons or matches across the sentence pair, MaxSim computes precision and recall.
Matching Using N-gram Information
Phases of n-gram matching
To match n-grams, MaxSim goes through a sequence of three phases: lemma and POS matching, lemma matching, and bipartite graph matching.
We illustrate the matching process using unigrams, then describe the extension to bigrams and trigrams.
Lemma and POS-tag matching: an exact match on the n-gram's lemmas and POS tags is applied.
In all n-gram matching, each n-gram in the system translation can match at most one n-gram in the reference translation.
Lemma matching: for the remaining unmatched n-grams, a relaxed condition of matching only the lemma is used.
Bipartite graph matching: for the remaining unmatched unigrams, matches are made by constructing a weighted complete bipartite graph.
The remaining unigrams form the nodes of the graph.
The weights are the sum of the WordNet similarity between two word nodes and the identity function on whether or not they have the same POS tag.
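The three-phase unigram matching can be sketched as follows. Items are (lemma, POS) pairs; `sim` is a stand-in for WordNet similarity, and the final phase is a greedy approximation of the exact maximum-weight bipartite matching that MaxSim solves.

```python
def match_unigrams(sys_items, ref_items, sim):
    """Three-phase unigram matching sketch: exact (lemma, POS), then
    lemma-only, then weighted matching on what remains. Returns a list
    of (sys_item, ref_item, weight) matches."""
    sys_left, ref_left = list(sys_items), list(ref_items)
    matches = []

    def phase(weight_fn):
        # Pair off items greedily by descending weight; each item used once.
        edges = [(weight_fn(s, r), i, j)
                 for i, s in enumerate(sys_left)
                 for j, r in enumerate(ref_left)]
        used_i, used_j = set(), set()
        for w, i, j in sorted(edges, reverse=True):
            if w > 0 and i not in used_i and j not in used_j:
                matches.append((sys_left[i], ref_left[j], w))
                used_i.add(i)
                used_j.add(j)
        for i in sorted(used_i, reverse=True):
            del sys_left[i]
        for j in sorted(used_j, reverse=True):
            del ref_left[j]

    phase(lambda s, r: 1.0 if s == r else 0.0)        # 1: lemma + POS
    phase(lambda s, r: 1.0 if s[0] == r[0] else 0.0)  # 2: lemma only
    phase(lambda s, r: (sim(s[0], r[0]) + (s[1] == r[1])) / 2)  # 3: graph
    return matches
```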
Calculation of F-score
Scoring a Sentence Pair and the Whole Corpus
For a sentence pair s, the MaxSim score is calculated from Fs,n, the F-score defined previously for each n-gram order n.
For the entire corpus, the similarity score is simply the arithmetic mean over all the individual sentence-pair scores.
Evaluation and Results
An alpha of 0.9 is used for these evaluations
References
1. Diana Perez and Enrique Alfonseca, Application of the BLEU algorithm for recognising textual entailments
2. Dan Roth, Recognizing Textual Entailment
3. Yee Seng Chan and Hwee Tou Ng, MaxSim: An Automatic Metric for Machine Translation Evaluation Based on Maximum Similarity
4. Masaaki Tsuchida and Kai Ishikawa, IKOMA at TAC2011: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features
5. Andrew Hickl and Jeremy Bensley, A Discourse Commitment-Based Framework for Recognizing Textual Entailment