1
Text Similarity in NLP and its Applications
• Instructor: Paul Tarau, based on Rada Mihalcea’s original slides
2
Why text similarity?
Used everywhere in NLP:
• Information retrieval (query vs. document)
• Text classification (document vs. category)
• Word-sense disambiguation (context vs. context)
• Automatic evaluation: machine translation (gold standard vs. generated); text summarization (summary vs. original)
3
Word Similarity
4
Word Similarity
• Finding similarity between words is a fundamental part of text similarity.
• Words can be similar if:
  - they mean the same thing (synonyms)
  - they mean the opposite (antonyms)
  - they are used in the same way (red, green)
  - they are used in the same context (doctor, hospital, scalpel)
  - one is a type of another (poodle, dog, mammal)
• Lexical hierarchies like WordNet can be useful.
5
WordNet-like Hierarchy
[Figure: a WordNet-like is-a hierarchy. "animal" subsumes amphibian, reptile, mammal, and fish; "mammal" subsumes wolf, dog, cat, and horse; "dog" subsumes dachshund and hunting dog (e.g., terrier); "horse" subsumes stallion and mare.]
6
Knowledge-based word semantic similarity
• (Leacock & Chodorow, 1998): based on the path length between the two concepts in a taxonomy of overall depth D:

  sim_lch = -log( length / (2 * D) )

• (Wu & Palmer, 1994): based on the depth of the least common subsumer (LCS) of the two concepts:

  sim_wup = 2 * depth(LCS) / ( depth(concept1) + depth(concept2) )

• (Lesk, 1986): finds the overlap between the dictionary entries (glosses) of the two words
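These taxonomy measures can be sketched on a toy hierarchy like the one on the earlier slide. Below is a minimal Wu & Palmer implementation in Python, assuming a hand-built child-to-parent map; the map and all concept names are purely illustrative:

```python
PARENT = {  # toy is-a hierarchy (child -> parent), loosely following the slide
    "mammal": "animal", "fish": "animal",
    "wolf": "mammal", "dog": "mammal", "cat": "mammal", "horse": "mammal",
    "hunting dog": "dog", "dachshund": "dog",
    "terrier": "hunting dog", "stallion": "horse", "mare": "horse",
}

def path_to_root(c):
    """List of concepts from c up to the root, inclusive."""
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def depth(c):
    return len(path_to_root(c))  # the root ("animal") has depth 1

def lcs(c1, c2):
    """Least common subsumer: the deepest shared ancestor."""
    ancestors = set(path_to_root(c1))
    for a in path_to_root(c2):  # walks upward, so the first hit is the deepest
        if a in ancestors:
            return a

def sim_wup(c1, c2):
    """Wu & Palmer (1994): 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

print(sim_wup("dachshund", "terrier"))   # LCS is "dog"
print(sim_wup("dachshund", "stallion"))  # LCS is only "mammal"
```

Similar concepts share a deep LCS: dachshund and terrier meet at "dog" (score 2/3), while dachshund and stallion only meet at "mammal" (score 0.5).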
7
Corpus-based + knowledge-based
• Based on information content
• P(C) = probability of seeing a concept of type C in a large corpus = probability of seeing instances of that concept
• Determine the contribution of a word sense based on the assumption of equal sense distributions: e.g., for "plant", 50% of occurrences are sense 1 and 50% are sense 2
• Information content: IC(C) = -log( P(C) )
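A minimal sketch of information content in Python, with made-up corpus counts in which a concept's count includes the instances of everything it subsumes:

```python
import math

# Hypothetical corpus counts; counts grow toward the root of the taxonomy.
counts = {"animal": 5000, "mammal": 3000, "dog": 800, "dachshund": 40}
total = counts["animal"]  # the root covers every instance

def ic(concept):
    """IC(C) = -log P(C), with P(C) estimated as count(C) / total."""
    return -math.log(counts[concept] / total)

# More specific concepts are rarer and therefore more informative:
for c in ("animal", "mammal", "dog", "dachshund"):
    print(c, round(ic(c), 3))
```

Note that the root gets IC 0: seeing "some animal" carries no information in this corpus.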
8
Corpus-based + knowledge-based
• (Resnik, 1995):

  sim_res = IC(LCS)

• (Lin, 1998):

  sim_lin = 2 * IC(LCS) / ( IC(concept1) + IC(concept2) )

• (Jiang & Conrath, 1997):

  sim_jnc = 1 / ( IC(concept1) + IC(concept2) - 2 * IC(LCS) )
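All three measures can be sketched directly from IC values; the IC numbers below are invented for illustration. Note that the Jiang & Conrath similarity diverges as the two concepts approach the same node, where the denominator goes to zero:

```python
def sim_res(ic_lcs):
    """Resnik (1995): IC of the least common subsumer."""
    return ic_lcs

def sim_lin(ic1, ic2, ic_lcs):
    """Lin (1998): 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    return 2 * ic_lcs / (ic1 + ic2)

def sim_jcn(ic1, ic2, ic_lcs):
    """Jiang & Conrath (1997), in its inverted (similarity) form."""
    return 1 / (ic1 + ic2 - 2 * ic_lcs)

# Hypothetical values: IC(dog) = 1.8, IC(cat) = 1.9, IC(mammal) = 0.5
print(sim_res(0.5))
print(sim_lin(1.8, 1.9, 0.5))
print(sim_jcn(1.8, 1.9, 0.5))
```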
9
The Vectorial Model and Cosine Similarity
10
Vectorial Similarity Model
• Imagine an N-dimensional space where N is the number of unique words in a pair of texts.
• Each of the two texts can be treated like a vector in this N-dimensional space.
• How close the two vectors are indicates how similar the two texts are.
• The cosine of the angle between the two vectors is the most common similarity measure.
11
Vector space model

Example:
T1 = 2W1 + 3W2 + 5W3
T2 = 3W1 + 7W2 + W3

cos θ = T1·T2 / (|T1| * |T2|) = 0.6758

[Figure: T1 and T2 drawn as vectors in the three-dimensional space spanned by axes W1, W2, W3.]
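The 0.6758 on the slide can be checked in a few lines of Python:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

T1 = [2, 3, 5]  # coefficients of W1, W2, W3
T2 = [3, 7, 1]

print(round(cosine(T1, T2), 4))  # 0.6758
```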
12
Document similarity

Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.
The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.
"There is no need for alarm," Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday.
Cabral said residents of the province of Barahona should closely follow Gilbert's movement.
An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.
Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night.
The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.
The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a "broad area of cloudiness and heavy weather" rotating around the center of the storm.
The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.
Strong winds associated with Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico's south coast.
13
Document Vectors for selected terms
• Document 1: Gilbert: 3, Hurricane: 2, Rains: 1, Storm: 2, Winds: 2
• Document 2: Gilbert: 2, Hurricane: 1, Rains: 0, Storm: 1, Winds: 2
• Cosine similarity: 0.9439
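The 0.9439 on the slide can be reproduced from the term counts (a sketch over only the five selected terms):

```python
import math

doc1 = {"Gilbert": 3, "Hurricane": 2, "Rains": 1, "Storm": 2, "Winds": 2}
doc2 = {"Gilbert": 2, "Hurricane": 1, "Rains": 0, "Storm": 1, "Winds": 2}

def cosine(d1, d2):
    """Cosine similarity between two sparse term-count vectors."""
    terms = set(d1) | set(d2)
    dot = sum(d1.get(t, 0) * d2.get(t, 0) for t in terms)
    n1 = math.sqrt(sum(c * c for c in d1.values()))
    n2 = math.sqrt(sum(c * c for c in d2.values()))
    return dot / (n1 * n2)

print(round(cosine(doc1, doc2), 4))  # 0.9439
```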
14
Problems with the simple model
• Common words inflate the similarity: "The king is here" vs. "The salad is cold". Solution: multiply raw counts by inverse document frequency (idf).
• Ignores semantic similarities: "I own a dog" vs. "I have a pet". Solution: supplement with word similarity.
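A minimal sketch of the idf fix, using hypothetical document frequencies over a collection of N = 1000 documents:

```python
import math

N = 1000  # hypothetical collection size
df = {"the": 1000, "is": 990, "here": 300, "cold": 120, "king": 40, "salad": 25}

def idf(term):
    """idf(t) = log(N / df(t)): frequent terms get weight near zero."""
    return math.log(N / df[term])

# After weighting, "the" and "is" no longer dominate the raw-count vectors:
for t in df:
    print(t, round(idf(t), 2))
```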
15
Problems with the simple model (cont)
• Ignores syntactic relationships: "Mary loves John" vs. "John loves Mary". Solution: perform shallow subject-object-verb (SOV) parsing.
• Ignores semantic frames/roles: "Yahoo bought Flickr" vs. "Flickr was sold to Yahoo". Solution: analyze verb classes.
16
Walk-through example
T1: When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs to him.
T2: When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.
Paraphrase or not? Compare the similarity score against a threshold of 0.5.
17
Walk-through example
• Vector space model: cosine similarity = 0.45 → not a paraphrase

  Text 1      Text 2      maxSim  idf
  defendant   defendant   1.0     3.93
  walked      walked      1.0     1.58
  turned      turned      1.0     0.66
  backs       backs       1.0     2.41
T1: When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs to him.
T2: When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.
18
Walk-through example
• Semantic similarity measure: similarity = 0.80 → paraphrase

  Text 1      Text 2      maxSim  idf
  defendant   defendant   1.0     3.93
  lawyer      attorney    0.9     2.64
  walked      walked      1.0     1.58
  court       courthouse  0.6     1.06
  victims     courthouse  0.4     2.11
  supporters  crowd       0.4     2.15
  turned      turned      1.0     0.66
  backs       backs       1.0     2.41
T1: When the defendant and his lawyer walked into the court, some of the victim supporters turned their backs to him.
T2: When the defendant walked into the courthouse with his attorney, the crowd turned their backs on him.
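The 0.80 score can be approximated as an idf-weighted average of the maxSim values from the slide's table. This is only a sketch over the matched words shown on the slide; the full measure would also average over the words of the second text:

```python
# (word, maxSim, idf) triples for Text 1, taken from the slide's table.
matches = [
    ("defendant", 1.0, 3.93), ("lawyer", 0.9, 2.64),
    ("walked", 1.0, 1.58), ("court", 0.6, 1.06),
    ("victims", 0.4, 2.11), ("supporters", 0.4, 2.15),
    ("turned", 1.0, 0.66), ("backs", 1.0, 2.41),
]

def weighted_sim(triples):
    """Sum of maxSim(w) * idf(w), normalized by the total idf mass."""
    return sum(s * w for _, s, w in triples) / sum(w for _, _, w in triples)

print(round(weighted_sim(matches), 2))  # 0.8, matching the slide
```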
19
Pure Corpus-Based Approaches
20
Corpus-based word semantic similarity
• Information derived exclusively from large corpora
• (Landauer, 1998): Latent Semantic Analysis (LSA); dimensionality reduction through SVD
• (Gabrilovich & Markovitch, 2007): Explicit Semantic Analysis (ESA); uses Wikipedia concepts to define the vector space
21
Latent Semantic Analysis
• Finds words that co-occur within a window of a few words and forms an N x N co-occurrence matrix.
• The matrix is mapped into a k-dimensional space using the singular value decomposition (SVD).
• The technique learns related words because they occur together in shared contexts.
• Problem: the latent dimensions are not well defined (hard to interpret).
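A minimal LSA-style sketch with NumPy (assumed available here), on a made-up term-context co-occurrence matrix; the terms and counts are purely illustrative:

```python
import numpy as np

terms = ["dog", "puppy", "cat", "stock", "market"]
X = np.array([  # toy co-occurrence counts: rows and columns are the terms
    [5, 4, 1, 0, 0],
    [4, 5, 1, 0, 0],
    [1, 1, 5, 0, 0],
    [0, 0, 0, 5, 4],
    [0, 0, 0, 4, 5],
], dtype=float)

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(X)
latent = U[:, :k] * s[:k]  # k-dimensional term representations

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

idx = {t: i for i, t in enumerate(terms)}
print(cos(latent[idx["dog"]], latent[idx["puppy"]]))  # near 1: shared contexts
print(cos(latent[idx["dog"]], latent[idx["stock"]]))  # near 0: disjoint contexts
```

Because "dog"/"puppy" and "stock"/"market" occur in disjoint contexts, the truncated SVD places them along different latent directions.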
22
Explicit Semantic Analysis
• Determine the extent to which each word is associated with every concept of Wikipedia via term frequency or some other method.
• For a text, sum up the associated concept vectors for a composite text concept vector.
• Compare the texts using a standard cosine similarity or other vector similarity measure.
• Advantage: The vectors can be analyzed and tweaked because they are closely tied to Wikipedia concepts.
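A minimal ESA-style sketch of the steps above, assuming hypothetical word-to-concept association scores over three Wikipedia concepts (all names and numbers are invented):

```python
import math

concepts = ["Dog", "Ball game", "Stock market"]
word_vectors = {  # hypothetical association scores, one entry per concept
    "dog":      [9.0, 1.0, 0.0],
    "ball":     [2.0, 8.0, 0.0],
    "labrador": [8.0, 0.5, 0.0],
    "park":     [1.0, 3.0, 0.0],
}

def text_vector(words):
    """Sum the concept vectors of a text's words into one composite vector."""
    v = [0.0] * len(concepts)
    for w in words:
        for j, x in enumerate(word_vectors.get(w, [0.0] * len(concepts))):
            v[j] += x
    return v

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

t1 = text_vector(["dog", "ball"])       # content words of Text 1
t2 = text_vector(["labrador", "park"])  # content words of Text 2

print(round(cosine(t1, t2), 3))  # high: both texts activate the "Dog" concept
```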
23
ESA Example
• Text 1: The dog caught the red ball.
• Text 2: A labrador played in the park.
• Similarity score: 14.38%

  Concept                       T1    T2
  Glossary of cue sports terms  2711  108
  American Football Strategy    402   171
  Baseball                      487   107
  Boston Red Sox                528   74
Why?
• http://en.wikipedia.org/wiki/Glossary_of_cue_sports_terms
25
Automatic Student Answer Grading
26
Class Grading Example
• a variable is a location in memory where a value can be stored
• a named object that can hold a numerical or letter value• it is a location in the computer 's memory where it can be
stored for use by a program• a variable is the memory address for a specific type of
stored data or from a mathematical perspective a symbol representing a fixed definition with changing values
• a location in memory where data can be stored and retrieved
Question: what is a variable?Answer: a location in memory that can store a value
Grader
5
3.5
5
5
5
27
Class Grading Example
Question: what is a variable?
Answer: a location in memory that can store a value

Cosine similarity for the five student answers above, alongside the human grader's scores:

  Answer  Cosine  Grader
  1       0.724   5
  2       0.040   3.5
  3       0.316   5
  4       0.106   5
  5       0.304   5
28
Class Grading Example
Question: what is a variable?
Answer: a location in memory that can store a value

LSA-Wiki similarity for the five student answers above, alongside the human grader's scores:

  Answer  LSA-Wiki  Grader
  1       0.901     5
  2       0.212     3.5
  3       0.869     5
  4       0.536     5
  5       0.839     5
29
Class Grading Example
Question: what is a variable?
Answer: a location in memory that can store a value

ESA similarity for the five student answers above, alongside the human grader's scores:

  Answer  ESA    Grader
  1       0.938  5
  2       0.428  3.5
  3       0.780  5
  4       0.656  5
  5       0.664  5
30
Class Grading Example
Question: what is a variable?
Answer: a location in memory that can store a value

JCN similarity for the five student answers above, alongside the human grader's scores:

  Answer  JCN    Grader
  1       0.768  5
  2       0.413  3.5
  3       0.778  5
  4       0.550  5
  5       0.661  5
31
Some Problems
• Negation and antonymy: "I like pizza" vs. "I don't like pizza"; "I ran the marathon very quickly" vs. "I ran the marathon slowly"
• Semantic role reversal: "Dog bites man" vs. "Man bites dog"
• Logical inconsistency / too much information: "It's raining today" vs. "It's raining today because the sun is out"