LING/C SC 581 (sandiway/ling581-15) · 2015-01-22
LING/C SC 581: Advanced Computational Linguistics
Lecture Notes Jan 22nd
Today's Topics
• Minimum Edit Distance Homework
• Corpora: frequency information
• tregex
Minimum Edit Distance Homework
• Background:
– … about 20% of the time “Britney Spears” is misspelled when people search for it on Google
• Software for generating misspellings
– If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
– http://www.geneffects.com/typopositive/
Minimum Edit Distance Homework
• http://www.google.com/jobs/archive/britney.html
Top six misspellings
• Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible):
– e.g. ED(brittany) < ED(britany)
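As a starting point, a plain weighted edit distance can be sketched as below. The cost values (insert/delete 1, substitute 2) are common textbook defaults, not the weights the homework asks you to tune, and the function name is ours:

```python
# Sketch of weighted minimum edit distance (dynamic programming).
# ins/del/sub costs are illustrative defaults; tuning them is the homework.
def min_edit_distance(source, target, ins=1, dele=1, sub=2):
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,          # deletion
                          d[i][j - 1] + ins,           # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution / match
    return d[n][m]

# rank misspellings against the correct spelling:
for w in ["brittany", "britany"]:
    print(w, min_edit_distance(w, "britney"))
```

Note that with these default weights the two misspellings do not come out in frequency order, which is exactly why the assignment asks you to adjust the weights.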
Minimum Edit Distance Homework
• Submit your homework in PDF
– how many you got right
– explain the criteria, e.g. weights, you chose
• you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well
• due by email to me before next Thursday’s class…
– put your name and 581 at the top of your submission
Part 2
• Corpora: frequency information
• Unlabeled corpus: just words
• Labeled corpus: various kinds …
– POS information
– Information about phrases
– Word sense or Semantic role labeling
(unlabeled corpora are easy to find; labeled corpora are progressively harder to create or obtain)
Language Models and N-grams
• given a word sequence
– w1 w2 w3 ... wn
• chain rule
– how to compute the probability of a sequence of words
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-2wn-1)
• note
– it’s not easy to collect (meaningful) statistics on p(wn|wn-1wn-2...w1) for all possible word sequences
Language Models and N-grams
• Given a word sequence
– w1 w2 w3 ... wn
• Bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov Assumption: finite length history
– 1st order Markov Model
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-3wn-2wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1...wn-2wn-1)
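The bigram approximation can be made concrete with a toy example; all probabilities below are invented numbers, purely to show how the product is formed:

```python
# Bigram (1st-order Markov) approximation on a toy example.
# All probabilities here are made-up illustrative values.
p_uni = {"I": 0.2}
p_bi = {("I", "want"): 0.3, ("want", "to"): 0.6, ("to", "eat"): 0.25}

def bigram_seq_prob(words):
    # p(w1 w2 ... wn) ≈ p(w1) * product of p(wi | wi-1)
    prob = p_uni[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_bi[(prev, cur)]
    return prob

print(bigram_seq_prob(["I", "want", "to", "eat"]))  # 0.2 * 0.3 * 0.6 * 0.25 = 0.009
```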
Language Models and N-grams
• Trigram approximation
– 2nd order Markov Model
– just look at the preceding two words only
– p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1...wn-3wn-2wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2wn-1)
• note
– p(wn|wn-2wn-1) is a lot easier to estimate well than p(wn|w1...wn-2wn-1), but harder than p(wn|wn-1)
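The trigram case follows the same pattern, conditioning on the two preceding words; the probabilities in this sketch are again invented numbers:

```python
# Trigram (2nd-order Markov) approximation on a toy example;
# all probability values are made up for illustration.
p1 = {"I": 0.2}
p2 = {("I", "want"): 0.3}
p3 = {("I", "want", "to"): 0.7, ("want", "to", "eat"): 0.5}

def trigram_seq_prob(words):
    # p(w1) p(w2|w1), then p(wi | wi-2 wi-1) for i >= 3
    prob = p1[words[0]] * p2[(words[0], words[1])]
    for a, b, c in zip(words, words[1:], words[2:]):
        prob *= p3[(a, b, c)]
    return prob

print(trigram_seq_prob(["I", "want", "to", "eat"]))  # 0.2 * 0.3 * 0.7 * 0.5 = 0.021
```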
Language Models and N-grams
• estimating from corpora
– how to compute bigram probabilities
– p(wn|wn-1) = f(wn-1wn) / Σw f(wn-1w), where w ranges over all words
– since Σw f(wn-1w) = f(wn-1), the unigram frequency for wn-1:
– p(wn|wn-1) = f(wn-1wn)/f(wn-1)    (relative frequency)
• Note:
– the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
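MLE bigram estimation is just counting; a minimal sketch on a tiny invented corpus:

```python
from collections import Counter

# MLE: relative frequency f(prev w) / f(prev). The corpus is invented.
corpus = "I want to eat I want to sleep I want food".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    # relative frequency estimate of p(w | prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("want", "I"))   # every "I" is followed by "want": 3/3 = 1.0
print(p_mle("to", "want"))  # "want" is followed by "to" 2 times out of 3
```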
Motivation for smoothing
• Smoothing: avoid zero probability estimates
• Consider: what happens when any individual probability component is zero?
– Arithmetic multiplication law: 0 × X = 0
– very brittle!
• even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
– particularly so for larger n-grams
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Language Models and N-grams
• Example: from unigram frequencies and wn-1wn bigram frequencies to bigram probabilities
– the bigram table (rows wn-1, columns wn) is a sparse matrix
– zeros render probabilities unusable
– (we’ll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams
• sparse dataset means zeros are a problem
– zero probabilities are a problem
• p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
• one zero and the whole product is zero
– zero frequencies are a problem
• p(wn|wn-1) = f(wn-1wn)/f(wn-1)    (relative frequency)
• the bigram f(wn-1wn) doesn’t exist in the dataset
• smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value
Smoothing and N-grams
• Add-One Smoothing (4.5.1 Laplace Smoothing)
– add 1 to all frequency counts
– simple, and no more zeros (but there are better methods)
• unigram
– p(w) = f(w)/N    (before Add-One)
• N = size of corpus
– p(w) = (f(w)+1)/(N+V)    (with Add-One)
– f*(w) = (f(w)+1)N/(N+V)    (with Add-One)
• V = number of distinct words in the corpus
• N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
– p(wn|wn-1) = f(wn-1wn)/f(wn-1)    (before Add-One)
– p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)    (after Add-One)
– f*(wn-1wn) = (f(wn-1wn)+1) f(wn-1)/(f(wn-1)+V)    (after Add-One)
– counts must be rescaled so that total probability mass stays at 1
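The before/after formulas translate directly into code; a minimal sketch (function names are ours):

```python
# Add-one (Laplace) smoothing for bigrams, following the formulas above.
# V = number of distinct word types in the corpus.
def p_add_one(f_bigram, f_prev, V):
    # smoothed conditional probability p(wn|wn-1)
    return (f_bigram + 1) / (f_prev + V)

def adjusted_count(f_bigram, f_prev, V):
    # rescaled count f*, so total probability mass stays at 1
    return (f_bigram + 1) * f_prev / (f_prev + V)

# an unseen bigram (count 0) now gets a small non-zero probability:
print(p_add_one(0, 100, 1616))
```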
Smoothing and N-grams
• Add-One Smoothing
– add 1 to all frequency counts
• bigram
– p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
– f*(wn-1wn) = (f(wn-1wn)+1) f(wn-1)/(f(wn-1)+V)
• frequencies

Bigram frequencies (= figure 6.4):

|         | I  | want | to  | eat | Chinese | food | lunch |
|---------|----|------|-----|-----|---------|------|-------|
| I       | 8  | 1087 | 0   | 13  | 0       | 0    | 0     |
| want    | 3  | 0    | 786 | 0   | 6       | 8    | 6     |
| to      | 3  | 0    | 10  | 860 | 3       | 0    | 12    |
| eat     | 0  | 0    | 2   | 0   | 19      | 2    | 52    |
| Chinese | 2  | 0    | 0   | 0   | 0       | 120  | 1     |
| food    | 19 | 0    | 17  | 0   | 0       | 0    | 0     |
| lunch   | 4  | 0    | 0   | 0   | 0       | 1    | 0     |

Add-One smoothed frequencies (= figure 6.8):

|         | I    | want   | to     | eat    | Chinese | food  | lunch |
|---------|------|--------|--------|--------|---------|-------|-------|
| I       | 6.12 | 740.05 | 0.68   | 9.52   | 0.68    | 0.68  | 0.68  |
| want    | 1.72 | 0.43   | 337.76 | 0.43   | 3.00    | 3.86  | 3.00  |
| to      | 2.67 | 0.67   | 7.35   | 575.41 | 2.67    | 0.67  | 8.69  |
| eat     | 0.37 | 0.37   | 1.10   | 0.37   | 7.35    | 1.10  | 19.47 |
| Chinese | 0.35 | 0.12   | 0.12   | 0.12   | 0.12    | 14.09 | 0.23  |
| food    | 9.65 | 0.48   | 8.68   | 0.48   | 0.48    | 0.48  | 0.48  |
| lunch   | 1.11 | 0.22   | 0.22   | 0.22   | 0.22    | 0.44  | 0.22  |

Remarks: perturbation problem; add-one causes large changes in some frequencies due to the relative size of V (1616), e.g. want to: 786 ⇒ 338
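The 786 ⇒ 338 perturbation for the (want, to) cell can be reproduced from the adjusted-count formula. V = 1616 is given on the slide; the unigram count f(want) = 1215 used here is back-derived from the table values, so treat it as an assumption:

```python
# f*(want to) = (f + 1) * f(want) / (f(want) + V)
# Assumption: f(want) = 1215 (back-derived from the slide); V = 1616 is given.
f_bigram, f_want, V = 786, 1215, 1616
f_star = (f_bigram + 1) * f_want / (f_want + V)
print(round(f_star, 2))  # 337.76
```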
Smoothing and N-grams
• Add-One Smoothing
– add 1 to all frequency counts
• bigram
– p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
– f*(wn-1wn) = (f(wn-1wn)+1) f(wn-1)/(f(wn-1)+V)
• Probabilities

Bigram probabilities before smoothing (= figure 6.5):

|         | I       | want    | to      | eat     | Chinese | food    | lunch   |
|---------|---------|---------|---------|---------|---------|---------|---------|
| I       | 0.00233 | 0.31626 | 0.00000 | 0.00378 | 0.00000 | 0.00000 | 0.00000 |
| want    | 0.00247 | 0.00000 | 0.64691 | 0.00000 | 0.00494 | 0.00658 | 0.00494 |
| to      | 0.00092 | 0.00000 | 0.00307 | 0.26413 | 0.00092 | 0.00000 | 0.00369 |
| eat     | 0.00000 | 0.00000 | 0.00213 | 0.00000 | 0.02026 | 0.00213 | 0.05544 |
| Chinese | 0.00939 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.56338 | 0.00469 |
| food    | 0.01262 | 0.00000 | 0.01129 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| lunch   | 0.00871 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00218 | 0.00000 |

Bigram probabilities after Add-One (= figure 6.7):

|         | I       | want    | to      | eat     | Chinese | food    | lunch   |
|---------|---------|---------|---------|---------|---------|---------|---------|
| I       | 0.00178 | 0.21532 | 0.00020 | 0.00277 | 0.00020 | 0.00020 | 0.00020 |
| want    | 0.00141 | 0.00035 | 0.27799 | 0.00035 | 0.00247 | 0.00318 | 0.00247 |
| to      | 0.00082 | 0.00021 | 0.00226 | 0.17672 | 0.00082 | 0.00021 | 0.00267 |
| eat     | 0.00039 | 0.00039 | 0.00117 | 0.00039 | 0.00783 | 0.00117 | 0.02075 |
| Chinese | 0.00164 | 0.00055 | 0.00055 | 0.00055 | 0.00055 | 0.06616 | 0.00109 |
| food    | 0.00641 | 0.00032 | 0.00577 | 0.00032 | 0.00032 | 0.00032 | 0.00032 |
| lunch   | 0.00241 | 0.00048 | 0.00048 | 0.00048 | 0.00048 | 0.00096 | 0.00048 |

Remarks: perturbation problem; similar changes in the probabilities
Smoothing and N-grams
• let’s illustrate the problem
– take the bigram case: wn-1wn
– p(wn|wn-1) = f(wn-1wn)/f(wn-1)
– suppose there are m cases wn-1wzero1, ..., wn-1wzerom that don’t occur in the corpus
– the probability mass f(wn-1) is divided among the observed bigram counts f(wn-1wn); the unseen bigrams get f(wn-1wzero1) = 0, ..., f(wn-1wzerom) = 0
Smoothing and N-grams
• add-one
– “give everyone 1”
– the probability mass f(wn-1) is now shared among the counts f(wn-1wn)+1, with f(wn-1w01) = 1, ..., f(wn-1w0m) = 1
Smoothing and N-grams
• add-one
– “give everyone 1”, with V = |{wi}|, the number of word types
• redistribution of probability mass
– p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V)
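That the redistributed mass still sums to one can be checked directly; the four-word vocabulary and the counts below are an invented example:

```python
# Check (illustrative): after add-one, p(w|prev) summed over the whole
# vocabulary equals 1. Vocabulary and counts are made up.
vocab = ["a", "b", "c", "d"]
counts = {("a", "b"): 3, ("a", "c"): 1}   # f(a b)=3, f(a c)=1; rest unseen
f_prev = sum(counts.values())             # f(a) = 4
V = len(vocab)

total = sum((counts.get(("a", w), 0) + 1) / (f_prev + V) for w in vocab)
print(total)  # 1.0
```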
Smoothing and N-grams
• Good-Turing Discounting (4.5.2)
– Nc = number of things (= n-grams) that occur c times in the corpus
– N = total number of things seen
– Formula: the smoothed count c* for things with frequency c is given by c* = (c+1)Nc+1/Nc
– Idea: use the frequency of things seen once to estimate the frequency of things we haven’t seen yet: estimate N0 in terms of N1, and so on; but if Nc = 0, smooth that first using something like log(Nc) = a + b log(c)
– Formula: P*(things with zero frequency) = N1/N
– smaller impact than Add-One
• Textbook Example:
– Fishing in a lake with 8 species
• bass, carp, catfish, eel, perch, salmon, trout, whitefish
– Sample data (6 out of 8 species):
• 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
– P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17
– P(next fish = trout) = 1/18
• (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula…)
– revised count for trout: c*(trout) = 2·N2/N1 = 2(1/3) = 0.67 (discounted from 1)
– revised P(next fish = trout) = 0.67/18 = 0.037
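The fishing numbers can be reproduced mechanically; a sketch:

```python
from collections import Counter

# Good-Turing on the fishing sample: 10 carp, 3 perch, 2 whitefish,
# 1 trout, 1 salmon, 1 eel  (18 fish, 6 of 8 species seen)
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                 # 18 fish seen
Nc = Counter(counts.values())            # N1 = 3, N2 = 1, N3 = 1, N10 = 1

p_unseen = Nc[1] / N                     # N1/N = 3/18
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # c* = (c+1) Nc+1 / Nc = 2/3
p_trout = c_star_trout / N               # revised P(next fish = trout)

print(round(p_unseen, 2), round(c_star_trout, 2), round(p_trout, 3))
```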
Language Models and N-grams
• N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
– N-gram models can also incorporate word classes, e.g. POS labels, when available
Language Models and N-grams
• N-gram models
– data is easy to obtain
• any unlabeled corpus will do
– they’re technically easy to compute
• count frequencies and apply the smoothing formula
– but just how good are these n-‐gram language models?
– and what can they show us about language?
Language Models and N-grams
• approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
• Unigram (pick random, unconnected words)
• Bigram
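A random sentence generator of this sort can be sketched as follows; the toy corpus stands in for the Shakespeare data, and sampling in proportion to bigram frequency is one straightforward design choice:

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for the Shakespeare corpus.
corpus = "to be or not to be that is the question".split()

# collect bigram counts: for each word, its observed continuations
nexts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    nexts[prev][cur] += 1

def generate(start, length=8, seed=0):
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        options = nexts[words[-1]]
        if not options:              # dead end: no observed continuation
            break
        # sample the next word in proportion to its bigram frequency
        words.append(rng.choices(list(options), weights=list(options.values()))[0])
    return " ".join(words)

print(generate("to"))
```

With so small a corpus the generator quickly loops or dead-ends, which previews the dataset-size remarks on the next slide.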
Language Models and N-grams
• Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
• Trigram
• Quadrigram
Remarks: dataset size problem; the training set is small: 884,647 words, 29,066 different words
29,066² = 844,832,356 possible bigrams; for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
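The quoted figure for the number of possible bigrams is just the square of the vocabulary size:

```python
# 29,066 word types give 29,066^2 possible bigrams
print(29066 ** 2)  # 844832356
```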
Language Models and N-grams
• A limitation:
– produces ungrammatical sequences
• Treebank:
– potential to be a better language model
– Structural information:
• contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English …
Language Models and N-grams
• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
Part 3
tregex
• I assume everyone has:
1. Installed Penn Treebank v3
2. Downloaded and installed tregex
Trees in the Penn Treebank
Notation: LISP S-expression
Directory: TREEBANK_3/parsed/mrg/
tregex
• Search Example: << dominates, < immediately dominates
tregex Help
tregex Help
tregex
• Help: tregex expression syntax is non-standard wrt bracketing
S < VP
S < NP
tregex
• Help: tregex boolean syntax is also non-standard
tregex
• Help
tregex
• Help
tregex
• Pattern:
– (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
Key: <, first child; $+ immediate left sister; <- last child; =comma refers back to the same node
tregex
• Help
tregex
tregex
• Different results from:
– @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
tregex
Example: WHADVP also possible (not just WHNP)
Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide
Treebank Guides
• Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages):
tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides
• Parsing guide 1, prsguid1.pdf (318 pages):
prsguid2.pdf: addendum for the Switchboard corpus