CSE467/567 Computational Linguistics
Carl Alphonce
cse-467-alphonce@cse.buffalo.edu
Computer Science & Engineering
University at Buffalo
Fall 2006 CSE 467/567
Levels of processing
phonetics/phonology – sounds
morphology – word structure
syntax – sentence structure
semantics – meaning
pragmatics – goals of language use
discourse – utterances in context
Words: the building blocks of sentences
[S
  [NP [D the] [N dog]]
  [VP [V chased]
      [NP [D the] [N cat]]]]
Words have internal structure
readable = read + able
readability = read + able + ity
the structure of words can be described using a regular grammar
Chomsky hierarchy
regular languages
context-free languages
context-sensitive languages
unrestricted languages
Problem
I often need to find an e-mail, but I have thousands of e-mails in my various folders. Suppose I want to find an e-mail about geese. The e-mail may mention “geese” or “goose”; also, if it appears at the start of a sentence, its initial letter will be capitalized. Need to match “goose”, “geese”, “Goose” or “Geese”.
Regular expressions (in Perl)
“a regular expression is an algebraic notation for characterizing a set of strings” [p. 22]
Regular expressions are commonly used to specify search strings. For example, the UNIX utility program grep lets the user specify a pattern to search for in files.
Sequences of characters
Matching a sequence of characters: /…/
Examples:
/a/ matches the character ‘a’
/fred/ matches the string ‘fred’
Note: /fred/ does not match the string ‘Fred’!
In other words, patterns are case-sensitive.
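A minimal sketch of case sensitivity, written in Python rather than Perl on the assumption that Python's `re` module treats this pattern the same way. The sample strings are made up for illustration.

```python
import re

# A plain sequence of characters matches only that exact sequence.
pattern = re.compile(r"fred")

# search() looks for the pattern anywhere in the string.
found_lower = pattern.search("fred is home") is not None
found_upper = pattern.search("Fred is home") is not None

print(found_lower)  # True
print(found_upper)  # False: 'F' and 'f' are different characters
```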
Character disjunction (character classes)
Square brackets are used to indicate disjunction of characters.
Examples:
/[Ff]/ matches either ‘f’ or ‘F’
/[Ff]red/ matches either ‘fred’ or ‘Fred’
This form of disjunction applies only at the character level. A set of characters in square brackets is sometimes referred to as a character class.
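The character class can be tried out with a short sketch in Python's `re` module, assuming it reads square brackets as Perl does. The word list is illustrative only.

```python
import re

# [Ff] stands for exactly one character: either 'F' or 'f'.
pattern = re.compile(r"[Ff]red")

candidates = ["fred", "Fred", "FRED", "dred"]
# fullmatch() requires the whole string to match the pattern.
matches = [w for w in candidates if pattern.fullmatch(w)]

print(matches)  # ['fred', 'Fred']
```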
Ranges
Sometimes it is useful to specify “any digit” or “any letter”.
“Any digit” can be written as /[0123456789]/, since any of the ten digits satisfies the pattern.
An alternative is to use a special range notation: /[0-9]/
Any letter can be specified as /[A-Za-z]/
Range notation does not extend the power of regular expressions, but gives us a convenient way to express them.
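A sketch, in Python's `re` module (assumed here to follow the Perl notation), that checks the claim that a range is only shorthand: the explicit and ranged digit classes agree on every printable ASCII character. The variable names are illustrative.

```python
import re
import string

explicit = re.compile(r"[0123456789]")
ranged = re.compile(r"[0-9]")
letters = re.compile(r"[A-Za-z]")

# The two digit patterns accept exactly the same single characters.
same = all(
    bool(explicit.fullmatch(c)) == bool(ranged.fullmatch(c))
    for c in string.printable
)

# Keep only the letters from a mixed string.
letter_hits = "".join(c for c in "a1B2-z" if letters.fullmatch(c))

print(same)         # True
print(letter_hits)  # aBz
```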
Complementing character classes
To search for a character that is not in a character class, place the caret (^) immediately after the opening square bracket of the class.
Examples:
/[^a]/ matches any single character except ‘a’
/[^0-9]/ matches any single character except a digit
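A brief sketch of complemented classes using Python's `re.findall`, assuming the same bracket syntax as Perl. The input strings are made up.

```python
import re

# findall() returns every non-overlapping match, left to right.
non_digits = re.findall(r"[^0-9]", "a1b22c")
not_a = re.findall(r"[^a]", "banana")

print(non_digits)  # ['a', 'b', 'c']
print(not_a)       # ['b', 'n', 'n']
```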
Matching 0 or 1 occurrence
The ‘?’ matches zero or one occurrence of the preceding expression.
Examples:
/a?/ matches ‘a’ or ‘’ (nothing)
/cats?/ matches ‘cat’ or ‘cats’
Note that the “preceding expression”, in these examples, is a single letter. We’ll see how to form longer expressions later.
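A sketch of the optional marker in Python's `re` module (assumed equivalent to Perl for this construct). The test words are illustrative.

```python
import re

# The '?' applies only to the 's' immediately before it.
pattern = re.compile(r"cats?")

results = {w: bool(pattern.fullmatch(w)) for w in ["cat", "cats", "catss", "ca"]}

print(results)
# {'cat': True, 'cats': True, 'catss': False, 'ca': False}
```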
The Kleene star and plus
The Kleene star (*) matches zero or more occurrences of the preceding expression.
Examples:
/a*/ matches ‘’, ‘a’, ‘aa’, ‘aaa’, etc.
/[ab]*/ matches ‘’, ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’, etc.
+ matches one or more occurrences
+ is not necessary: /[ab]+/ is equivalent to /[ab][ab]*/
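The equivalence of /[ab]+/ and /[ab][ab]*/ can be spot-checked with a sketch in Python's `re` module, assuming it shares Perl's quantifier semantics. The sample strings are illustrative, and a finite sample is of course not a proof.

```python
import re

star = re.compile(r"[ab]*")
plus = re.compile(r"[ab]+")
expanded = re.compile(r"[ab][ab]*")

samples = ["", "a", "b", "ab", "ba", "abba", "abc"]

star_ok = [s for s in samples if star.fullmatch(s)]
plus_ok = [s for s in samples if plus.fullmatch(s)]
expanded_ok = [s for s in samples if expanded.fullmatch(s)]

print(star_ok)                 # ['', 'a', 'b', 'ab', 'ba', 'abba']
print(plus_ok)                 # ['a', 'b', 'ab', 'ba', 'abba']
print(plus_ok == expanded_ok)  # True
```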
Wildcard
The period (.) matches any single character except the newline (\n).
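A sketch of the wildcard in Python's `re` module, whose default behaviour for the period is assumed to match the description above. The pattern /c.t/ and the sample strings are made up.

```python
import re

# '.' stands for exactly one character, but not a newline.
pattern = re.compile(r"c.t")

samples = ["cat", "cot", "c t", "ct", "c\nt"]
hits = [s for s in samples if pattern.fullmatch(s)]

print(hits)  # ['cat', 'cot', 'c t']
```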
Anchors
Anchors are used to restrict a match to a particular position within a string.
^ anchors to the start of a string
$ anchors to the end of a string
/[Ff]red/ matches both ‘Fred’ and ‘Fred is home’
/^[Ff]red$/ matches ‘Fred’ but not ‘Fred is home’
\b anchors to a word boundary
\B anchors to a non-boundary
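A sketch of the four anchors in Python's `re` module, assumed to behave like Perl for these inputs. The "cat" examples are illustrative additions.

```python
import re

unanchored = re.compile(r"[Ff]red")
anchored = re.compile(r"^[Ff]red$")
whole_word = re.compile(r"\bcat\b")
inside_word = re.compile(r"\Bcat\B")

checks = {
    "unanchored in sentence": bool(unanchored.search("Fred is home")),  # True
    "anchored in sentence": bool(anchored.search("Fred is home")),      # False
    "anchored alone": bool(anchored.search("Fred")),                    # True
    "word in sentence": bool(whole_word.search("the cat sat")),         # True
    "word inside word": bool(whole_word.search("concatenate")),         # False
    "non-boundary inside word": bool(inside_word.search("concatenate")),  # True
}

for name, result in checks.items():
    print(name, result)
```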
Conjunction
Two regular expressions are conjoined by juxtaposition (placing the expressions side by side).
Examples:
/a/ matches ‘a’
/m/ matches ‘m’
/am/ matches ‘am’ but not ‘a’ or ‘m’ alone
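The same three cases, sketched in Python's `re` module (assumed to treat juxtaposition as Perl does):

```python
import re

# Two patterns side by side must match one after the other.
pattern = re.compile(r"am")

results = [bool(pattern.fullmatch(s)) for s in ["am", "a", "m"]]

print(results)  # [True, False, False]
```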
Disjunction
We have already seen disjunction of characters using the square bracket notation.
General disjunction is expressed using the vertical bar (|), also called the pipe symbol.
This form of disjunction matches any one of the alternative patterns, which may be whole expressions rather than the single characters allowed inside [ ].
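A sketch of general disjunction in Python's `re` module, assuming the vertical bar works as in Perl. The pattern /goose|geese/ and the word list are illustrative.

```python
import re

# Each side of '|' is a complete alternative pattern.
pattern = re.compile(r"goose|geese")

candidates = ["goose", "geese", "gaose", "moose"]
hits = [w for w in candidates if pattern.fullmatch(w)]

print(hits)  # ['goose', 'geese']
```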
Grouping
Parentheses, ‘(’ and ‘)’, are used to group subpatterns of a larger pattern.
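Ex: /[Gg](ee|oo)se/ matches ‘goose’, ‘geese’, ‘Goose’ and ‘Geese’
The parentheses limit the scope of |; without them, /[Gg]ee|oose/ would be read as ‘[Gg]ee’ or ‘oose’.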
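A sketch of how grouping interacts with the vertical bar, in Python's `re` module (assumed to share Perl's precedence rules). The bar binds more loosely than juxtaposition, so the parentheses must enclose the whole alternation. The names `grouped` and `ungrouped` are illustrative.

```python
import re

# Parentheses around the alternation: G or g, then ee or oo, then se.
grouped = re.compile(r"[Gg](ee|oo)se")
# Parentheses around each side only: read as '[Gg]ee' OR 'oose'.
ungrouped = re.compile(r"[Gg](ee)|(oo)se")

words = ["goose", "geese", "Goose", "Geese"]

grouped_hits = [w for w in words if grouped.fullmatch(w)]
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]
stray_hits = [s for s in ["Gee", "oose"] if ungrouped.fullmatch(s)]

print(grouped_hits)    # ['goose', 'geese', 'Goose', 'Geese']
print(ungrouped_hits)  # []
print(stray_hits)      # ['Gee', 'oose']
```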
Replacement
In addition to matching, we can do replacements when a match is found:
Example: To replace the British spelling ‘colour’ with the American spelling ‘color’, we can write:
s/colour/color/
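A sketch of the same replacement using Python's `re.sub`, chosen here as a stand-in for the Perl operator. One difference to keep in mind: Perl's s/// without the g modifier replaces only the first match, while `re.sub` replaces every match unless `count` is given. The sample sentence is made up.

```python
import re

text = "The colour of the sky; the colour of the sea."

# count=1 mirrors s/colour/color/ (first match only).
first_only = re.sub(r"colour", "color", text, count=1)
# No count mirrors s/colour/color/g (every match).
all_matches = re.sub(r"colour", "color", text)

print(first_only)   # The color of the sky; the colour of the sea.
print(all_matches)  # The color of the sky; the color of the sea.
```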
Registers – saving matches
To save the text matched by part of a pattern for reuse later on, Perl provides registers.
Registers are named \#, where # is the number of the register; register n holds the text matched by the n-th parenthesized group.
Ex.
DE DO DO DO DE DA DA DA
IS ALL I WANT TO SAY TO YOU
/(D[AEO].)*/ will match the first line
/(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3/ matches it more specifically
This pattern also matches strings like DA DE DE DE DA DO DO DO
\s matches a whitespace character
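A sketch of registers (backreferences) in Python's `re` module, which is assumed to number groups and interpret \1, \2, \3 as Perl does. The pattern is written with no literal spaces between its parts, on the assumption that each saved group is meant to carry its own leading separator. The `broken` string is a made-up counterexample.

```python
import re

pattern = re.compile(r"(D[AEO])(.D[AEO])\2\2\s\1(.D[AEO])\3\3")

lyric = "DE DO DO DO DE DA DA DA"
variant = "DA DE DE DE DA DO DO DO"
broken = "DE DO DO DA DE DA DA DA"  # third syllable does not repeat group 2

match = pattern.fullmatch(lyric)
saved = match.groups()  # text held in registers 1, 2 and 3

print(saved)                             # ('DE', ' DO', ' DA')
print(bool(pattern.fullmatch(variant)))  # True
print(bool(pattern.fullmatch(broken)))   # False
```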
For more information
Perl regular expression tutorial (perlretut) – http://perldoc.perl.org/perlretut.html
Perl regular expression reference page (perlre) – http://perldoc.perl.org/perlre.html