Scanning & Regular Expressions
CPSC 388, Ellen Walker, Hiram College
Scanning
• Input: characters from the source code
• Output: Tokens
– Keywords: IF, THEN, ELSE, FOR …
– Symbols: PLUS, LBRACE, SEMI …
– Variable tokens: ID, NUM (augmented with a string or numeric value)
TokenType
• Enumerated type (a C++ construct)
typedef enum {IF, THEN, ELSE …} TokenType;
• IF, THEN, ELSE (etc.) are now constants of type TokenType
Using TokenType
void someFun(TokenType tt){
  …
  switch (tt){
    case IF: … break;
    case THEN: … break;
    …
  }
}
Token Class (partial)
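class Token {
public:
  TokenType tokenval; // which kind of token this is
  string tokenchars; // the characters that make up the token
  double numval; // numeric value (for NUM tokens)
};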
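The enumerated type, the switch, and the partial Token class can be put together in one small sketch. The enumerator list below is only the subset named on the slides, and the constructor and the tokenTypeName helper are hypothetical additions for illustration, not part of the course code.

```cpp
#include <string>

// Subset of the token types named on the slides; a real scanner lists more.
enum TokenType { IF, THEN, ELSE, FOR, PLUS, LBRACE, SEMI, ID, NUM };

// Partial Token class from the slide, plus an assumed convenience constructor.
class Token {
public:
    TokenType tokenval;      // which kind of token this is
    std::string tokenchars;  // the characters that were scanned
    double numval;           // numeric value, meaningful only for NUM

    Token(TokenType t, const std::string &chars, double n = 0.0)
        : tokenval(t), tokenchars(chars), numval(n) {}
};

// Hypothetical helper: maps a TokenType to a printable name with a switch.
std::string tokenTypeName(TokenType tt) {
    switch (tt) {
        case IF:     return "IF";
        case THEN:   return "THEN";
        case ELSE:   return "ELSE";
        case FOR:    return "FOR";
        case PLUS:   return "PLUS";
        case LBRACE: return "LBRACE";
        case SEMI:   return "SEMI";
        case ID:     return "ID";
        case NUM:    return "NUM";
    }
    return "UNKNOWN";  // reached only if tt holds a value outside the enum
}
```

For example, Token t(NUM, "42", 42.0) describes a number token, and tokenTypeName(t.tokenval) gives "NUM".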
Interlude: References and Pointers
• Java has primitives and references
– Primitives are int, char, double, etc.
– References “point to” objects
• C++ has only primitives
– But one of the primitives is “address”, which serves the purpose of a reference.
Interlude: References and Pointers
• To declare a pointer, put * after the type
char x; // a character
char *y; // a pointer to a character
• Using pointers:
x = 'a';
y = &x; // y gets the address of x
*y = 'b'; // thing pointed at by y becomes 'b'
// note that x is now also 'b'!
Interlude: References and Pointers
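• Continuing the example…
cout << x << endl; // prints b
cout << *y << endl; // prints b
cout << (void *)y << endl; // prints a hex address
cout << (void *)&x << endl; // same as above
cout << &y << endl; // a different address - where the pointer is stored
• Note: the (void *) cast is needed because cout treats a plain char * as a C string and prints characters, not the address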
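The pointer example can be collected into a small sketch that is easy to check. The function names pointerDemo and addressText are hypothetical. One caveat worth showing in code: streaming a char * directly makes the stream treat it as a C string, so the second helper casts to a void pointer to get the address as text.

```cpp
#include <sstream>
#include <string>

// Repeats the slide's assignments and returns what x and *y hold afterwards.
std::string pointerDemo() {
    char x;    // a character
    char *y;   // a pointer to a character
    x = 'a';
    y = &x;    // y gets the address of x
    *y = 'b';  // the thing pointed at by y becomes 'b', so x is 'b' too

    std::ostringstream out;
    out << x << *y;  // both reads see the same character
    return out.str();
}

// Formats a char pointer as an address instead of as a C string.
std::string addressText(const char *p) {
    std::ostringstream out;
    out << static_cast<const void *>(p);
    return out.str();
}
```

Calling addressText(y) and addressText(&x) in the example gives the same text, since both name the location of x.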
GetToken(): A scanning function
• Token *GetToken(istream &sin)
– Read characters from sin until a complete token is extracted, then return (a pointer to) the token
– Usually called by the parser
– Note: the version in the book uses global variables and returns only the token type
Using GetToken
Token *myToken = GetToken(cin);
while (myToken != NULL){
  // process the token
  switch (myToken->tokenval){
    // cases for each token type
  }
  myToken = GetToken(cin);
}
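A minimal sketch of what GetToken might look like, assuming a tiny token set chosen only for illustration (the enumerators, the UNKNOWN catch-all, and the scanTypes driver are not from the slides). It keeps the raw-pointer interface shown above, returning a null pointer at end of input; the caller is responsible for deleting each token.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Assumed token set, just large enough for a simple "for" header.
enum TokenType { FOR, IF, ID, NUM, LPAREN, RPAREN, LBRACE, RBRACE,
                 SEMI, ASSIGN, LT, PLUS, UNKNOWN };

class Token {
public:
    TokenType tokenval;
    std::string tokenchars;
    double numval;
};

// Reads characters from sin until one complete token has been extracted.
// Returns a heap-allocated Token, or nullptr when the input is exhausted.
Token *GetToken(std::istream &sin) {
    const int END = std::char_traits<char>::eof();
    int c = sin.get();
    while (c != END && std::isspace(c)) c = sin.get();  // skip whitespace
    if (c == END) return nullptr;

    std::string chars(1, static_cast<char>(c));
    if (std::isalpha(c)) {  // keyword or identifier: letter, then letters/digits
        while (std::isalnum(sin.peek())) chars += static_cast<char>(sin.get());
        TokenType t = ID;
        if (chars == "for") t = FOR;
        else if (chars == "if") t = IF;
        return new Token{t, chars, 0.0};
    }
    if (std::isdigit(c)) {  // unsigned integer literal
        while (std::isdigit(sin.peek())) chars += static_cast<char>(sin.get());
        return new Token{NUM, chars, std::stod(chars)};
    }
    TokenType t = UNKNOWN;  // single-character symbols
    switch (c) {
        case '(': t = LPAREN; break;
        case ')': t = RPAREN; break;
        case '{': t = LBRACE; break;
        case '}': t = RBRACE; break;
        case ';': t = SEMI;   break;
        case '=': t = ASSIGN; break;
        case '<': t = LT;     break;
        case '+': t = PLUS;   break;
    }
    return new Token{t, chars, 0.0};
}

// Driver in the style of the loop above: collects the type of every token.
std::vector<TokenType> scanTypes(const std::string &source) {
    std::istringstream in(source);
    std::vector<TokenType> types;
    Token *tok = GetToken(in);
    while (tok != nullptr) {
        types.push_back(tok->tokenval);
        delete tok;  // the caller owns each returned token
        tok = GetToken(in);
    }
    return types;
}
```

In this sketch "int" is scanned as an ID and "++" as two PLUS tokens, because the assumed token set has no type keywords or multi-character operators. Returning std::unique_ptr<Token> would be the more idiomatic modern choice; the raw pointer is kept only to match the interface on the slide.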
Result of GetToken
for (int i = 0 ; i < 100 ; i++){
TokenType: FOR
TokenType: LPAREN
Tokens and Languages
• The set of valid tokens of a particular type is a Language (in the formal sense)
• More specifically, it is a Regular Language
Language Formalities
• Language: set of strings
• String: sequence of symbols
• Alphabet: set of legal symbols for strings
– Generally Σ is used to denote an alphabet
Example Languages
• L1 = {aa, ab, bb}, Σ = {a, b}
• L2 = {ε, ab, abab, …}, Σ = {a, b}
• L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
• L4 = {ε} (one string with no symbols)
• L5 = { } (no strings at all)
• L5 = Ø
Denoting Languages
• Expressions (regular languages only)
• Grammars
– Set of rewrite rules that express all and only the strings in the language
• Automata
– Machines that “accept” all and only the strings in the language
Primitive Regular Expressions
• Ø
– L(Ø) = { } (no strings)
• ε
– L(ε) = {ε} (one string, no symbols)
• a, where a is a member of Σ
– L(a) = {a} (one string, one symbol)
Combining Regular Expressions
• Choice: r | s (sometimes r+s)
– L(r | s) = L(r) ∪ L(s)
• Concatenation: rs
– L(rs) = L(r)L(s)
– All combinations of one string from L(r) followed by one string from L(s)
• Repetition: r*
– L(r*) = L(ε) ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ …
– 0 or more strings from r concatenated
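The three combining operations can be tried out directly on small finite languages stored as sets of strings. This is only a sketch: the names Language, choice, concat, and starUpTo are made up for illustration, and because r* is usually infinite the repetition helper stops after a chosen number of repetitions.

```cpp
#include <set>
#include <string>

// A language is a set of strings; only finite languages fit in a std::set.
using Language = std::set<std::string>;

// Choice: L(r | s) is the union of the two languages.
Language choice(const Language &r, const Language &s) {
    Language result = r;
    result.insert(s.begin(), s.end());
    return result;
}

// Concatenation: every string of r followed by every string of s.
Language concat(const Language &r, const Language &s) {
    Language result;
    for (const std::string &x : r)
        for (const std::string &y : s)
            result.insert(x + y);
    return result;
}

// Repetition, truncated: strings built from 0..maxReps strings of r.
Language starUpTo(const Language &r, int maxReps) {
    Language result = {""};  // zero repetitions gives the empty string
    Language power = {""};   // strings built from exactly i pieces of r
    for (int i = 1; i <= maxReps; ++i) {
        power = concat(power, r);
        result.insert(power.begin(), power.end());
    }
    return result;
}
```

With r = {ab}, starUpTo(r, 2) yields the empty string, ab, and abab, which are the first three members of (ab)*.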
Precedence
• Repetition before concatenation
• Concatenation before choice
• Use parentheses to override
• aa* vs. (aa)*
• ab|c vs. a(b|c)
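The two precedence pairs can be checked with the standard library's std::regex, whose default (ECMAScript) syntax gives *, concatenation, and | the same relative precedence as described here. The inLanguage helper name is an assumption for illustration; std::regex_match succeeds only when the whole string matches.

```cpp
#include <regex>
#include <string>

// True when the whole string s is in the language of the expression re.
bool inLanguage(const std::string &re, const std::string &s) {
    return std::regex_match(s, std::regex(re));
}

// aa*    is a followed by (a*):    accepts "a", "aa", "aaa", ...
// (aa)*  repeats the pair aa:      accepts "", "aa", "aaaa", ...
// ab|c   is (ab) or c:             accepts "ab" and "c"
// a(b|c) is a followed by b or c:  accepts "ab" and "ac"
```

So inLanguage("aa*", "aaa") is true while inLanguage("(aa)*", "aaa") is false, and "c" is accepted by ab|c but not by a(b|c).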
R.E.’s for Examples
• L1 = aa | ab | bb
• L1 = a(a|b) | bb
• L1 = aa | (a|b)b
• L2 = (ab)*, not ab*!
• L3 = a(aa)*
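One way to convince yourself that the three expressions for L1 describe the same language is to enumerate every short string over the alphabet and keep the ones each expression accepts. The sketch below does that with std::regex; matchesUpTo is a hypothetical helper, and it only compares strings up to a chosen length, so agreement is evidence rather than proof. Patterns are written without spaces because std::regex treats a space as a literal character.

```cpp
#include <regex>
#include <set>
#include <string>
#include <vector>

// Returns every string over `alphabet` with length <= maxLen that the
// whole pattern matches.
std::set<std::string> matchesUpTo(const std::string &pattern,
                                  const std::string &alphabet,
                                  int maxLen) {
    const std::regex re(pattern);
    std::set<std::string> accepted;
    std::vector<std::string> level = {""};  // all strings of the current length
    for (int len = 0; len <= maxLen; ++len) {
        std::vector<std::string> next;
        for (const std::string &s : level) {
            if (std::regex_match(s, re)) accepted.insert(s);
            for (char c : alphabet) next.push_back(s + c);  // extend by one symbol
        }
        level = next;
    }
    return accepted;
}
```

For instance, matchesUpTo("aa|ab|bb", "ab", 3), matchesUpTo("a(a|b)|bb", "ab", 3), and matchesUpTo("aa|(a|b)b", "ab", 3) all return {aa, ab, bb}, while (ab)* and ab* already differ on strings of length 1.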
What are these languages?
• a* | b* | c*
• a*b*c*
• (a*b*)*
• a(a|b)*c
• (a|b|c)*bab(a|b|c)*
What are the RE’s?
• In the alphabet {a,b,c}:
– All strings that are in alphabetical order
– All strings that have the first a before the first b, before the first c, e.g. ababbabca
– All strings that contain “abc”
– All strings that do not contain “abc”
Extended Reg. Exp’s
• Additional operations for convenience
– r+ = rr* (one or more reps)
– . = any character in the alphabet
– .* = any possible string from the alphabet
– [a-z] = a|b|c|…|z
– [^aeiou] = b|c|d|f|g|h|j...
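Since the extended operations are only shorthand, each can be expanded back into the basic operators. The sketch below builds the explicit choice for a range and for a negated set, then uses std::regex to compare the shorthand with its expansion. The helper names are hypothetical, and the expansions assume the alphabet contains only letters (no regex metacharacters). Note that in std::regex, [^aeiou] is taken relative to all characters, not just a to z.

```cpp
#include <regex>
#include <string>

// Expands [lo-hi] into the explicit choice lo|...|hi.
std::string expandRange(char lo, char hi) {
    std::string out;
    for (int c = lo; c <= hi; ++c) {
        if (!out.empty()) out += '|';
        out += static_cast<char>(c);
    }
    return out;
}

// Expands [^excluded] relative to a given alphabet.
std::string expandNegated(const std::string &alphabet,
                          const std::string &excluded) {
    std::string out;
    for (char c : alphabet) {
        if (excluded.find(c) != std::string::npos) continue;  // skip excluded symbols
        if (!out.empty()) out += '|';
        out += c;
    }
    return out;
}

// True when the whole text matches the pattern.
bool fullMatch(const std::string &pattern, const std::string &text) {
    return std::regex_match(text, std::regex(pattern));
}
```

Here expandNegated("abcdefghij", "aeiou") gives b|c|d|f|g|h|j, matching the start of the expansion shown above, and a+ and aa* accept exactly the same test strings.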