Scanning & Regular Expressions


CPSC 388
Ellen Walker
Hiram College

Scanning

• Input: characters from the source code

• Output: tokens
– Keywords: IF, THEN, ELSE, FOR …
– Symbols: PLUS, LBRACE, SEMI …
– Variable tokens: ID, NUM
  • Augmented with their string or numeric value

TokenType

• Enumerated type (a C++ construct):
    typedef enum {IF, THEN, ELSE …} TokenType;

• IF, THEN, ELSE (etc.) are now literals of type TokenType

Using TokenType

void someFun(TokenType tt) {
    …
    switch (tt) {
        case IF: … break;
        case THEN: … break;
        …
    }
}

Token Class (partial)

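class Token {
public:
    TokenType tokenval;   // which kind of token this is
    string tokenchars;    // the characters that make up the token
    double numval;        // numeric value (for NUM tokens)
};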
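A minimal sketch of how the enumerated type, the switch statement, and the Token class might fit together. The list of enumerators is filled in from the token names mentioned earlier in the slides, and the helpers tokenTypeName and describe are hypothetical names introduced only for this illustration.

```cpp
#include <string>

// Illustrative token types; the slides elide the full list.
enum TokenType { IF, THEN, ELSE, FOR, PLUS, LBRACE, SEMI, ID, NUM };

class Token {
public:
    TokenType tokenval;      // which kind of token this is
    std::string tokenchars;  // the characters that make up the token
    double numval;           // numeric value, meaningful only for NUM
};

// Hypothetical helper: map a token type to a printable name with a switch.
std::string tokenTypeName(TokenType tt) {
    switch (tt) {
        case IF:     return "IF";
        case THEN:   return "THEN";
        case ELSE:   return "ELSE";
        case FOR:    return "FOR";
        case PLUS:   return "PLUS";
        case LBRACE: return "LBRACE";
        case SEMI:   return "SEMI";
        case ID:     return "ID";
        case NUM:    return "NUM";
    }
    return "UNKNOWN";
}

// Hypothetical helper: variable tokens also show their characters.
std::string describe(const Token &t) {
    std::string s = tokenTypeName(t.tokenval);
    if (t.tokenval == ID || t.tokenval == NUM) s += "(" + t.tokenchars + ")";
    return s;
}
```

With these definitions, a Token built as Token{ID, "count", 0.0} would be described as ID(count), while a keyword token shows only its type.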

Interlude: References and Pointers

• Java has primitives and references
– Primitives are int, char, double, etc.
– References “point to” objects

• C++ has only primitives
– But one of the primitives is “address” (a pointer), which serves the purpose of a reference.


• To declare a pointer, put * after the type
    char x;   // a character
    char *y;  // a pointer to a character

• Using pointers:
    x = 'a';
    y = &x;    // y gets the address of x
    *y = 'b';  // thing pointed at by y becomes 'b'
               // note that x is now also 'b'!


• Continuing the example…
    cout << x << endl;           // prints b
    cout << *y << endl;          // prints b
    cout << (void *)y << endl;   // prints a hex address (the cast is needed: a bare char* is printed as a C string)
    cout << (void *)&x << endl;  // same as above
    cout << &y << endl;          // a different address: where the pointer is stored
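The pointer example can be collected into one compilable function. This is a sketch under assumptions: pointerDemo is a hypothetical name, output goes to a string stream rather than cout, and the two addresses are compared rather than printed because the hex values vary from run to run.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: reruns the pointer example and returns its output.
std::string pointerDemo() {
    char x = 'a';
    char *y = &x;  // y holds the address of x
    *y = 'b';      // writes through the pointer, so x changes too

    std::ostringstream out;
    out << x << ' ' << *y << ' ';
    // Cast to void* so the addresses are compared as addresses;
    // streaming a bare char* would print it as a C string instead.
    out << (static_cast<void *>(y) == static_cast<void *>(&x) ? "same" : "different");
    return out.str();
}
```

Both x and *y read as b afterwards because they name the same storage.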

GetToken(): A scanning function

• Token *GetToken(istream &sin)
– Read characters from sin until a complete token is extracted; return (a pointer to) the token

– Usually called by the parser

– Note: version in the book uses global variables and returns only the token type

Using GetToken

Token *myToken = GetToken(cin);
while (myToken != NULL) {
    // process the token
    switch (myToken->tokenval) {
        // cases for each token type
    }
    myToken = GetToken(cin);
}
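One possible shape for GetToken is sketched below. It is not the version from the course or the book: the token set, the lowercase keyword spellings, the UNKNOWN type, and the choice to return nullptr at end of input are all assumptions made for this illustration.

```cpp
#include <cctype>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Illustrative token set; UNKNOWN is an assumption, not from the slides.
enum TokenType { IF, THEN, ELSE, FOR, PLUS, LPAREN, RPAREN, LBRACE, SEMI,
                 ID, NUM, UNKNOWN };

class Token {
public:
    TokenType tokenval;
    std::string tokenchars;
    double numval;
};

// Sketch: read one token from sin. Returns nullptr at end of input.
// The caller owns the returned Token and should delete it.
Token *GetToken(std::istream &sin) {
    int c = sin.get();
    while (c != EOF && std::isspace(c)) c = sin.get();  // skip whitespace
    if (c == EOF) return nullptr;

    Token *t = new Token{UNKNOWN, std::string(1, static_cast<char>(c)), 0.0};

    if (std::isalpha(c)) {
        // letter (letter | digit)* : an identifier or a keyword
        while (std::isalnum(sin.peek())) t->tokenchars += static_cast<char>(sin.get());
        if (t->tokenchars == "if") t->tokenval = IF;
        else if (t->tokenchars == "then") t->tokenval = THEN;
        else if (t->tokenchars == "else") t->tokenval = ELSE;
        else if (t->tokenchars == "for") t->tokenval = FOR;
        else t->tokenval = ID;
    } else if (std::isdigit(c)) {
        // digit digit* : an unsigned integer literal
        while (std::isdigit(sin.peek())) t->tokenchars += static_cast<char>(sin.get());
        t->tokenval = NUM;
        t->numval = std::stod(t->tokenchars);
    } else {
        switch (c) {
            case '+': t->tokenval = PLUS;   break;
            case '(': t->tokenval = LPAREN; break;
            case ')': t->tokenval = RPAREN; break;
            case '{': t->tokenval = LBRACE; break;
            case ';': t->tokenval = SEMI;   break;
            default:  break;  // any other character stays UNKNOWN
        }
    }
    return t;
}
```

The scanner looks one character ahead with peek() so that the character ending an identifier or number is left in the stream for the next call.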

Result of GetToken

for (int i = 0 ; i < 100 ; i++){


TokenType: FOR

TokenType: LPAREN

Tokens and Languages

• The set of valid tokens of a particular type is a Language (in the formal sense)

• More specifically, it is a Regular Language

Language Formalities

• Language: set of strings
• String: sequence of symbols
• Alphabet: set of legal symbols for strings
– Generally Σ is used to denote an alphabet

Example Languages

• L1 = {aa, ab, bb}, Σ = {a, b}
• L2 = {ε, ab, abab, …}, Σ = {a, b}
• L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
• L4 = {ε} (one string with no symbols)
• L5 = { } (no strings at all)
• L5 = Ø

Denoting Languages

• Expressions (regular languages only)

• Grammars– Set of rewrite rules that express all and only the strings in the language

• Automata– Machines that “accept” all and only the strings in the language

Primitive Regular Expressions

• Ø
– L(Ø) = { } (no strings)

• ε
– L(ε) = {ε} (one string, no symbols)

• a, where a is a member of Σ
– L(a) = {a} (one string, one symbol)

Combining Regular Expressions

• Choice: r | s (sometimes r+s)
– L(r | s) = L(r) ∪ L(s)

• Concatenation: rs
– L(rs) = L(r)L(s)
– All combinations of 1 from r and 1 from s

• Repetition: r*
– L(r*) = L(ε) ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ …
– 0 or more strings from r concatenated
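The three combining operations can be tried out on small finite languages represented as sets of strings. This is only a sketch: the names Language, choice, concat, and star are made up for the illustration, and star is cut off at maxReps repetitions because the real L(r*) is infinite whenever r contains a non-empty string.

```cpp
#include <set>
#include <string>

using Language = std::set<std::string>;  // a finite language as a set of strings

// Choice: every string that is in r or in s (set union).
Language choice(const Language &r, const Language &s) {
    Language out = r;
    out.insert(s.begin(), s.end());
    return out;
}

// Concatenation: every combination of one string from r followed by one from s.
Language concat(const Language &r, const Language &s) {
    Language out;
    for (const std::string &x : r)
        for (const std::string &y : s)
            out.insert(x + y);
    return out;
}

// Repetition, truncated: 0 up to maxReps strings from r concatenated.
Language star(const Language &r, int maxReps) {
    Language out = {""};    // zero repetitions gives the empty string
    Language power = {""};  // strings made of exactly i pieces from r
    for (int i = 1; i <= maxReps; ++i) {
        power = concat(power, r);
        out = choice(out, power);
    }
    return out;
}
```

For example, concat applied to {a, b} and {c, d} gives {ac, ad, bc, bd}, and star applied to {ab} with maxReps of 2 gives the empty string, ab, and abab.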

Precedence

• Repetition before concatenation

• Concatenation before choice
• Use parentheses to override

• aa* vs. (aa)*
• ab|c vs. a(b|c)
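The two comparisons can be checked with std::regex from the C++ standard library, whose default (ECMAScript) syntax follows the same precedence for *, concatenation, and |. The helper names matches and precedenceExamplesHold are hypothetical, and the test strings are chosen for this sketch.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole string s is in L(pattern).
bool matches(const std::string &pattern, const std::string &s) {
    return std::regex_match(s, std::regex(pattern));
}

// Returns true when every expected difference between the pairs is observed.
bool precedenceExamplesHold() {
    return matches("aa*", "aaa") && !matches("(aa)*", "aaa")  // aa* means a(a*)
        && !matches("aa*", "") && matches("(aa)*", "")        // (aa)* allows zero pairs
        && matches("ab|c", "c") && !matches("a(b|c)", "c")    // ab|c means (ab)|c
        && matches("a(b|c)", "ac") && !matches("ab|c", "ac");
}
```

So aa* is one or more a's, (aa)* is an even number of a's, ab|c is the two strings ab and c, and a(b|c) is the two strings ab and ac.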

Example Languages

• L1 = {aa, ab, bb}, Σ = {a, b}
• L2 = {ε, ab, abab, …}, Σ = {a, b}
• L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
• L4 = {ε} (one string with no symbols)
• L5 = { } (no strings at all)
• L5 = Ø

R.E.’s for Examples

• L1 = aa | ab | bb
• L1 = a(a|b) | bb
• L1 = aa | (a|b)b
• L2 = (ab)*, not ab* !
• L3 = a(aa)*
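One way to gain confidence in these expressions is to enumerate every string over the alphabet up to some length and collect the ones each expression accepts. The sketch below does this with std::regex; allStrings and accepted are hypothetical helper names, and a bounded enumeration is evidence of equivalence, not a proof.

```cpp
#include <cstddef>
#include <regex>
#include <set>
#include <string>
#include <vector>

// All strings over the given alphabet with length 0 through maxLen.
std::vector<std::string> allStrings(const std::string &alphabet, int maxLen) {
    std::vector<std::string> out = {""};
    std::size_t start = 0;  // index of the first string of the previous length
    for (int len = 1; len <= maxLen; ++len) {
        std::size_t end = out.size();
        for (std::size_t i = start; i < end; ++i)
            for (char c : alphabet)
                out.push_back(out[i] + c);  // extend each shorter string by one symbol
        start = end;
    }
    return out;
}

// The strings of length <= maxLen that the pattern matches in full.
std::set<std::string> accepted(const std::string &pattern,
                               const std::string &alphabet, int maxLen) {
    std::regex re(pattern);
    std::set<std::string> out;
    for (const std::string &s : allStrings(alphabet, maxLen))
        if (std::regex_match(s, re)) out.insert(s);
    return out;
}
```

Calling accepted with each of the three L1 expressions over the alphabet "ab" yields the same set {aa, ab, bb}, while (ab)* and ab* already differ at length 3, where ab* accepts abb.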

What are these languages?

• a* | b* | c*
• a*b*c*
• (a*b*)*
• a(a|b)*c
• (a|b|c)*bab(a|b|c)*

What are the RE’s?

• In the alphabet {a,b,c}:
– All strings that are in alphabetical order
– All strings that have the first a before the first b, before the first c, e.g. ababbabca
– All strings that contain “abc”
– All strings that do not contain “abc”

Extended Reg. Exp’s

• Additional operations for convenience
    r+ = rr* (one or more reps)
    . (any character in the alphabet)
    .* = any possible string from the alphabet
    [a-z] = a|b|c|…|z
    [^aeiou] = b|c|d|f|g|h|j...
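These extended operators are what practical regular expression libraries provide, so token languages can be written with them directly. The sketch below uses std::regex; the two token patterns and the helper names classify and matches are assumptions for illustration, not definitions from the course.

```cpp
#include <regex>
#include <string>

// Hypothetical token patterns written with the extended operators.
std::string classify(const std::string &lexeme) {
    static const std::regex num("[0-9]+");         // one or more digits
    static const std::regex id("[a-z][a-z0-9]*");  // a letter, then letters or digits
    if (std::regex_match(lexeme, num)) return "NUM";
    if (std::regex_match(lexeme, id)) return "ID";
    return "UNKNOWN";
}

// Hypothetical helper: true when the whole string s is in L(pattern).
bool matches(const std::string &pattern, const std::string &s) {
    return std::regex_match(s, std::regex(pattern));
}
```

One caveat: in std::regex, [^aeiou] matches any single character that is not one of those vowels, including digits and punctuation, so it equals b|c|d|f|… only when the alphabet is restricted to lowercase letters.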
