Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker.
Lexical Analysis (4.2)
Programming Languages
Hiram College
Ellen Walker
Lexical Analysis is Pattern Matching
• From a sequence of characters to a sequence of lexemes, e.g.
  – “public static void main(char[] args)” ->
  – <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen>
• Patterns are simpler (easy grammars), e.g.
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z
Regular Grammars
• Subset of Context Free Grammars
• Every rule contains at most one non-terminal symbol on its right-hand side, at the end of the rule (or can be rewritten so it does…)
Rewritten Grammar for ID
• Original:
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z
• Rewrite:
  <id> -> (a | b | c | … | z) <id> | (a | b | c | … | z)
• Fully expanded (52 rules):
  <id> -> a <id> | b <id> | c <id> | … | a | b | c | … | z
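The two rules for <id> translate almost directly into a recursive check. The following is a minimal sketch in Java, assuming lowercase ASCII letters only, as on the slide. The names IdGrammar, isLetter and matchesId are illustrative and not from the course code.

```java
// Sketch: recognize strings generated by
//   <id>     -> <letter> <id> | <letter>
//   <letter> -> a | b | c | ... | z
final class IdGrammar {
    // <letter> -> a | b | ... | z
    static boolean isLetter(char c) {
        return c >= 'a' && c <= 'z';
    }

    // True if the rest of s, starting at position pos, can be derived from <id>.
    static boolean matchesId(String s, int pos) {
        if (pos >= s.length() || !isLetter(s.charAt(pos))) {
            return false;                 // both rules start with a <letter>
        }
        if (pos == s.length() - 1) {
            return true;                  // rule <id> -> <letter>
        }
        return matchesId(s, pos + 1);     // rule <id> -> <letter> <id>
    }

    public static void main(String[] args) {
        System.out.println(matchesId("main", 0));  // true
        System.out.println(matchesId("x1", 0));    // false
        System.out.println(matchesId("", 0));      // false
    }
}
```

Each recursive call consumes one letter, which mirrors the way a regular grammar consumes one terminal per rule application.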
Parsing using a Regular Grammar
1. Transform the grammar into a state machine
2. Implement the state machine in a computer program
  – By hand
  – Automatically, using table-lookup
3. Run this program on input strings
What is a State Machine?
• State machine abstraction
  – At any time, the process is in a “state”
  – Each time an “event” happens, the process takes an “action” and goes to the next state
  – We can describe the entire algorithm as a diagram where each state has an arrow for each event/action pair to the next appropriate state
State Machine for a Kitten
[Diagram: three states (Happy, Hungry, Sleeping) connected by arrows labeled event / action: “Food available / Eat”, “Toys available / Play”, “X hrs passed / Awaken”]
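An event/action machine like this one can be sketched in a few lines of Java. The three states and the three event/action labels come from the slide. Which state each arrow leaves and enters, and the initial state, are assumptions, since the diagram itself is not in the transcript. The class and enum names are illustrative.

```java
// Sketch of an event/action state machine.
// ASSUMPTION: the wiring of the arrows and the initial state are guesses.
final class KittenMachine {
    enum State { HUNGRY, HAPPY, SLEEPING }
    enum Event { FOOD_AVAILABLE, TOYS_AVAILABLE, HOURS_PASSED }

    State state = State.HUNGRY;   // assumed initial state

    // Handle one event: return the action taken and move to the next state.
    String handle(Event e) {
        if (state == State.HUNGRY && e == Event.FOOD_AVAILABLE) {
            state = State.HAPPY;
            return "Eat";
        }
        if (state == State.HAPPY && e == Event.TOYS_AVAILABLE) {
            state = State.SLEEPING;
            return "Play";
        }
        if (state == State.SLEEPING && e == Event.HOURS_PASSED) {
            state = State.HUNGRY;
            return "Awaken";
        }
        return "ignore";          // no arrow for this event in the current state
    }

    public static void main(String[] args) {
        KittenMachine kitten = new KittenMachine();
        System.out.println(kitten.handle(Event.FOOD_AVAILABLE));  // Eat
        System.out.println(kitten.handle(Event.TOYS_AVAILABLE));  // Play
        System.out.println(kitten.handle(Event.HOURS_PASSED));    // Awaken
    }
}
```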
State Machine for a Language
• Each “event” processes an input symbol
• Two important special states
  – Initial state: state the machine is in before the first symbol
  – Final state: state the machine is in whenever the sequence of symbols up to now is in the language
Transforming a Regular Grammar to a State Machine
• Put the grammar into a form so every rule is
  <nonterm1> -> symbol <nonterm2>
  <nonterm1> -> symbol
• Make a state for each nonterminal
• Make a transition (arrow) for each rule. The transition goes from <nonterm1> to <nonterm2> based on the symbol.
• The state for the grammar’s start symbol is the initial state.
• There is one final state; every rule that doesn’t have a nonterminal on the right goes to it.
State Machine Example
• <id> -> a <id> | b <id> | a | b
• Two states: id (initial) and f (final)
• Example: aabba
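The grammar-to-machine construction, applied to this example, can be sketched as follows. Because the rules "a <id>" and "a" both leave state id on the same symbol, the machine is nondeterministic, so the sketch tracks the set of states it could be in. The class GrammarToNfa and its methods are illustrative names, not part of the course code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: one state per nonterminal, one transition per rule, one shared final state.
final class GrammarToNfa {
    static final String FINAL = "f";
    // transitions.get(state).get(symbol) = set of possible next states
    final Map<String, Map<Character, Set<String>>> transitions = new HashMap<>();
    final String initial;

    GrammarToNfa(String startSymbol) {
        this.initial = startSymbol;   // the start symbol's state is the initial state
    }

    // Rule <from> -> symbol <to>; pass to == null for the rule <from> -> symbol.
    void addRule(String from, char symbol, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(symbol, k -> new HashSet<>())
                   .add(to == null ? FINAL : to);
    }

    // Follow every possible transition; accept if the final state is reachable at the end.
    boolean accepts(String input) {
        Set<String> current = new HashSet<>();
        current.add(initial);
        for (char c : input.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String s : current) {
                Map<Character, Set<String>> row = transitions.get(s);
                if (row != null && row.containsKey(c)) {
                    next.addAll(row.get(c));
                }
            }
            current = next;
        }
        return current.contains(FINAL);
    }

    public static void main(String[] args) {
        GrammarToNfa m = new GrammarToNfa("id");
        m.addRule("id", 'a', "id");   // <id> -> a <id>
        m.addRule("id", 'b', "id");   // <id> -> b <id>
        m.addRule("id", 'a', null);   // <id> -> a
        m.addRule("id", 'b', null);   // <id> -> b
        System.out.println(m.accepts("aabba"));  // true
        System.out.println(m.accepts(""));       // false
        System.out.println(m.accepts("abc"));    // false
    }
}
```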
Simpler State Machine
• This is a cleaner version of the other machine. Each (state, character) combination has only one next state.
• It is called a DFA (deterministic finite automaton)
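A DFA maps naturally onto a two-dimensional table indexed by state and input symbol. The slide's diagram is not in the transcript, so the sketch below is one possible DFA for the language of the previous example (non-empty strings of a and b), not necessarily the one drawn. The state names and the dead state ERR are assumptions.

```java
// Sketch: table-driven DFA accepting one or more characters from {a, b}.
final class SimpleDfa {
    static final int START = 0, ID = 1, ERR = 2;

    // NEXT[state][column], where column 0 is 'a' and column 1 is 'b'.
    static final int[][] NEXT = {
        { ID,  ID  },   // from START
        { ID,  ID  },   // from ID
        { ERR, ERR },   // from ERR (dead state: no way back)
    };

    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            int col = "ab".indexOf(c);
            state = (col < 0) ? ERR : NEXT[state][col];  // exactly one next state
        }
        return state == ID;   // ID is the only final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));  // true
        System.out.println(accepts(""));       // false
        System.out.println(accepts("abc"));    // false
    }
}
```

Unlike the nondeterministic version, the loop keeps a single current state, so each input character costs one table lookup.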
Lexical Analysis for Integer Expressions
From DFA to Program
Method doScan() reads tokens from an input stream (assume System.in for now) and creates a list of them in order.
Method lex(s) scans and returns a single Token from a stream.
A Token consists of a type (e.g. INT) and a string (e.g. “1234”)
09/15/10
Defining Constants
// Number all the states
public static final int NUMSTATES = 4;  // START, INT, ID, UNK (ERR needs no table row)
public static final int START = 0;
public static final int INT = 1;
public static final int ID = 2;
public static final int UNK = 3;
public static final int ERR = 4;
Constructing Transition Table (in constructor)
String chars = "01234abcdef+-()";
int[][] tt = new int[NUMSTATES][chars.length()];  // tt[state][index of the character in chars]
tt[ID][5] = ID;      // 'a'
tt[ID][6] = ID;      // 'b'
tt[START][5] = ID;   // 'a'
tt[START][1] = INT;  // '1'
// … etc …
tt[ID][0] = ERR;     // '0'
// … etc …
Recognizing Final States
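// For this grammar, all states but ERR are final
// Usually, this method is a bit more complex
// (named isFinal because final is a reserved word in Java)
boolean isFinal(int state) {
    return (state != ERR);
}
Lex Method
// Read one token from the input (any Scanner)
// Scanner is assumed to be a character-level scanner class with nextChar();
// java.util.Scanner has no such method
public static Token lex(Scanner s) {
    // initialize variables
    StringBuilder lexeme = new StringBuilder();
    int state = START;
    int oldstate = START;
    char ch = s.nextChar();
    …
Lex Method (cont’d)
    // loop through characters, updating state
    while (state != ERR) {
        oldstate = state;
        int col = chars.indexOf(ch);
        state = (col < 0) ? ERR : tt[oldstate][col];  // a character outside chars is an error
        if (state != ERR) {
            lexeme.append(ch);  // keep only characters that extend the token
            ch = s.nextChar();
        }
    }
Lex Method (cont’d)
    // return the token
    if (isFinal(oldstate))  // valid token
        return new Token(oldstate, lexeme.toString());
    else  // not a valid token – return the chars
        return new Token(ERR, lexeme.toString());
} // end of lex()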
From DFA to Program (cont’d)
public static boolean doScan() {
    Scanner s = new Scanner(System.in);
    while (s.peek()) {  // not EOF
        // removes whitespace
        eatWhitespace(s);
        Token token = lex(s);
        tokens.add(token);
        if (token.getType() == ERR) return false;
    }
    return true;
}
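The slides show the lexer in fragments, so here is a self-contained sketch that puts the pieces together. It makes several assumptions that are not in the slides: input comes from a String instead of a Scanner, the table entries elided by "etc" are filled in so that digits form INT and letters form ID, and UNK is taken to mean a single operator or parenthesis character. The class name TableLexer and the Token fields are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a table-driven lexer in the style of the slides.
final class TableLexer {
    static final int NUMSTATES = 4;   // START, INT, ID, UNK (ERR has no table row)
    static final int START = 0, INT = 1, ID = 2, UNK = 3, ERR = 4;
    static final String CHARS = "01234abcdef+-()";
    static final int[][] TT = new int[NUMSTATES][CHARS.length()];

    static {
        for (int[] row : TT) Arrays.fill(row, ERR);   // default: no transition
        for (int col = 0; col < CHARS.length(); col++) {
            char c = CHARS.charAt(col);
            if (Character.isDigit(c)) {               // ASSUMED: digits extend INT
                TT[START][col] = INT;
                TT[INT][col] = INT;
            } else if (Character.isLetter(c)) {       // letters extend ID
                TT[START][col] = ID;
                TT[ID][col] = ID;
            } else {                                  // ASSUMED: UNK is one symbol
                TT[START][col] = UNK;
            }
        }
    }

    static final class Token {
        final int type;
        final String lexeme;
        Token(int type, String lexeme) { this.type = type; this.lexeme = lexeme; }
    }

    final List<Token> tokens = new ArrayList<>();
    private final String input;
    private int pos = 0;

    TableLexer(String input) { this.input = input; }

    // Scan one token starting at pos; requires pos < input.length().
    Token lex() {
        StringBuilder lexeme = new StringBuilder();
        int state = START;
        while (pos < input.length()) {
            int col = CHARS.indexOf(input.charAt(pos));
            int next = (col < 0) ? ERR : TT[state][col];
            if (next == ERR) break;                   // leave this character for the next token
            lexeme.append(input.charAt(pos++));
            state = next;
        }
        if (state == START) {                         // nothing matched: report the bad character
            return new Token(ERR, String.valueOf(input.charAt(pos++)));
        }
        return new Token(state, lexeme.toString());
    }

    // Collect tokens in order; returns false as soon as an ERR token is produced.
    boolean doScan() {
        while (true) {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
            if (pos >= input.length()) return true;   // end of input
            Token t = lex();
            tokens.add(t);
            if (t.type == ERR) return false;
        }
    }

    public static void main(String[] args) {
        TableLexer lexer = new TableLexer("ab + (12 - c)");
        System.out.println(lexer.doScan());           // true
        for (Token t : lexer.tokens) {
            System.out.println(t.type + " " + t.lexeme);
        }
    }
}
```

The sketch stops before the character that has no transition, so the next call to lex() starts from it and no input is lost between tokens.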
Another Program (pp. 176-181)
• Programmed in C (no classes)
• Global variables instead of class variables (used in many functions, e.g. charClass)
• Token (int) and lexeme (string) unconnected
• States and transitions are implicit
• Lex() is a big case statement
• Many special purpose functions, e.g. getChar(), addChar(), lookup() executing portions of DFA
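For contrast with the table-driven version, the case-statement style can be sketched as below. This is not the textbook's program. It is a Java approximation that uses static fields in place of C globals, and the token codes, the rule that identifiers are a letter followed by letters or digits, and the names SwitchLexer, IDENT, INT_LIT and OTHER are all assumptions made for illustration.

```java
// Sketch: hand-coded lexer where the DFA states are implicit in the control flow.
final class SwitchLexer {
    static final int LETTER = 0, DIGIT = 1, UNKNOWN = 2;     // character classes
    static final int INT_LIT = 10, IDENT = 11, OTHER = 99;   // token codes (illustrative)

    // "Global" scanner state, as in the C style described above.
    static String input;
    static int pos;
    static String lexeme;   // token code and lexeme are separate, unconnected variables
    static int nextToken;

    static int charClass(char c) {
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return UNKNOWN;
    }

    // Scan one token starting at pos; requires pos < input.length().
    static int lex() {
        StringBuilder sb = new StringBuilder();
        switch (charClass(input.charAt(pos))) {
            case LETTER:   // identifier: keep reading letters and digits
                while (pos < input.length() && charClass(input.charAt(pos)) != UNKNOWN) {
                    sb.append(input.charAt(pos++));
                }
                nextToken = IDENT;
                break;
            case DIGIT:    // integer literal: keep reading digits
                while (pos < input.length() && charClass(input.charAt(pos)) == DIGIT) {
                    sb.append(input.charAt(pos++));
                }
                nextToken = INT_LIT;
                break;
            default:       // anything else is a one-character token
                sb.append(input.charAt(pos++));
                nextToken = OTHER;
                break;
        }
        lexeme = sb.toString();
        return nextToken;
    }

    public static void main(String[] args) {
        input = "sum1+42";
        pos = 0;
        while (pos < input.length()) {
            int token = lex();
            System.out.println(token + " " + lexeme);
        }
    }
}
```

Each case of the switch plays the role of one row of a transition table, which is why the states never appear as named constants.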