Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

30
Lexical Analysis: Regular Expressions CS 671 January 22, 2008

Transcript of Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

Page 1: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

Lexical Analysis:Regular Expressions

CS 671January 22, 2008

Page 2: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

2 CS 671 – Spring 2008

A program that translates a program in one language to another language• the essential interface between applications & architectures

Typically lowers the level of abstraction• analyzes and reasons about the program & architecture

We expect the program to be optimized i.e., better than the original• ideally exploiting architectural strengths and hiding weaknesses

Last Time …

CompilerHigh-Level

Programming Languages

Machine Code

Error Messages

Page 3: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

3 CS 671 – Spring 2008

Phases of a Compiler

Lexical analyzer

Syntax analyzer

Semantic analyzer

Intermediate code generator

Code optimizer

Code generator

Source program

Target program

Lexical Analyzer

•Group sequence of characters into lexemes – smallest meaningful entity in a language (keywords, identifiers, constants)

•Characters read from a file are buffered – helps decrease latency due to i/o. Lexical analyzer manages the buffer

•Makes use of the theory of regular languages and finite state machines

•Lex and Flex are tools that construct lexical analyzers from regular expression specifications

Page 4: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

4 CS 671 – Spring 2008

Phases of a Compiler

Lexical analyzer

Syntax analyzer

Semantic analyzer

Intermediate code generator

Code optimizer

Code generator

Source program

Target program

Parser

• Convert a linear structure – sequence of tokens – to a hierarchical tree-like structure – an AST

• The parser imposes the syntax rules of the language

• Work should be linear in the size of the input (else unusable) type consistency cannot be checked in this phase

•Deterministic context free languages and pushdown automata for the basis

• Bison and yacc allow a user to construct parsers from CFG specifications

Page 5: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

5 CS 671 – Spring 2008

Phases of a Compiler

Lexical analyzer

Syntax analyzer

Semantic analyzer

Intermediate code generator

Code optimizer

Code generator

Source program

Target program

Semantic Analysis

• Calculates the program’s “meaning”

• Rules of the language are checked (variable declaration, type checking)

• Type checking also needed for code generation (code gen for a + b depends on the type of a and b)

Page 6: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

6 CS 671 – Spring 2008

Phases of a Compiler

Lexical analyzer

Syntax analyzer

Semantic analyzer

Intermediate code generator

Code optimizer

Code generator

Source program

Target program

Intermediate Code Generation

• Makes it easy to port compiler to other architectures (e.g. Pentium to MIPS)

• Can also be the basis for interpreters (such as in Java)

• Enables optimizations that are not machine specific

Page 7: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

7 CS 671 – Spring 2008

Phases of a Compiler

Lexical analyzer

Syntax analyzer

Semantic analyzer

Intermediate code generator

Code optimizer

Code generator

Source program

Target program

Intermediate Code Optimization

• Constant propagation, dead code elimination, common sub-expression elimination, strength reduction, etc.

• Based on dataflow analysis – properties that are independent of execution paths

Page 8: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

8 CS 671 – Spring 2008

Phases of a Compiler

Lexical analyzer

Syntax analyzer

Semantic analyzer

Intermediate code generator

Code optimizer

Code generator

Source program

Target program

Native Code Generation

• Intermediate code is translated into native code

• Register allocation, instruction selection

Native Code Optimization

• Peephole optimizations – small window is optimized at a time

Page 9: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

9 CS 671 – Spring 2008

Administration

1. Compiling to assembly

1. HW1 on website: Fun with Lex/Yacc

2. Questionnaire Results…

Page 10: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

10 CS 671 – Spring 2008

Useful Tools!

•tar – archiving program

•gzip/bzip2 – compression

•svn – version control

•Make/Scons – build/run utility

•Other useful tools:– Man!– Which– Locate– Diff (or sdiff)

Page 11: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

11 CS 671 – Spring 2008

Makefiles

Target: dependent source file(s)

<tab>command

proj1

main.o io.odata.o

main.c io.h io.cdata.hdata.c

Page 12: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

12 CS 671 – Spring 2008

First Step: Lexical Analysis (Tokenizing)•Breaking the program down into words or “tokens”

•Input: stream of characters

•Output: stream of names, keywords, punctuation marks

•Side effect: Discards white space, comments

Source code: if (b==0) a = “Hi”;

Token Stream:

Lexical Analysis

Parsing

Page 13: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

13 CS 671 – Spring 2008

Lexical Tokens

• Identifiers: x y11 elsex _i00

• Keywords: if else while break

• Integers: 2 1000 -500 5L

• Floating point: 2.0 0.00020 .02 1.1e5 0.e-10

• Symbols: + * { } ++ < << [ ] >=

• Strings: “x” “He said, \“Are you?\””

• Comments: /** ignore me **/

Page 14: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

14 CS 671 – Spring 2008

Lexical Tokens

float match0(char *s) /* find a zero */

{

if (!strncmp(s, “0.0”, 3))

return 0.;

}

FLOAT ID(match0) _______ CHAR STAR ID(s) RPAREN LBRACE IF LPAREN BANG _______ LPAREN ID(s) COMMA STRING(0.0) ______ NUM(3) RPAREN RPAREN RETURN REAL(0.0) ______ RBRACE EOF

Page 15: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

15 CS 671 – Spring 2008

Ad-hoc Lexer

• Hand-write code to generate tokens

• How to read identifier tokens?

Token readIdentifier( ) {

String id = “”;

while (true) {

char c = input.read();

if (!identifierChar(c))

return new Token(ID, id, lineNumber);

id = id + String(c);

}

}

Page 16: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

16 CS 671 – Spring 2008

Problems

• Don’t know what kind of token we are going to read from seeing first character

– if token begins with “i’’ is it an identifier?– if token begins with “2” is it an integer? constant?– interleaved tokenizer code is hard to write correctly,

harder to maintain

• More principled approach: lexer generator that generates efficient tokenizer automatically (e.g., lex, flex)

Page 17: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

17 CS 671 – Spring 2008

Issues

• How to describe tokens unambiguously

2.e0 20.e-01 2.0000

“” “x” “\\” “\”\’”

• How to break text down into tokens

if (x == 0) a = x<<1;

if (x == 0) a = x<1;

• How to tokenize efficiently– tokens may have similar prefixes– want to look at each character ~1 time

Page 18: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

18 CS 671 – Spring 2008

How To Describe Tokens

• Programming language tokens can be described using regular expressions

• A regular expression R describes some set of strings L(R)

• L(R) is the language defined by R– L(abc) = { abc }– L(hello|goodbye) = {hello, goodbye}– L([1-9][0-9]*) = _______________

• Idea: define each kind of token using RE

Page 19: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

19 CS 671 – Spring 2008

Regular expressions

Language – set of strings

String – finite sequence of symbols

Symbols – taken from a finite alphabet

Specify languages using regular expressions

Symbol a one instance of a

Epsilon empty string

Alternation R | S string from either L(R) or L(S)

Concatenation

R ∙ Sstring from L(R) followed by L(S)

Repetition R*

Page 20: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

20 CS 671 – Spring 2008

Convenient Shorthand

[abcd] one of the listed characters (a | b | c | d)

[b-g] [bcdefg]

[b-gM-Qkr] ____________

[^ab]anything but one of the listed chars

[^a-f] ____________

M? Zero or one M

M+ One or more M

M* ____________

“a.+*” literally a.+*

. Any single character (except \n)

Page 21: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

21 CS 671 – Spring 2008

Examples

Regular Expression Strings in L(R)

digit = [0-9] “0” “1” “2” “3” …

posint = digit+ “8” “412” …

int = -? posint “-42” “1024” …

real = int (ε | (. posint)) “-1.56” “12” “1.0”

[a-zA-Z_][a-zA-Z0-9_]* C identifiers

• Lexer generators support abbreviations– But they cannot be recursive

Page 22: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

22 CS 671 – Spring 2008

More Examples

Whitespace:

Integers:

Hex numbers:

Valid UVa User Ids:

Loop keywords in C:

Page 23: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

23 CS 671 – Spring 2008

Breaking up Text

elsex=0;

•REs alone not enough: need rules for choosing

•Most languages: longest matching token wins– even if a shorter token is only way

•Ties in length resolved by prioritizing tokens

•RE’s + priorities + longest-matching token rule = lexer definition

else x = 0 ;

elsex = 0 ;

Page 24: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

24 CS 671 – Spring 2008

Lexer Generator Specification

• Input to lexer generator:– list of regular expressions in priority order– associated action for each RE (generates

appropriate kind of token, other bookkeeping)

• Output: – program that reads an input stream and breaks it

up into tokens according to the REs. (Or reports lexical error -- “Unexpected character” )

Page 25: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

25 CS 671 – Spring 2008

Lex: A Lexical Analyzer Generator

Lex produces a C program from a lexical specification http://www.epaperpress.com/lexandyacc/

%%

DIGITS [0-9]+

ALPHA [A-Za-z]

CHARACTER {ALPHA}|_

IDENTIFIER {ALPHA}({CHARACTER}|{DIGITS})*

%%

if {return IF; }

{IDENTIFIER} {return ID; }

{DIGITS} {return NUM; }

([0-9]+”.”[0-9]*)|([0-9]*”.”[0-9]+) {return ____; }

. {error(); }

Page 26: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

26 CS 671 – Spring 2008

Lexer Generator

• Reads in list of regular expressions R1,…Rn, one per token, with attached actions

-?[1-9][0-9]* { return new Token(Tokens.IntConst,

Integer.parseInt(yytext())

}

• Generates scanning code that decides:1. whether the input is lexically well-formed2. corresponding token sequence

• Problem 1 is equivalent to deciding whether the input is in the language of the regular expression

• How can we efficiently test membership in L(R) for arbitrary R?

Page 27: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

27 CS 671 – Spring 2008

Regular Expression Matching

• Sketch of an efficient implementation:– start in some initial state– look at each input character in sequence, update

scanner state accordingly– if state at end of input is an accepting state, the input

string matches the RE

• For tokenizing, only need a finite amount of state: (deterministic) finite automaton (DFA) or finite state machine

Page 28: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

28 CS 671 – Spring 2008

High Level View

Regular expressions = specification

Finite automata = implementation

Every regex has a FSA that recognizes its language

Scanner

ScannerGenerator

source code

specification

tokens

Design time

Compile time

Page 29: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

29 CS 671 – Spring 2008

Finite Automata

Takes an input string and determines whether it’s a valid sentence of a language

– A finite automaton has a finite set of states– Edges lead from one state to another– Edges are labeled with a symbol– One state is the start state– One or more states are the final state

0 1 2

i f

IF

0 1

a-za-z

0-9ID

26 edges

Page 30: Lexical Analysis: Regular Expressions CS 671 January 22, 2008.

30 CS 671 – Spring 2008

Language

Each string is accepted or rejected

1. Starting in the start state

2. Automaton follows one edge for every character (edge must match character)

3. After n-transitions for an n-character string, if final state then accept

Language: set of strings that the FSA accepts

0 1 2

i f

IFID3

ID

[a-z0-9]

[a-z0-9]

[a-hj-z]