Post on 18-Apr-2015
CS262 Programming Languages
UNIT 1 – String Patterns
Building a Web Browser
[Diagram: a web page (HTML + JavaScript, plus images) is the input to the web browser.]
HTML – web page basics. JavaScript – web page computations.
Example web page source: <b>hello 1+2+3
Three steps: (1) break it up into important words, (2) understand the structure [diagram: the parse tree for 1+2+3, a + node over 1 and a second + node over 2 and 3], (3) find the meaning (6).
The goal is to use the web browser to structure the learning. Breaking up strings in Python: “Hello world”.find(“ ”) --> 5
“1 + 1 = 2”.find(“1”,2[starting position]) --> 4
“haystack”.find(“needle”) --> -1 [not found]
Selecting Substrings: “hello”[1[start here]:3[up to but not including]] --> “el”
“hello”[1:[go as far as possible]] --> “ello”
Splitting Words by Whitespace: “Jane Eyre”.split() --> [“Jane”, “Eyre”]
We need more control over splitting strings --> Regular Expressions Regular Expressions: [1-3] –(matches or denotes)-> “1” “2” “3”
[a-b] --> “a” “b”
A module is a repository or library of functions and data. In Python, import brings in a module: import re
The plain string "[0-9]" is just a 5-character string, but the regular expression r"[0-9]" matches 10 different 1-character strings ("0" through "9"). findall takes a regular expression and a string, and returns a list of all of the substrings that match that regular expression: re.findall(r"[0-9]","1+2==3") --> ["1", "2", "3"]
re.findall(r“[1-2]”,“1+2==3”) --> [“1”, “2”]
re.findall(r“[a-c]”,“Barbara Liskov”) --> [“a”, “b”, “a”, “a”]
We’ll need to find /> and == for JavaScript and HTML. Thus, we need to express concatenation and repetition, to match more complicated [compound] strings: r“[a-c][1-2]” --> “a1” “a2” “b1” “b2” “c1” “c2”
r“[0-9][0-9]” --> “00” “01” “02” ... “99”
re.findall(r“[0-9][0-9]”,“July 28, 1821”) --> [“28”, “18”, “21”]
re.findall(r“[0-9][0-9]”,“12345”) --> [“12”, “34”]
re.findall(r“[a-z][0-9]”,“a1 2b cc3 44d”) --> [“a1”, “c3”]
+ (One or More times) Regular Expression: r"a+" [+ applies to the previous r.e.] --> "a" "aa" "aaa" "aaaa" "aaaaa" ...
r“[0-1]+” -(matches)-> “0” “1” “00” “11” “01” “100” “1101” ...
Maximum Munch (an r.e. should consume the biggest string it can and not smaller parts): re.findall(r“[0-9]+”, “13 from 1 in 1776”) --> [“13”, “1”, “1776”]
re.findall(r"[0-9] [0-9]+" [the space matches a literal space], "a1 2b cc3 44d") --> ["1 2", "3 44"]
Finite State Machines --> a visual representation of regular expressions. Suppose we want r"[0-9]+%" [the % is matched directly]:
[Diagram: the FSM for r"[0-9]+%". Start at state 1; an edge labeled 0-9 leads to state 2, which loops on 0-9; an edge labeled % leads to state 3. The circles are states, the arrows are edges or transitions, and state 3 is the accepting state.]
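A quick check that the pattern the FSM depicts behaves as described (the sample string here is made up for illustration):

```python
import re

# the FSM above accepts one or more digits followed by a percent sign
print(re.findall(r"[0-9]+%", "save 10% now, 5 dollars, 30% later"))
# ['10%', '30%']
```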
| (Disjunction or OR) Regular Expression: r“[a-z]+|[0-9]+”
re.findall(r“[a-z]+|[0-9]+”,“Goethe 1749”) --> [“Goethe”, “1749”]
Optional Components (<something>|<nothing>):
[Diagram: two equivalent FSMs for an optional minus sign followed by digits. In the first, state 1 has a "-" edge and an ϵ edge (ϵ = no input, or the empty string) leading to the digit-reading states. The second machine is more concise, folding the optional "-" into a single edge.]
? (Optional, Zero or One time) Regular Expression: re.findall(r"-?[0-9]+","1861-1941 R. Tagore") --> ["1861", "-1941"]
* (Zero or More times) Regular Expression: a+ is the same as aa*
So now + * ? [ ] all mean something special in regular expressions. But if we want to refer to the symbols themselves, we can use Escape Sequences.
Escape Sequences: \ - Escape Character
r"\+\+" --(matches)-> "++"
. (any character except newline) Regular Expression: re.findall(r"[0-9].[0-9]","1a1 222 cc3") --> ["1a1", "222"]
^ [caret] (anything except the listed characters) Regular Expression: re.findall(r"[0-9][^ab]","1a1 222 cc3") --> ["1 ", "22", "2 "]
(?: ) (parentheses to show structure) Regular Expression: re.findall(r"(?:do|re|mi)+","mimi rere midore doo-wop") --> ["mimi", "rere", "midore", "do"]
How to represent (encode) a FSM? – Dictionaries!
edges[(1,“a”)] = 2
Dictionaries: is_flower = {}
is_flower['rose'] = True
is_flower['dog'] = False
or is_flower = {'rose': True,
                'dog': False}
>>> is_flower['rose']
True
>>> is_flower['juliet']
KeyError: 'juliet' [looking up a missing key is an error]
Tuples: Tuples are immutable lists. point = (1,5)
point[0] == 1
point[1] == 5
Let’s encode r“a+1+”:
[Diagram: the FSM for r"a+1+": state 1 --a--> state 2; state 2 loops on a; state 2 --1--> state 3; state 3 loops on 1 and is accepting.]
edges = {(1,'a'): 2, (2,'a'): 2, (2,'1'): 3, (3,'1'): 3}
accepting = [3]
FSM Simulator: fsmsim(<string>, <starting state>, <edges>, <accepting>) --> True,
if the <string> is accepted by the FSM (<edges>,<accepting>)
def fsmsim(string, current, edges, accepting):
    if string == "":
        return current in accepting
    else:
        letter = string[0]
        if (current, letter) in edges:
            destination = edges[(current, letter)]
            remaining_string = string[1:]
            return fsmsim(remaining_string, destination, edges, accepting)
        else:
            return False
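Running the simulator on the r"a+1+" machine encoded above (restated compactly here so the example is self-contained; print() style so it also runs under Python 3):

```python
def fsmsim(string, current, edges, accepting):
    # recursive FSM simulation, as in the notes
    if string == "":
        return current in accepting
    letter = string[0]
    if (current, letter) in edges:
        return fsmsim(string[1:], edges[(current, letter)], edges, accepting)
    return False

# the FSM for r"a+1+": note the letters in the keys are strings
edges = {(1, 'a'): 2, (2, 'a'): 2, (2, '1'): 3, (3, '1'): 3}
accepting = [3]
print(fsmsim("aaa111", 1, edges, accepting))  # True
print(fsmsim("a1a1", 1, edges, accepting))    # False
```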
Handling Epsilon and Ambiguity:
A FSM accepts a string s if there exists even one path from the start state to any accepting state following s. “easy-to-write” FSMs with epsilon transitions or ambiguity are known as non-deterministic (you may not know exactly where to go) finite state machines. A “lock-step” FSM with no epsilon edges or ambiguity is a deterministic finite state machine. [fsmsim can handle these]
Every non-deterministic FSM has a corresponding deterministic FSM that accepts exactly the same strings.
Non-deterministic FSMs are NOT more powerful, they are just more convenient. Idea: Build a deterministic machine D where every state in D corresponds to a SET of states in the non-deterministic machine.
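One way to sketch the "sets of states" idea in code: simulate the NFA by tracking the set of states we could be in, taking epsilon-closures as we go. This is the subset construction computed on the fly; the encoding here (epsilon edges keyed with None) is an assumption of this sketch, not the course's encoding:

```python
def epsilon_closure(states, edges):
    # all states reachable from `states` using only epsilon (None) edges
    result = set(states)
    frontier = list(states)
    while frontier:
        s = frontier.pop()
        for t in edges.get((s, None), []):
            if t not in result:
                result.add(t)
                frontier.append(t)
    return frozenset(result)

def nfa_accepts(string, start, edges, accepting):
    # deterministic simulation: the "state" is a SET of NFA states
    current = epsilon_closure([start], edges)
    for letter in string:
        nxt = set()
        for s in current:
            nxt.update(edges.get((s, letter), []))
        current = epsilon_closure(nxt, edges)
    return any(s in accepting for s in current)

# an NFA for r"ab?c", with an epsilon edge that skips the optional b
edges = {(1, 'a'): [2], (2, 'b'): [3], (2, None): [3], (3, 'c'): [4]}
print(nfa_accepts("abc", 1, edges, [4]))   # True
print(nfa_accepts("ac", 1, edges, [4]))    # True
print(nfa_accepts("abbc", 1, edges, [4]))  # False
```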
Example: r“ab?c”
[Diagrams: an NFA for r"ab?c" with an ϵ edge that skips the optional b, and the corresponding deterministic FSM whose states are sets of NFA states (e.g., {2,3,4,6}): no epsilon edges, no ambiguity.]
Example 2:
[Diagrams: another NFA with ϵ edges and ambiguity over the letters a, b, and c, and its deterministic equivalent built from sets of states (e.g., {2,3} and {2,4,5,6}): no epsilon edges, no ambiguity.]
Wrap Up:
STRINGS – sequences of characters.
REGULAR EXPRESSIONS – concise notation for specifying sets of strings. More flexible than fixed string matching. (phone numbers, words, numbers, quoted strings) <-- search for and match them.
FINITE STATE MACHINES – pictorial equivalent to regular expressions.
DETERMINISTIC – every FSM can be converted to a deterministic FSM.
FSM SIMULATION – it is very easy (~10 lines of recursive code) to see if a deterministic FSM accepts a string.
Simulating Non-Deterministic FSMs:
def nfsmsim(string, current, edges, accepting):
    if string == "":
        return current in accepting
    else:
        letter = string[0:1]
        if (current, letter) in edges:
            remainder = string[1:]
            newstates = edges[(current, letter)]
            for next_state in newstates:
                if nfsmsim(remainder, next_state, edges, accepting):
                    return True
        return False
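Running the non-deterministic simulator on a small ambiguous machine (nfsmsim restated compactly so the example is self-contained):

```python
def nfsmsim(string, current, edges, accepting):
    # non-deterministic simulation: edges map to LISTS of states
    if string == "":
        return current in accepting
    letter = string[0]
    if (current, letter) in edges:
        for next_state in edges[(current, letter)]:
            if nfsmsim(string[1:], next_state, edges, accepting):
                return True
    return False

# ambiguity: on 'a', state 1 can stay at 1 or move to 2
edges = {(1, 'a'): [1, 2], (2, '1'): [3]}
print(nfsmsim("aa1", 1, edges, [3]))  # True
print(nfsmsim("a1a", 1, edges, [3]))  # False
```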
Lexical Analysis: String --> List of Tokens
Reading Machine Minds (Identifying empty FSMs):
def nfsmaccepts(current, edges, accepting, visited):
    # returns some string accepted by the FSM, or None if there is none
    if current in visited:
        return None
    elif current in accepting:
        return ""
    else:
        newvisited = visited + [current]
        for edge in edges:
            if edge[0] == current:
                for newstate in edges[edge]:
                    foo = nfsmaccepts(newstate, edges, accepting, newvisited)
                    if foo != None:
                        return edge[1] + foo
        return None
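A quick check of nfsmaccepts (restated compactly so the example is self-contained): it returns some accepted string, or None when no accepting state is reachable.

```python
def nfsmaccepts(current, edges, accepting, visited):
    # returns some string accepted by the FSM, or None
    if current in visited:
        return None
    elif current in accepting:
        return ""
    else:
        newvisited = visited + [current]
        for edge in edges:
            if edge[0] == current:
                for newstate in edges[edge]:
                    foo = nfsmaccepts(newstate, edges, accepting, newvisited)
                    if foo != None:
                        return edge[1] + foo
        return None

# a machine accepting "ab": 1 --a--> 2 --b--> 3 (accepting)
edges = {(1, 'a'): [2], (2, 'b'): [3]}
print(nfsmaccepts(1, edges, [3], []))  # ab
print(nfsmaccepts(1, edges, [4], []))  # None (state 4 is unreachable)
```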
UNIT 2 – Lexical Analysis
A Lexical Analyzer is a program that reads in a web page or a bit of JavaScript and breaks it down into words. Specify HTML + JavaScript:
HyperTextMarkupLanguage tells a web browser how to display a web page. (a missing end tag causes the text starting from the start tag to be influenced, all the way to the end) Bold tag: <b>python</b> Underline tag: <u>python</u> Italics tag: <i>python</i>
Anchor tag: <a href = “http://www.google.com”>now!</a> [Here, href (hypertext reference) is an argument to the anchor tag] Paragraph tag: <p>python</p> LEXICAL ANALYSIS – breaking something into words TOKEN – smallest unit of lexical analysis output: words, strings, numbers, punctuation, not whitespace
So, lexical analysis breaks down a string into a list of tokens. Some HTML Tokens:
LANGLE <
LANGLESLASH </
RANGLE >
EQUAL =
STRING “google.com”
WORD Welcome!
We’ll use regular expressions to specify tokens, and this is how we write out token definitions in Python:
def t_RANGLE(token):  # the token name
    r'>'              # the regexp matching this token
    return token      # return the text unchanged
Token Values: By default, the value is the string matched.
def t_NUMBER(token):
    r'[0-9]+'
    token.value = int(token.value)
    return token
Quoted Strings are critical to interpreting HTML and JavaScript:
def t_STRING(token):
    r'"[^"]*"'
    return token
We want to skip or pass over spaces!
def t_WHITESPACE(token):
    r' '
    pass
What's left is words:
def t_WORD(token):
    r'[^ <>]+'
    return token
A LEXER (LEXical analyzER) is just a collection of token definitions.
When two token definitions can match the same string, the behavior of our lexical analyzer may be ambiguous. In our implementation, we favor the token definition listed first. String Snipping (remove quotes that are markers for strings and are separate from the meaning):
def t_STRING(token):
    r'"[^"]*"'
    token.value = token.value[1:-1]
    return token
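The snipping itself is ordinary Python slicing; outside the lexer it looks like this:

```python
# drop the first and last characters (the quote markers)
quoted = '"google.com"'
print(quoted[1:-1])  # google.com
```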
Let’s make a Lexical Analyzer:
import ply.lex as lex
tokens = (
    'LANGLE',       # <
    'LANGLESLASH',  # </
    'RANGLE',       # >
    'EQUAL',        # =
    'STRING',       # "hello"
    'WORD'          # Welcome!
)
t_ignore = ' '  # shortcut for whitespace
def t_LANGLESLASH(token):
    r'</'
    return token
def t_LANGLE(token):
    r'<'
    return token
def t_RANGLE(token):
    r'>'
    return token
def t_EQUAL(token):
    r'='
    return token
def t_STRING(token):
    r'"[^"]*"'
    token.value = token.value[1:-1]
    return token
def t_WORD(token):
    r'[^ <>\n]+'
    return token
webpage = "This is <b>my</b> webpage!"
# the next line tells our lexical analysis library that we want to use
# all of the token definitions above to make a lexical analyzer, and
# break up strings.
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
    tok = htmllexer.token()
    if not tok: break
    # tok --> LexToken(<NAME>, <token>, <line>, <character>)
    print tok
Tracking Line Numbers:
def t_NEWLINE(token):
    r'\n'
    token.lexer.lineno += 1
    pass
Comments (documentation, removing functionality):
HTML comments: <!-- comments -->
Adding support for HTML comments to our lexical analyzer. Comments will be modeled as a separate FSM that will ignore everything. Lexer States:
states = (
    # exclusive – if I am in the middle of processing an HTML comment,
    # I can't be doing anything else.
    ('htmlcomment', 'exclusive'),
)
# This rule goes before the normal lexer rules.
def t_htmlcomment(token):
    r'<!--'
    token.lexer.begin('htmlcomment')
def t_htmlcomment_end(token):
    r'-->'
    token.lexer.lineno += token.value.count('\n')
    token.lexer.begin('INITIAL')
# We've said what to do when an HTML comment begins and ends, but
# any other character we see in this special HTML comment mode
# isn't going to match one of those two rules. So...
def t_htmlcomment_error(token):
    token.lexer.skip(1)
# It's a lot like pass, except that it gathers up all of the text
# into one big value so that we can count the newlines later.
Introducing JavaScript (with an example): <p>
Welcome to <b>my</b> webpage.
Five factorial (aka 5!) is:
<script type=“text/JavaScript”>
function factorial(n) {
if (n == 0) {
return 1;
} ;
return n * factorial(n-1);
}
document.write(factorial(5));
</script>
</p>
Identifiers: A Variable or Function Name They identify a particular value or storage location.
# Identifier for JavaScript
def t_IDENTIFIER(token):
    r'[A-Za-z][A-Za-z_]*'
    return token
Numbers in JavaScript:
def t_NUMBER(token):
    r'-?[0-9]+(?:\.[0-9]*)?'
    token.value = float(token.value)
    return token
End-of-Line Comments in JavaScript:
def t_eolcomment(token):
    r'//[^\n]*'
    pass
Wrap Up: TOKENS; HTML; JAVASCRIPT.
Anonymous Functions: Making functions on the fly. The return here is implicit.
# find the max element of list_of_words according to the function given
print findmax(lambda(word): word.find('python'), list_of_words)
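findmax itself is not defined in the notes; a minimal sketch of what such a helper could look like (the name and behavior are assumptions based on the call above; the lambda is written in Python 3 style here):

```python
def findmax(f, words):
    # hypothetical helper matching the notes' call: return the element
    # of words for which f(element) is largest
    best = None
    for w in words:
        if best is None or f(w) > f(best):
            best = w
    return best

# 'python'.find('python') is 0; 'monty python'.find('python') is 6
print(findmax(lambda word: word.find('python'), ['python', 'monty python']))
# monty python
```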
Exercise: Identify Identifiers, and Decimal and Hexadecimal Numbers.
import ply.lex as lex
tokens = ('NUM', 'ID')
def t_ID(token):
    r'[A-Za-z]+'
    return token
def t_NUM_hex(token):
    r'0x[0-9a-fA-F]+'
    token.value = int(token.value, 0)
    token.type = 'NUM'
    return token
def t_NUM_decimal(token):
    r'[0-9]+'
    token.value = int(token.value)
    token.type = 'NUM'
    return token
t_ignore = ' \t\v\r'
def t_error(token):
    print "Lexer: unexpected character " + token.value[0]
    token.lexer.skip(1)
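The hex rule leans on int with base 0, which infers the base from the string's prefix:

```python
# int with base 0 reads "0x..." as hexadecimal and plain digits as decimal
print(int("0x2a", 0))  # 42
print(int("42", 0))    # 42
```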
Regular Expressions SUBSTITUTE: re.sub(regexp, new_text, haystack)
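A quick example of re.sub (the sample string is made up for illustration):

```python
import re

# re.sub(regexp, new_text, haystack) replaces every match of the pattern
print(re.sub(r"NOSPAM", "", "cateNOSPAM@exampleNOSPAM.com"))
# cate@example.com
```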
Exercise: Identify emails that may or may not contain NOSPAM text in it.
import ply.lex as lex
import re
tokens = ('EMAIL',)  # note the comma: this must be a tuple
def t_EMAIL(token):
    r'[A-Za-z]+@[A-Za-z]+(?:\.[A-Za-z]+)+'
    token.value = re.sub(r'NOSPAM', '', token.value)
    return token
def t_error(token):
    print 'Lexer: unexpected character ' + token.value[0]
    token.lexer.skip(1)
def addresses(haystack):
    lexer = lex.lex()
    lexer.input(haystack)
    result = []
    while True:
        tok = lexer.token()
        if not tok: break
        result += [tok.value]
    return result
Exercise: JavaScript Comments & Keywords.
tokens = (
'ANDAND', # &&
'COMMA', # ,
'DIVIDE', # /
'ELSE', # else
'EQUAL', # =
'EQUALEQUAL', # ==
'FALSE', # false
'FUNCTION', # function
'GE', # >=
'GT', # >
# 'IDENTIFIER', #### Not used in this problem.
'IF', # if
'LBRACE', # {
'LE', # <=
'LPAREN', # (
'LT', # <
'MINUS', # -
'NOT', # !
# 'NUMBER', #### Not used in this problem.
'OROR', # ||
'PLUS', # +
'RBRACE', # }
'RETURN', # return
'RPAREN', # )
'SEMICOLON', # ;
# 'STRING', #### Not used in this problem.
'TIMES', # *
'TRUE', # true
'VAR', # var
)
states = (
('jscomment','exclusive'),
)
def t_jscomment(token):
    r'/\*'
    token.lexer.begin('jscomment')
def t_jscomment_end(token):
    r'\*/'
    token.lexer.lineno += token.value.count('\n')
    token.lexer.begin('INITIAL')
def t_jscomment_error(token):
    token.lexer.skip(1)
def t_eolcomment(token):
    r'//[^\n]+'
    pass
t_ANDAND = r'&&'
t_COMMA = r','
t_DIVIDE = r'/'
t_ELSE = r'else'
t_EQUAL = r'='
t_EQUALEQUAL = r'=='
t_FALSE = r'false'
t_FUNCTION = r'function'
t_GE = r'>='
t_GT = r'>'
t_IF = r'if'
t_LBRACE = r'{'
t_LE = r'<='
t_LPAREN = r'\('
t_LT = r'<'
t_MINUS = r'-'
t_NOT = r'!'
t_OROR = r'\|\|'
t_PLUS = r'\+'
t_RBRACE = r'}'
t_RETURN = r'return'
t_RPAREN = r'\)'
t_SEMICOLON = r';'
t_TIMES = r'\*'
t_TRUE = r'true'
t_VAR = r'var'
t_ignore = ' \t\v\r' # whitespace
t_jscomment_ignore = ' \t\v\r' # whitespace
def t_newline(t):
    r'\n'
    t.lexer.lineno += 1
def t_error(t):
    print "JavaScript Lexer: Illegal character " + t.value[0]
    t.lexer.skip(1)
Exercise: JavaScript Numbers and Strings.
import ply.lex as lex
tokens = (
'IDENTIFIER',
'NUMBER',
'STRING',
)
def t_IDENTIFIER(token):
    r'[A-Za-z][A-Za-z_]*'
    return token
def t_NUMBER(token):
    r'-?[0-9]+(?:\.[0-9]*)?'
    token.value = float(token.value)
    return token
def t_STRING(token):
    r'"(?:[^"\\]|(?:\\.))*"'
    token.value = token.value[1:-1]
    return token
t_ignore = ' \t\v\r' # whitespace
def t_newline(t):
    r'\n'
    t.lexer.lineno += 1
def t_error(t):
    print "JavaScript Lexer: Illegal character " + t.value[0]
    t.lexer.skip(1)
FSM Optimization: Removing Dead States.
def nfsmtrim(edges, accepting):
    states = []
    for e in edges:
        states += [e[0]] + edges[e]
    live = []
    for s in states:
        if nfsmaccepts(s, edges, accepting, []) != None:
            live += [s]
    new_edges = {}
    for e in edges:
        if e[0] in live:
            new_destinations = []
            for destination in edges[e]:
                if destination in live:
                    new_destinations += [destination]
            if new_destinations != []:
                new_edges[e] = new_destinations
    new_accepting = []
    for s in accepting:
        if s in live:
            new_accepting += [s]
    return (new_edges, new_accepting)
UNIT 3 – Grammars
Lexing --> list of tokens. A list of words isn't enough: they have to adhere to a valid structure. Grammars give infinite utterances, yet not all utterances. Noam Chomsky --> utterances have rules, governed by formal grammars [grammatical sentences]. Formal Grammars:
Sentence --> Subject Verb
Subject --> Teachers
Subject --> Students
Verb --> write
Verb --> think
[The whole set of rewrite rules is the grammar; Sentence, Subject, and Verb are non-terminals; Teachers, Students, write, and think are terminals.]
Recursion in a context-free (recursive) grammar can allow for an infinite number of utterances. Adding to the previous grammar the rule Subject --> Subject and Subject, we get: Sentence --> Subject Verb --> Subject and Subject Verb
--> Students and Teachers think
Syntactical Analysis (Parsing): Token List --> Valid in Grammar?
Lexing + Parsing = Expressive Power or word rules + sentence rules = creativity! Statements: stmt --> identifier = exp
exp --> exp + exp
exp --> exp – exp
exp --> number
[Derivation shown as a parse tree: Sentence at the root, with children Subject and Verb; Subject derives Students, Verb derives think.]
Optional Parts of Languages:
Sent --> OptAdj Subj Verb
Subj --> William
Subj --> Tell
OptAdj --> Accurate
OptAdj --> ϵ
Verb --> shoots
Verb --> bows
Grammars can encode Regular Expressions: number = r'[0-9]+'
number --> digit more_digits
more_digits --> digit more_digits
more_digits --> ϵ
digit --> 0
digit --> 1
...
digit --> 9
Grammars vs. Regular Expressions: Regular Expressions describe Regular Languages; Grammars describe Context-Free Languages. A language L is a context-free language if there exists a context-free grammar G such that the set of strings accepted by G is exactly L. Context-free languages are strictly more powerful than regular languages. Irregularities: features too complicated to be captured by regular expressions.
Balanced Parentheses are not Regular: p --> ( p )
p --> ϵ
[A regular expression such as r'\(*\)*' matches any run of open parentheses followed by any run of close parentheses; it cannot require them to balance.]
We are going to use formal grammars to understand or describe HTML and JavaScript. Parse trees are a pictorial representation of the structure of an utterance. A parse tree demonstrates that a string is in the language of a grammar.
exp --> exp + exp
exp --> exp – exp
exp --> number
[Parse tree for 1+2+3: the root exp splits into exp + exp; the left exp splits again into exp + exp deriving the numbers 1 and 2, and the right exp derives the number 3.]
One trait shared by programming languages and natural languages is ambiguity. A grammar is ambiguous if at least 1 string in the grammar has more than 1 different parse tree.
Grammar for HTML & JavaScript: HTML (Partial Grammar)
html --> element html
html --> ϵ
element --> word
element --> tag_open html tag_close
tag_open --> < word >
tag_close --> </ word >
JavaScript (Partial Grammar) === expressions ========
exp --> identifier
exp --> number
exp --> string
exp --> exp + exp
exp --> exp - exp
exp --> exp * exp
exp --> exp / exp
exp --> exp < exp
exp --> exp == exp
exp --> exp && exp
exp --> TRUE
exp --> FALSE
=== statements ==========
stmt --> identifier = exp
stmt --> return exp
stmt --> if exp compoundstmt
stmt --> if exp compoundstmt else compoundstmt
compoundstmt --> { stmtS }
stmtS --> stmt ; stmtS
stmtS --> ϵ
=== function calls and definitions ==========
js --> element js
js --> ϵ
element --> function identifier ( optparams ) compoundstmt
element --> stmt ;
optparams --> params
optparams --> ϵ
params --> identifier , params
params --> identifier
=== expressions continued ==========
exp --> identifier ( optargs )
optargs --> args
optargs --> ϵ
args --> exp , args
args --> exp
<b>welcome to <i>my</i> webpage!</b>
[Parse tree: html derives element html; the element is a tag-element whose open tag is <b>, whose body is the html subtree for "welcome to <i>my</i> webpage!", and whose close tag is </b>. This subtree (part of the webpage) is influenced by the bold tag.]
lambda: “Make me a function”, or “I am defining an anonymous function”.
# I'm assigning to the variable mystery the result
# of the lambda expression
mystery = lambda(x): x + 2
print mystery(3)  # 5
map Function: Takes a function as its first argument, and then a list, and it applies that function to each element of the list in turn, creating a new list.
def mysquare(x): return x*x
print map(mysquare, [1,2,3,4,5])  # [1,4,9,16,25]
or print map(lambda(x): x*x, [1,2,3,4,5])
List Comprehensions:
print [x*x for x in [1,2,3,4,5]]  # [1,4,9,16,25]
print [len(x) for x in ['hello', 'my', 'friends']]  # [5,2,7]
Generators: Filtering data.
def odds_only(numbers):
    for n in numbers:
        if (n % 2) == 1:
            yield n
print [x for x in odds_only([1,2,3,4,5])]  # [1,3,5]
or print [x for x in [1,2,3,4,5] if x % 2 == 1]  # [1,3,5]
Encoding Grammars: A --> B C, ... becomes [('A', ['B', 'C']), ...]
and enumerating strings (a slow way):
def expand(tokens, grammar):
    for pos in range(len(tokens)):
        for rule in grammar:
            if tokens[pos] == rule[0]:
                yield tokens[0:pos] + rule[1] + tokens[pos+1:]
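Enumerating one-step expansions with this generator (expand restated so the example is self-contained):

```python
def expand(tokens, grammar):
    # one-step expansions: replace each non-terminal, one position
    # at a time, by each of its right-hand sides
    for pos in range(len(tokens)):
        for rule in grammar:
            if tokens[pos] == rule[0]:
                yield tokens[0:pos] + rule[1] + tokens[pos+1:]

grammar = [('exp', ['exp', '+', 'exp']),
           ('exp', ['num'])]
print(list(expand(['exp'], grammar)))
# [['exp', '+', 'exp'], ['num']]
```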
Reading Machine Minds 2 (Identifying empty Context-free Grammars):
def cfgempty(grammar, symbol, visited):
    if symbol in visited:
        return None
    elif not any([rule[0] == symbol for rule in grammar]):
        return [symbol]
    else:
        new_visited = visited + [symbol]
        for rhs in [r[1] for r in grammar if r[0] == symbol]:
            if all([None != cfgempty(grammar, r, new_visited) for r in rhs]):
                result = []
                for r in rhs:
                    result += cfgempty(grammar, r, new_visited)
                return result
        return None
Infinite Mind Reading (Identify infinite grammars [ones that accept infinitely many strings]):
def cfginfinite(grammar):
    for Q in [rule[0] for rule in grammar]:
        def helper(current, visited, sizexy):
            if current in visited:
                return sizexy > 0
            else:
                new_visited = visited + [current]
                for rhs in [rule[1] for rule in grammar if rule[0] == current]:
                    for symbol in rhs:
                        if helper(symbol, new_visited, sizexy + len(rhs) - 1):
                            return True
                return False
        if helper(Q, [], 0):
            return True
    return False
Detecting Ambiguity:
def expand(tokens_and_derivation, grammar):
    (tokens, derivation) = tokens_and_derivation
    for token_pos in range(len(tokens)):
        for rule_index in range(len(grammar)):
            rule = grammar[rule_index]
            if tokens[token_pos] == rule[0]:
                yield ((tokens[0:token_pos] + rule[1] + tokens[token_pos+1:]),
                       derivation + [rule_index])
def isambig(grammar, start, utterance):
    enumerated = [([start], [])]
    while True:
        new_enumerated = enumerated
        for u in enumerated:
            for i in expand(u, grammar):
                if not i in new_enumerated:
                    new_enumerated = new_enumerated + [i]
        if new_enumerated != enumerated:
            enumerated = new_enumerated
        else:
            break
    return len([x for x in enumerated if x[0] == utterance]) > 1
UNIT 4 – Parsing
Given a string s and a grammar G, is s in the language of G? Lexical analysis broke the string down into a stream of tokens; syntactic analysis takes that stream of tokens and checks whether it adheres to a context-free grammar. BRUTE FORCE – try all options exhaustively. Memoization is a computer science technique in which we keep a 'chart' or 'record' of previous computations and compute new values in terms of previous answers.
import timeit
t = timeit.Timer(stmt="""
chart = {}
def memofibo(n):
    if n <= 2:
        return 1
    if n-2 not in chart:
        chart[n-2] = memofibo(n-2)
    if n-1 not in chart:
        chart[n-1] = memofibo(n-1)
    return chart[n-1] + chart[n-2]
memofibo(25)""")
print t.timeit(number=100)
t2 = timeit.Timer(stmt="""
def fibo(n):
    if n <= 2:
        return 1
    return fibo(n-1) + fibo(n-2)
fibo(25)""")
print t2.timeit(number=100)
Parsing State:
S --> exp
exp --> exp + exp
exp --> exp - exp
exp --> 1
exp --> 2
input = 1 + 2
exp --> exp + ● exp [everything before the ● has been seen; everything after it has not]
After seeing the whole input:
exp --> 2 ●
exp --> exp + exp ●
S --> exp ●  Parsed!
A PARSING STATE is a rewrite rule from the grammar augmented with 1 ● on the right-hand side.
Memoization in our Parser: parse([t1, t2, ..., tN, ..., tlast])
chart[N] = all parse states we could be in after seeing t1, t2, ..., tN only!
For the grammar exp --> exp + exp, exp --> int, and the input "int + int":

N           chart[N]
0 ()        exp --> ● exp + exp;  exp --> ● int
1 (int)     exp --> int ●;  exp --> exp ● + exp
2 (int +)   exp --> exp + ● exp;  exp --> ● int
3 (int + int)  ...

We must also add a starting position, or "from" position, to our parse states: for example, "exp --> ● int, from 0" in chart[0], but "exp --> ● int, from 2" in chart[2] (seen 2 tokens).
If we can build the chart, we have solved parsing: if the input is T tokens long, and S --> exp ●, from 0, is in chart[T], then the string is in the language.
start: chart[0] contains S --> ● exp, from 0
end:   chart[T] contains S --> exp ●, from 0
MIDDLE?
Making intermediate entries:
S --> exp + ● exp, from j, in chart[i]  # seen i tokens
We are expecting to see an exp, so we find all rules exp --> something in the grammar and bring them in.
Predicting, or Computing the CLOSURE (one way to complete the parsing chart): if chart[i] has X --> a b ● c d, from j, then for all grammar rules c --> p q r we add c --> ● p q r, from i, to chart[i].
Consuming, or Shifting over the Input (another way to complete the parsing chart): if chart[i] has X --> a b ● c d, from j, and c is a terminal, we shift over it: add X --> a b c ● d, from j, to chart[i+1] IF c is the i+1-th input token.
[Diagram: parsing climbs the ladder int + int → exp + int → exp + exp → exp; generating descends it.]
Reduction: X --> a b ● We reduce by applying the rule in reverse: if we have seen "a b blah", it becomes "X blah".
Reduction Walkthrough:
T --> a B a
B --> b b
input = a b b a

N            chart[N]
0 ()         T --> ● a B a, from 0
1 (a)        T --> a ● B a, from 0;  B --> ● b b, from 1
2 (a b)      B --> b ● b, from 1
3 (a b b)    B --> b b ●, from 1;  T --> a B ● a, from 0
4 (a b b a)  T --> a B a ●, from 0
AddtoChart: The chart coded in Python: A dictionary chart where: chart[i] = [P --> ( ● P ) from 0, P --> ● ( ) from 1, ... ]
def addtochart(chart, index, state):
    if state in chart[index]:
        return False
    else:
        chart[index] += [state]
        return True
Encode Grammar: S --> P
P --> ( P )
P -->
grammar = [('S', ['P']),
           ('P', ['(', 'P', ')']),
           ('P', [])]
Encode Parsing States: X --> a b ● c d, from j:
state = ('X', ['a', 'b'], ['c', 'd'], j)
Writing Closure:
def closure(grammar, i, x, ab, cd, j):
    return [(rule[0], [], rule[1], i) for rule in grammar \
            if cd != [] and rule[0] == cd[0]]
Writing Shift:
def shift(tokens, i, x, ab, cd, j):
    if cd != [] and tokens[i] == cd[0]:
        return (x, ab + cd[:1], cd[1:], j)
Writing Reduction:
def reductions(chart, i, x, ab, cd, j):
    return [(state[0], state[1] + [x], state[2][1:], state[3]) for state \
            in chart[j] if cd == [] and state[2] != [] and state[2][0] == x]
Putting It All Together:
def parse(tokens, grammar):
    tokens += ["end_of_input_marker"]
    chart = {}
    start_rule = grammar[0]  # by convention, the first rule in the grammar
    for i in range(len(tokens)+1):
        chart[i] = []
    start_state = (start_rule[0], [], start_rule[1], 0)
    chart[0] = [start_state]
    for i in range(len(tokens)):
        while True:
            changes = False
            for state in chart[i]:
                # state == X --> a b . c d, from j
                x = state[0]
                ab = state[1]
                cd = state[2]
                j = state[3]
                next_states = closure(grammar, i, x, ab, cd, j)
                for next_state in next_states:
                    changes = addtochart(chart, i, next_state) or changes
                next_state = shift(tokens, i, x, ab, cd, j)
                if next_state != None:
                    changes = addtochart(chart, i+1, next_state) or changes
                next_states = reductions(chart, i, x, ab, cd, j)
                for next_state in next_states:
                    changes = addtochart(chart, i, next_state) or changes
            if not changes:
                break
    accepting_state = (start_rule[0], start_rule[1], [], 0)
    return accepting_state in chart[len(tokens)-1]
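The whole chart parser can be exercised end to end. This sketch restates addtochart, closure, shift, reductions, and parse from the notes in compact form (with != in place of the old <> operator, and avoiding mutation of the caller's token list) and checks the balanced-parentheses grammar:

```python
def addtochart(chart, index, state):
    if state in chart[index]:
        return False
    chart[index] += [state]
    return True

def closure(grammar, i, x, ab, cd, j):
    return [(rule[0], [], rule[1], i) for rule in grammar
            if cd != [] and rule[0] == cd[0]]

def shift(tokens, i, x, ab, cd, j):
    if cd != [] and tokens[i] == cd[0]:
        return (x, ab + cd[:1], cd[1:], j)
    return None

def reductions(chart, i, x, ab, cd, j):
    return [(s[0], s[1] + [x], s[2][1:], s[3])
            for s in chart[j]
            if cd == [] and s[2] != [] and s[2][0] == x]

def parse(tokens, grammar):
    tokens = tokens + ["end_of_input_marker"]
    chart = {i: [] for i in range(len(tokens) + 1)}
    start_rule = grammar[0]
    chart[0] = [(start_rule[0], [], start_rule[1], 0)]
    for i in range(len(tokens)):
        while True:
            changes = False
            for state in chart[i]:
                x, ab, cd, j = state
                for ns in closure(grammar, i, x, ab, cd, j):
                    changes = addtochart(chart, i, ns) or changes
                ns = shift(tokens, i, x, ab, cd, j)
                if ns is not None:
                    changes = addtochart(chart, i + 1, ns) or changes
                for ns in reductions(chart, i, x, ab, cd, j):
                    changes = addtochart(chart, i, ns) or changes
            if not changes:
                break
    accepting = (start_rule[0], start_rule[1], [], 0)
    return accepting in chart[len(tokens) - 1]

grammar = [('S', ['P']),
           ('P', ['(', 'P', ')']),
           ('P', [])]
print(parse(['(', '(', ')', ')'], grammar))  # True  (balanced)
print(parse(['(', ')', ')'], grammar))       # False (unbalanced)
```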
Parse Trees: We also need to produce parse trees to get their meaning and interpret HTML and JavaScript programs. The format we are going to use for our parse trees is nested tuples.
def p_exp_number(p):
    'exp : NUMBER'           # the parse rule: lhs "exp", rhs "NUMBER"
    p[0] = ("number", p[1])  # p[0] is the returned parse tree;
                             # p[1], p[2], ... are the rhs parse trees
def p_exp_not(p):
    'exp : NOT exp'          # p[1] is NOT, p[2] is the exp subtree
    p[0] = ("not", p[2])
Parsing Tags:
def p_elt_tag(p):
    'elt : LANGLE WORD tag_args RANGLE html LANGLESLASH WORD RANGLE'
    p[0] = ('tag-element', p[2], p[3], p[5], p[7])
Parsing JavaScript:
def p_exp_binop(p):
    """exp : exp PLUS exp
           | exp MINUS exp
           | exp TIMES exp"""
    p[0] = ('binop', p[1], p[2], p[3])
Setting Associativity and Precedence: the issues that need to be resolved are associativity and precedence.
precedence = (
    # lower precedence
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    # higher precedence
)
Parsing JavaScript Statements:
import ply.yacc as yacc
import ply.lex as lex
import jstokens              # use our JavaScript lexer
from jstokens import tokens  # use our JavaScript tokens
start = 'js'  # the start symbol in our grammar
def p_js(p):
    'js : element js'
    p[0] = [p[1]] + p[2]
def p_js_empty(p):
    'js : '
    p[0] = []
def p_element_function(p):
    'element : FUNCTION IDENTIFIER LPAREN optparams RPAREN compoundstmt'
    p[0] = ('function', p[2], p[4], p[6])
def p_element_statement(p):
    'element : stmt SEMICOLON'
    p[0] = ('stmt', p[1])
def p_optparams(p):
    'optparams : params'
    p[0] = p[1]
def p_optparams_empty(p):
    'optparams : '
    p[0] = []
def p_params(p):
    'params : IDENTIFIER COMMA params'
    p[0] = [p[1]] + p[3]
def p_params_last(p):
    'params : IDENTIFIER'
    p[0] = [p[1]]
def p_compoundstmt(p):
    'compoundstmt : LBRACE statements RBRACE'
    p[0] = p[2]
def p_statements(p):
    'statements : stmt SEMICOLON statements'
    p[0] = [p[1]] + p[3]
def p_statements_empty(p):
    'statements : '
    p[0] = []
def p_stmt_if_then(p):
    'stmt : IF exp compoundstmt'
    p[0] = ('if-then', p[2], p[3])
def p_stmt_if_then_else(p):
    'stmt : IF exp compoundstmt ELSE compoundstmt'
    p[0] = ('if-then-else', p[2], p[3], p[5])
def p_stmt_assignment(p):
    'stmt : IDENTIFIER EQUAL exp'
    p[0] = ('assign', p[1], p[3])
def p_stmt_return(p):
    'stmt : RETURN exp'
    p[0] = ('return', p[2])
def p_stmt_var(p):
    'stmt : VAR IDENTIFIER EQUAL exp'
    p[0] = ('var', p[2], p[4])
def p_stmt_exp(p):
    'stmt : exp'
    p[0] = ('exp', p[1])
# For now, we will assume that there is only one type of expression.
def p_exp_identifier(p):
    'exp : IDENTIFIER'
    p[0] = ("identifier", p[1])
jslexer = lex.lex(module=jstokens)
jsparser = yacc.yacc()
jslexer.input(input_string)
parse_tree = jsparser.parse(input_string, lexer=jslexer)
print parse_tree
Parsing JavaScript Expressions:
import ply.yacc as yacc
import ply.lex as lex
import jstokens              # use our JavaScript lexer
from jstokens import tokens  # use our JavaScript tokens
start = 'exp'  # we'll start at expression this time
precedence = (
('left', 'OROR'),
('left', 'ANDAND'),
('left', 'EQUALEQUAL'),
('left', 'LT', 'GT', 'LE', 'GE'),
('left', 'PLUS', 'MINUS'),
('left', 'TIMES', 'DIVIDE', 'MOD'),
('right', 'NOT')
)
def p_exp_identifier(p):
    'exp : IDENTIFIER'
    p[0] = ("identifier", p[1])
def p_exp_number(p):
    'exp : NUMBER'
    p[0] = ('number', p[1])
def p_exp_string(p):
    'exp : STRING'
    p[0] = ('string', p[1])
def p_exp_true(p):
    'exp : TRUE'
    p[0] = ('true', p[1])
def p_exp_false(p):
    'exp : FALSE'
    p[0] = ('false', p[1])
def p_exp_not(p):
    'exp : NOT exp'
    p[0] = ('not', p[2])
def p_exp_parens(p):
    'exp : LPAREN exp RPAREN'
    p[0] = p[2]
def p_exp_lambda(p):
    'exp : FUNCTION LPAREN optparams RPAREN compoundstmt'
    p[0] = ("function", p[3], p[5])
def p_exp_binop(p):
    """exp : exp OROR exp
           | exp ANDAND exp
           | exp EQUALEQUAL exp
           | exp MOD exp
           | exp LT exp
           | exp GT exp
           | exp LE exp
           | exp GE exp
           | exp PLUS exp
           | exp MINUS exp
           | exp TIMES exp
           | exp DIVIDE exp"""
    p[0] = ('binop', p[1], p[2], p[3])
def p_exp_call(p):
    'exp : IDENTIFIER LPAREN optargs RPAREN'
    p[0] = ('call', p[1], p[3])
def p_optargs(p):
    'optargs : args'
    p[0] = p[1]
def p_optargs_empty(p):
    'optargs : '
    p[0] = []
def p_args(p):
    'args : exp COMMA args'
    p[0] = [p[1]] + p[3]
def p_args_last(p):
    'args : exp'
    p[0] = [p[1]]
jslexer = lex.lex(module=jstokens)
jsparser = yacc.yacc()
jslexer.input(input_string)
parse_tree = jsparser.parse(input_string, lexer=jslexer)
print parse_tree
Types: Numbers (5, 1, 3.14, 3189, 514) support the operations + - * /.
Strings ("a", "hello", 'world') support len, slicing such as [1:-1], and + (concatenation).
Lists ([ ], [1, 2, 3], ["a", "b"]) support len and + as well.
Each operation has a different meaning for different types of data.
Rendering HTML, e.g.: Nelson Mandela <b> was elected </b> democratically.
becomes the calls graphics.word("Nelson"), graphics.word("Mandela"), graphics.begintag("b", { }), graphics.word("was"), graphics.word("elected"), graphics.endtag(), graphics.word("democratically"),
which displays: Nelson Mandela was elected democratically. [with "was elected" in bold]
UNIT 5 – Interpreting
A bug is just an instance where the program’s meaning is different from its specification. But in practice a lot of the time the mistake is actually with the specification. Regardless of whether the problem is with the source code or the specification, understanding what code means in context is critical to figuring out if it’s right or wrong. Interpreters: An interpreter finds the meaning of a program by traversing its parse tree. String of HTML + JavaScript --> Break it down to words (Lexical
Analysis) --> Parse those into a tree (Syntactic Analysis) --> Walk
that tree and understand it (Semantics or Interpreting).
Syntax vs. Semantics: Lexing and parsing deal with the form of an utterance. We now turn our attention to semantics, the meaning of an utterance. A well-formed sentence in a natural language can be "meaningless" or "hard to interpret". Similarly, a syntactically valid program can lead to a run-time error if we try to apply the wrong sort of operation to the wrong sort of thing (e.g., 3 + "hello"). Semantic Analysis: The process of looking at a program’s source code and trying to see if it’s going to be well-behaved or not is known as type checking or semantic analysis. [One goal of semantic analysis is to notice and rule out bad programs (i.e., programs that will apply the wrong sort of operation to the wrong sort of object). This is often called type checking.] Types: A type is a set of similar objects (e.g., number or string or list) with an associated set of valid operations (e.g., addition or length).
Graphics: Render a webpage. We’ll use a library to do that for us.
Writing an Interpreter: All there is in HTML is word-elements, tag-elements and javascript-elements, and we'll see how to handle the first two here.

import graphics

def interpret(trees): # Hello, friend
    for tree in trees: # Hello,
        # ("word-element","Hello")
        nodetype = tree[0] # "word-element"
        if nodetype == "word-element":
            graphics.word(tree[1])
        elif nodetype == "tag-element":
            # <b>Strong text</b>
            tagname = tree[1] # b
            tagargs = tree[2] # []
            subtrees = tree[3] # ...Strong Text!...
            closetagname = tree[4] # b
            if tagname != closetagname:
                graphics.warning('Tag mismatch!')
            else:
                graphics.begintag(tagname, tagargs)
                interpret(subtrees)
                graphics.endtag()

# Note that graphics.initialize and finalize will only work surrounding a call to interpret
graphics.initialize() # Enables display of output.
interpret([("word-element", "Hello,"), ("tag-element", 'b', [], [('word-element', 'World!')], 'b')])
graphics.finalize()

Arithmetic: For the javascript-elements, we'll need to interpret the code down to a string and then call graphics.word() on that string. However, JavaScript is semantically richer than HTML, so the process of interpretation won't be that simple. We are going to write a recursive procedure to interpret JavaScript arithmetic expressions. The procedure will walk over the parse tree of the expression; this is sometimes called evaluation.
# Write an eval_exp procedure to interpret JavaScript arithmetic expressions.
# Only handle +, - and numbers for now.
def eval_exp(tree):
    # ("number" , "5")
    # ("binop" , ... , "+", ... )
    nodetype = tree[0]
    if nodetype == "number":
        return int(tree[1])
    elif nodetype == "binop":
        left_child = tree[1]
        operator = tree[2]
        right_child = tree[3]
        left_value = eval_exp(left_child)
        right_value = eval_exp(right_child)
        if operator == "+":
            return left_value + right_value
        elif operator == "-":
            return left_value - right_value
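A quick way to exercise the evaluator is to hand it the tree for (1 + 2) - 3; this sketch restates the procedure so the block is self-contained:

```python
# Evaluating the parse tree for (1 + 2) - 3 (sketch of the course procedure):
def eval_exp(tree):
    nodetype = tree[0]
    if nodetype == "number":
        return int(tree[1])
    elif nodetype == "binop":
        left_value = eval_exp(tree[1])
        operator = tree[2]
        right_value = eval_exp(tree[3])
        if operator == "+":
            return left_value + right_value
        elif operator == "-":
            return left_value - right_value

tree = ("binop",
        ("binop", ("number", "1"), "+", ("number", "2")),
        "-",
        ("number", "3"))
print(eval_exp(tree))  # -> 0
```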
Context: We need to know the values of variables (the context) to evaluate an expression. The meaning of x+2 depends on the meaning of x (the current state of x). State: The state of a program execution is a mapping from variable names to values. Evaluating an expression requires us to know the current state. To evaluate x+2, we'll keep around a mapping {'x': 3} (it will get more complicated later). This mapping is called the state. Variable Lookup:

# ("binop", ("identifier","x"), "+", ("number","2"))
def eval_exp(tree, environment):
    nodetype = tree[0]
    if nodetype == "number":
        return int(tree[1])
    elif nodetype == "binop":
        left_value = eval_exp(tree[1], environment)
        operator = tree[2]
        right_value = eval_exp(tree[3], environment)
        if operator == "+":
            return left_value + right_value
        elif operator == "-":
            return left_value - right_value
    elif nodetype == "identifier":
        variable_name = tree[1]
        return env_lookup(environment, variable_name)

Control Flow: Python and JavaScript have conditional statements like if; we say that such statements can change the flow of control through the program. Program elements that can change the flow of control, such as if or while or return, are often called statements. Typically statements contain expressions, but not the other way around. Evaluating Statements:

def eval_stmts(tree, environment):
    stmttype = tree[0]
    if stmttype == "assign":
        # ("assign", "x", ("binop", ..., "+", ...)) <=== x = ... + ...
        variable_name = tree[1]
        right_child = tree[2]
        new_value = eval_exp(right_child, environment)
        env_update(environment, variable_name, new_value)
    elif stmttype == "if-then-else": # if x < 5 then A;B; else C;D;
        conditional_exp = tree[1] # x < 5
        then_stmts = tree[2] # A;B;
        else_stmts = tree[3] # C;D;
        if eval_exp(conditional_exp, environment):
            eval_stmts(then_stmts, environment)
        else:
            eval_stmts(else_stmts, environment)
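A toy round trip through these pieces. This is a sketch, not the course code: the environment is a flat dict, env_lookup/env_update are minimal stand-ins (the chained versions come later), and the branch statement lists are walked explicitly:

```python
def env_lookup(environment, variable_name):
    return environment.get(variable_name)

def env_update(environment, variable_name, new_value):
    environment[variable_name] = new_value

def eval_exp(tree, environment):
    nodetype = tree[0]
    if nodetype == "number":
        return int(tree[1])
    elif nodetype == "binop":
        left_value = eval_exp(tree[1], environment)
        operator = tree[2]
        right_value = eval_exp(tree[3], environment)
        if operator == "+":
            return left_value + right_value
        elif operator == "-":
            return left_value - right_value
    elif nodetype == "identifier":
        return env_lookup(environment, tree[1])

def eval_stmts(tree, environment):
    stmttype = tree[0]
    if stmttype == "assign":
        env_update(environment, tree[1], eval_exp(tree[2], environment))
    elif stmttype == "if-then-else":
        branch = tree[2] if eval_exp(tree[1], environment) else tree[3]
        for stmt in branch:
            eval_stmts(stmt, environment)

env = {"x": 3}
# x = x + 2
eval_stmts(("assign", "x",
            ("binop", ("identifier", "x"), "+", ("number", "2"))), env)
print(env["x"])  # -> 5
# if x then y = 1; else y = 2;   (5 is truthy, so the then-branch runs)
eval_stmts(("if-then-else", ("identifier", "x"),
            [("assign", "y", ("number", "1"))],
            [("assign", "y", ("number", "2"))]), env)
print(env["y"])  # -> 1
```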
def eval_exp(exp, env):
    etype = exp[0]
    if etype == "number":
        return float(exp[1])
    elif etype == "string":
        return exp[1]
    elif etype == "true":
        return True
    elif etype == "false":
        return False
    elif etype == "not":
        return not eval_exp(exp[1], env)

def env_update(env, vname, value):
    env[vname] = value

Scope: We use the term scope to refer to the portion of a program where a variable has a particular value. So the environment CANNOT be a flat mapping {}. Identifiers and Storage Places: Because the value of a variable can change, we will use explicit storage locations to track the current values of variables.
Environments: There is a special global environment that can hold variable values. Other environments have parent pointers to keep track of nesting or scoping. Environments hold storage locations and map variables to values. Chained Environments: The process upon a function call is 1. Create a new environment. Its parent is the current environment. 2. Create storage places in the new environment for each formal parameter. 3. Fill in those places with the values of the actual arguments. 4. Evaluate the function body in the new environment.
Environment diagram (slide): globals such as x = "outside x" and y = "outside y" live in the Global environment. For the example

def myfun(x):
    print x
    print y
myfun(y+5)

with y : 2 in the global environment, the call creates a new environment holding x : 7, with a parent pointer back to the global environment; print y is resolved through that parent.
greeting = "hola"
def makegreeter(greeting):
    def greeter(person):
        print greeting + " " + person
    return greeter
sayhello = makegreeter("hello from uttar pradesh")
sayhello("lucknow") # hello from uttar pradesh lucknow

Environment diagram (slide): the Global environment maps greeting : "hola", makegreeter : ..., sayhello : ...; the makegreeter environment maps greeting : "hello from uttar pradesh" and greeter : ...; the greeter environment maps person : "lucknow", with parent pointers chaining back toward Global.
Environment needs: 1. Map variables to values. 2. Point to the parent environment. So we'll encode an environment as (parent_pointer, dictionary).

def env_lookup(vname, env):
    # env is (parent, dictionary)
    if vname in env[1]:
        return env[1][vname]
    elif env[0] is None:
        return None
    else:
        return env_lookup(vname, env[0])

def env_update(vname, value, env):
    if vname in env[1]:
        env[1][vname] = value
    elif env[0] is not None:
        env_update(vname, value, env[0])
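The parent pointer is what makes lookups in a nested scope see enclosing bindings, and what makes updates land in the scope that defines the variable. A self-contained check (environment contents here are illustrative):

```python
# env is (parent, dictionary), as in the notes above.
def env_lookup(vname, env):
    if vname in env[1]:
        return env[1][vname]
    elif env[0] is None:
        return None
    else:
        return env_lookup(vname, env[0])

def env_update(vname, value, env):
    if vname in env[1]:
        env[1][vname] = value
    elif env[0] is not None:
        env_update(vname, value, env[0])

global_env = (None, {"y": "outside y"})
local_env = (global_env, {"x": "local x"})

print(env_lookup("x", local_env))  # found locally
print(env_lookup("y", local_env))  # found via the parent pointer
env_update("y", "updated y", local_env)
print(global_env[1]["y"])          # the write reached the defining scope
```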
Catching Errors: Modern programming languages use exceptions to notice and handle run-time errors. "try-catch" or "try-except" blocks are the syntax for handling such exceptions.

try:
    print "hello"
    1 / 0
except Exception as problem:
    # only runs if the guarded block raises an error
    print "didn't work"
    print problem
Frames:

# Function calls: new environments, catch return values.
# "return" will throw an exception.
def eval_stmt(tree, environment):
    stmttype = tree[0]
    if stmttype == "call": # ("call", "sqrt", [("number","2")])
        fname = tree[1] # "sqrt"
        args = tree[2] # [ ("number", "2") ]
        fvalue = env_lookup(fname, environment)
        if fvalue[0] == "function":
            # We'll make a promise to ourselves:
            # ("function", params, body, env)
            fparams = fvalue[1] # ["x"]
            fbody = fvalue[2]
            fenv = fvalue[3]
            if len(fparams) != len(args):
                print "ERROR: wrong number of args"
            else:
                new_env = (fenv, dict((fparams[i], \
                    eval_exp(args[i], environment)) for i in range(len(args))))
                try:
                    eval_stmts(fbody, new_env)
                    return None
                except Exception as return_value:
                    return return_value
        else:
            print "ERROR: call to non-function"
    elif stmttype == "return":
        retval = eval_exp(tree[1], environment)
        raise Exception(retval)
    elif stmttype == "exp":
        eval_exp(tree[1], environment)

def env_lookup(vname, env):
    if vname in env[1]:
        return (env[1])[vname]
    elif env[0] is None:
        return None
    else:
        return env_lookup(vname, env[0])

def env_update(vname, value, env):
    if vname in env[1]:
        (env[1])[vname] = value
    elif env[0] is not None:
        env_update(vname, value, env[0])

def eval_exp(exp, env):
    etype = exp[0]
    if etype == "number":
        return float(exp[1])
    elif etype == "binop":
        a = eval_exp(exp[1], env)
        op = exp[2]
        b = eval_exp(exp[3], env)
        if op == "*":
            return a * b
    elif etype == "identifier":
        vname = exp[1]
        value = env_lookup(vname, env)
        if value is None:
            print "ERROR: unbound variable " + vname
        else:
            return value

def eval_stmts(stmts, env):
    for stmt in stmts:
        eval_stmt(stmt, env)

sqrt = ("function", ("x"), (("return", ("binop", ("identifier","x"), \
    "*", ("identifier","x"))),), {})
environment = (None, {"sqrt": sqrt})
print eval_stmt(("call", "sqrt", [("number","2")]), environment)

Function Definitions:
function myfun(x) {
    return x+1;
}

Here myfun is the fname, (x) gives the fparams, and return x+1; is the fbody. A function definition is stored as env[fname] = ("function", fparams, fbody, fenv), where fenv is the environment we were in when the function was defined.

def eval_elt(tree, env):
    elttype = tree[0]
    if elttype == "function":
        fname = tree[1]
        fparams = tree[2]
        fbody = tree[3]
        fvalue = ("function", fparams, fbody, env)
        add_to_env(env, fname, fvalue)
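The "return raises an exception" trick from the frames code above can be isolated in a few lines. The names here (ReturnValue, eval_body, call) are illustrative stand-ins, not course identifiers; the course code raises a plain Exception, while this sketch uses a dedicated exception class:

```python
# "return" unwinds the function body by raising; the call site catches it.
class ReturnValue(Exception):
    def __init__(self, value):
        self.value = value

def eval_body(stmts):
    for stmt in stmts:
        if stmt[0] == "return":
            raise ReturnValue(stmt[1])  # unwind back to the call site
        # ... other statement types would be handled here ...

def call(body_stmts):
    try:
        eval_body(body_stmts)
        return None                     # fell off the end: no return value
    except ReturnValue as r:
        return r.value

print(call([("return", 42)]))  # -> 42
print(call([]))                # -> None
```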
Double Edged Sword: We can simulate JavaScript programs with our interpreter written in Python. That means that anything that can be done in JavaScript could be done in Python as well. It turns out that JavaScript could also simulate Python. So they are equally powerful! (Turing Complete. Turing Machine – a mathematical model of computation.) Natural Language Power: While most computer languages are equivalent (in that any computation that can be done in one can also be done in another), it is debated whether the same is true for natural languages. Infinite Loops: Computer programs can contain infinite loops. A program either terminates (halts) in finite time or loops forever. We would like to tell if a program loops forever or not. It is provably impossible to write a procedure that can definitely tell if every other procedure loops forever or not.
This Sentence is False: If tsif halts, then it loops forever. If tsif loops forever, then it halts. Both cases lead to a contradiction; therefore, halts() cannot exist. Adding a While Loop to the JavaScript Interpreter:

def eval_while(while_stmt, env):
    conditional = while_stmt[1]
    loop_body = while_stmt[2]
    while eval_exp(conditional, env):
        eval_stmts(loop_body, env)

or, equivalently, with recursion:

def eval_while(while_stmt, env):
    conditional_exp = while_stmt[1]
    loop_body = while_stmt[2]
    if eval_exp(conditional_exp, env):
        eval_stmts(loop_body, env)
        eval_while(while_stmt, env)
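The recursive version can be checked on a toy loop. This sketch supplies minimal stand-ins for eval_exp and eval_stmts (flat dict environment, only the operators the example needs):

```python
# Toy check of the recursive eval_while: while (x < 3) { x = x + 1; }
def eval_exp(exp, env):
    if exp[0] == "number":
        return float(exp[1])
    elif exp[0] == "identifier":
        return env[exp[1]]
    elif exp[0] == "binop" and exp[2] == "<":
        return eval_exp(exp[1], env) < eval_exp(exp[3], env)
    elif exp[0] == "binop" and exp[2] == "+":
        return eval_exp(exp[1], env) + eval_exp(exp[3], env)

def eval_stmts(stmts, env):
    for stmt in stmts:
        if stmt[0] == "assign":
            env[stmt[1]] = eval_exp(stmt[2], env)

def eval_while(while_stmt, env):
    conditional_exp = while_stmt[1]
    loop_body = while_stmt[2]
    if eval_exp(conditional_exp, env):
        eval_stmts(loop_body, env)
        eval_while(while_stmt, env)  # loop by recursing on the same node

env = {"x": 0.0}
loop = ("while",
        ("binop", ("identifier", "x"), "<", ("number", "3")),
        [("assign", "x", ("binop", ("identifier", "x"), "+", ("number", "1")))])
eval_while(loop, env)
print(env["x"])  # -> 3.0
```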
UNIT 6 – Building a Web Browser Web Browser Architecture: Our HTML lexer, parser and interpreter will drive the main process; our JavaScript lexer, parser and interpreter will serve as subroutines.
1. Web page is lexed and parsed. 2. HTML interpreter walks the Abstract Syntax Tree, and calls the JavaScript interpreter. 3. JavaScript code calls write(). 4. JavaScript interpreter stores text from write(). 5. HTML interpreter calls graphics library. 6. Final image of web page is created. Fitting Them Together: We change our HTML lexer to recognize embedded JavaScript fragments as single tokens (we treat JavaScript as a single HTML token). We'll pass the contents of those tokens to our JavaScript lexer, parser and interpreter later.
def t_javascript(token):
    # Several backslashes may be unnecessary, but they are there to
    # make sure that the r.e. will be interpreted correctly in any case.
    # This is called defensive programming; it is more commonly invoked
    # when dealing with security or correctness requirements.
    r'\<script\ type=\"text\/javascript\"\>'
    token.lexer.code_start = token.lexer.lexpos
    token.lexer.begin("javascript")

def t_javascript_end(token):
    r'\<\/script\>'
    token.value = token.lexer.lexdata[token.lexer.code_start: \
        token.lexer.lexpos-9]
    token.type = 'JAVASCRIPT'
    token.lexer.lineno += token.value.count('\n')
    token.lexer.begin('INITIAL')
    return token
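The magic number 9 in lexpos-9 is len("</script>"): the lexer position sits just past the closing tag, so backing up 9 characters keeps only the code between the tags. A standalone check of that slicing idea (the example string is illustrative, no PLY needed):

```python
# Why "lexpos - 9": 9 == len("</script>"), so the slice drops the closing tag.
data = '<script type="text/javascript">write(1);</script>'
code_start = data.index('>') + 1        # just past the opening tag
end = len(data) - len('</script>')      # back up 9 characters from the end
print(data[code_start:end])  # -> write(1);
```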
# tsif, the "This Sentence is False" program from the halting-problem
# argument above:
def tsif():
    if halts(tsif):
        x = 0
        while True:
            x = x + 1
    else:
        return 0
Extending our HTML Grammar: We extend our HTML parser to handle our special token representing embedded JavaScript.

def p_element_javascript(p):
    'element : JAVASCRIPT'
    p[0] = ("javascript-element", p[1])
HTML Interpreter on JavaScript Elements:
def interpret(trees):
    for tree in trees:
        treetype = tree[0]
        if treetype == "word-element":
            graphics.word(tree[1])
        elif treetype == "tag-element":
            ...
        elif treetype == "javascript-element":
            jstext = tree[1]
            jslexer = lex.lex(module=jstokens)
            jsparser = yacc.yacc(module=jsgrammar)
            jstree = jsparser.parse(jstext, lexer=jslexer)
            result = jsinterp.interpret(jstree)
            graphics.word(result)
JavaScript Output: A JavaScript program may contain zero, one or many calls to write(). We will use environments to capture the output of a JavaScript program. Assume every call to write appends to the special “javascript output” variable in the global environment.
def interpret(trees):
    global_env = (None, {"javascript output": ""})
    for elt in trees:
        eval_elt(elt, global_env)
    return global_env[1]["javascript output"]
JavaScript Interpreter, Updating Output:
def eval_exp(tree, env):
    exptype = tree[0]
    if exptype == "call":
        fname = tree[1]
        fargs = tree[2]
        fvalue = env_lookup(fname, env)
        if fname == "write":
            argval = eval_exp(fargs[0], env)
            output_sofar = env_lookup("javascript output", env)
            env_update("javascript output", \
                output_sofar + str(argval), env)
            return None
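The write() bookkeeping can be demonstrated on its own. This sketch uses a flat dict in place of the chained (parent, dictionary) environments, which is enough to show the output string accumulating:

```python
# write() appends to the special "javascript output" variable.
def eval_exp(tree, env):
    if tree[0] == "number":
        return float(tree[1])
    elif tree[0] == "call" and tree[1] == "write":
        argval = eval_exp(tree[2][0], env)
        env["javascript output"] = env["javascript output"] + str(argval)
        return None

global_env = {"javascript output": ""}
eval_exp(("call", "write", [("number", "1")]), global_env)
eval_exp(("call", "write", [("number", "2")]), global_env)
print(global_env["javascript output"])  # -> 1.02.0
```

Note the 1.0 and 2.0: numbers are stored as floats (as in the interpreter's "number" case), so str() renders them with a decimal point.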
Debugging: A good test case gives us confidence that a program implementation adheres to its specification. In this situation, a good test case reveals a bug. Testing: We use testing to gain confidence that an implementation (a program) adheres to its specification (the task at hand). If a program accepts an infinite set of inputs, testing alone cannot prove the program's correctness. Software maintenance (i.e., testing, debugging, refactoring) carries a huge cost. Testing In Depth: When developing a project, there are two ways to go: either plan and reason about the implementation in advance and then write the code with high confidence that it will be free of bugs, or, because of (time) constraints, just implement it and then test the implementation. To test the implementation, we develop test cases (code that uses the program we would like to test). If we observe a bug, we comment out lines of the test file (fault localization), going back and forth with commenting/uncommenting to see if it still "breaks", until we pinpoint the bug. Anonymous Functions in our JavaScript Interpreter:

def eval_exp(tree, env):
    exptype = tree[0]
    if exptype == "function":
        # function(x,y) { return x+y; }
        fparams = tree[1]
        fbody = tree[2]
        # For an anonymous function, we don't add it to the
        # environment unless the user assigns it.
        return ("function", fparams, fbody, env)
Optimization: An optimization improves the performance of a program while retaining its meaning (i.e., without changing the output). Implementing Optimizations: 1. Think of optimizations (e.g., x = x + 0, x = x * 1). 2. Transform the parse tree (directly). Note: Replacing an expensive multiplication with a cheaper addition is an instance of strength reduction. Optimization Timing: In this class we will (optionally) perform optimization after parsing but before interpreting. Our optimizer takes a parse tree as input and returns a (simpler) parse tree as output.

Program Text --Lexing--> Tokens --Parsing--> Tree --Optimization (optional)--> Tree (simpler) --Interpreting--> Result (meaning)
def optimize(tree):
    etype = tree[0]
    if etype == "binop":
        a = tree[1]
        op = tree[2]
        b = tree[3]
        if op == "*" and b == ("number", "1"):
            return a
        if op == "*" and b == ("number", "0"):
            return ("number", "0") # or return b
        if op == "+" and b == ("number", "0"):
            return a
    return tree
Rebuilding The Parse Tree: We desire an optimizer that is recursive. We should optimize the child nodes of a parse tree before optimizing the parent nodes. 1. Recursive calls, 2. Look for patterns, 3. Done.

def optimize(tree): # Expression trees only
    etype = tree[0]
    if etype == "binop":
        a = optimize(tree[1])
        op = tree[2]
        b = optimize(tree[3])
        if op == "*" and b == ("number", "1"):
            return a
        elif op == "*" and b == ("number", "0"):
            return ("number", "0")
        elif op == "+" and b == ("number", "0"):
            return a
        return ("binop", a, op, b) # keep the optimized children
    return tree

Wrap Up:
The parse tree for 5 * 1, ("binop", ("number", "5"), "*", ("number", "1")), optimizes to ("number", "5"). The course pipeline in review:

Lexing – regular expressions, finite state machines
Parsing – context-free grammars, dynamic programming / parse trees
Optimizing – must retain meaning
Interpreting – walks the A.S.T. recursively
Debugging – gain confidence
HTML embedded in JavaScript embedded in HTML embedded in JavaScript...:

import ply.lex as lex
import ply.yacc as yacc
import graphics as graphics
import jstokens
import jsgrammar
import jsinterp
import htmltokens
import htmlgrammar

htmllexer = lex.lex(module=htmltokens)
htmlparser = yacc.yacc(module=htmlgrammar, tabmodule="parsetabhtml")
jslexer = lex.lex(module=jstokens)
jsparser = yacc.yacc(module=jsgrammar, tabmodule="parsetabjs")

def interpret(ast):
    for node in ast:
        nodetype = node[0]
        if nodetype == "word-element":
            graphics.word(node[1])
        elif nodetype == "tag-element":
            tagname = node[1]
            tagargs = node[2]
            subast = node[3]
            closetagname = node[4]
            if tagname != closetagname:
                graphics.warning("(mismatched " + \
                    tagname + " " + closetagname + ")")
            else:
                graphics.begintag(tagname, tagargs)
                interpret(subast)
                graphics.endtag()
        elif nodetype == "javascript-element":
            jstext = node[1]
            jsast = jsparser.parse(jstext, lexer=jslexer)
            result = jsinterp.interpret(jsast)
            htmlast = htmlparser.parse(result, lexer=htmllexer)
            interpret(htmlast)

webpage = """ ... """
htmlast = htmlparser.parse(webpage, lexer=htmllexer)
graphics.initialize()
interpret(htmlast)
graphics.finalize()
Bending Numbers: # Write a procedure optimize(exp) that takes a JavaScript expression AST
# node and returns a new, simplified JavaScript expression AST. You must
# handle:
#
# X * 1 == 1 * X == X for all X
# X * 0 == 0 * X == 0 for all X
# X + 0 == 0 + X == X for all X
# X - X == 0 for all X
#
# and constant folding for +, - and * (e.g., replace 1+2 with 3)
def optimize(exp):
    etype = exp[0]
    if etype == "binop":
        a = optimize(exp[1])
        op = exp[2]
        b = optimize(exp[3])
        if op == "+" and a == ("number", 0):
            return b
        elif op == "+" and b == ("number", 0):
            return a
        if op == "*" and a == ("number", 1):
            return b
        elif op == "*" and b == ("number", 1):
            return a
        if op == "*" and (a == ("number", 0) or b == ("number", 0)):
            return ("number", 0)
        if op == "-" and a == b:
            return ("number", 0)
        if a[0] == b[0] == "number":
            if op == "+":
                return ("number", a[1] + b[1])
            if op == "-":
                return ("number", a[1] - b[1])
            if op == "*":
                return ("number", a[1] * b[1])
        return ("binop", a, op, b)
    return exp
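Because children are optimized first, the rules compose: in the example below, constant folding collapses 1 + 2 and the X - X rule collapses x - x, after which the multiply-by-zero rule finishes the job. (Number payloads are ints here, matching this homework's encoding.)

```python
def optimize(exp):
    if exp[0] == "binop":
        a = optimize(exp[1])
        op = exp[2]
        b = optimize(exp[3])
        if op == "+" and a == ("number", 0): return b
        if op == "+" and b == ("number", 0): return a
        if op == "*" and a == ("number", 1): return b
        if op == "*" and b == ("number", 1): return a
        if op == "*" and (a == ("number", 0) or b == ("number", 0)):
            return ("number", 0)
        if op == "-" and a == b:
            return ("number", 0)
        if a[0] == b[0] == "number":
            if op == "+": return ("number", a[1] + b[1])
            if op == "-": return ("number", a[1] - b[1])
            if op == "*": return ("number", a[1] * b[1])
        return ("binop", a, op, b)
    return exp

# (1 + 2) * (x - x)  simplifies all the way down to 0:
tree = ("binop",
        ("binop", ("number", 1), "+", ("number", 2)),
        "*",
        ("binop", ("identifier", "x"), "-", ("identifier", "x")))
print(optimize(tree))  # -> ('number', 0)
```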
The Living and the Dead: # Those lines can be safely removed because they do not compute a value
# that is used later. We say that a variable is LIVE if the value it holds
# may be needed in the future. More formally, a variable is LIVE if its
# value may be read before the next time it is overwritten. Whether or not
# a variable is LIVE depends on where you are looking in the program, so
# most formally we say a variable is live at some point P if it may be read
# before being overwritten after P.
# function myfun(a,b,c,d) {
# a = 1;
# # LIVE: nothing
# b = 2;
# # LIVE: b
# c = 3;
# # LIVE: c, b
# d = 4;
# # LIVE: c, b
# a = 5;
# # LIVE: a, c, b
# d = c + b;
# # LIVE: a, d
# return (a + d);
# }
#
# Once we know which variables are LIVE, we can now remove assignments to
# variables that will never be read later. Such assignments are called DEAD
# code. Formally, given an assignment statement "X = ...", if "X" is not
# live after that statement, the whole statement can be removed.
# In this assignment, you will write an optimizer that removes dead code.
# For simplicity, we will only consider sequences of assignment statements
# (once we can optimize those, we could weave together a bigger optimizer
# that handles both branches of if statements, and so on, but we'll just do
# simple lists of assignments for now).
# Write a procedure removedead(fragment,returned). "fragment" is encoded
# as above. "returned" is a list of variables returned at the end of the
# fragment (and thus LIVE at the end of it).
#
# Hint 1: One way to reverse a list is [::-1]
# >>> [1,2,3][::-1]
# [3, 2, 1]
def removedead(fragment, returned):
    old_fragment = fragment
    new_fragment = []
    live = returned
    for stmt in fragment[::-1]:
        if stmt[0] in live:
            new_fragment = [stmt] + new_fragment
            live = [x for x in live if x != stmt[0]]
            live = live + stmt[1]
    if new_fragment == old_fragment:
        return new_fragment
    else:
        return removedead(new_fragment, returned)
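Running this on the myfun example from the comments checks out. One assumption, inferred from how the code indexes stmt[0] and stmt[1]: each statement is encoded as (assigned_variable, [variables_read]).

```python
def removedead(fragment, returned):
    old_fragment = fragment
    new_fragment = []
    live = returned                      # live at the end of the fragment
    for stmt in fragment[::-1]:          # walk backwards, tracking liveness
        if stmt[0] in live:
            new_fragment = [stmt] + new_fragment
            live = [x for x in live if x != stmt[0]]
            live = live + stmt[1]
    if new_fragment == old_fragment:     # fixpoint: nothing more to remove
        return new_fragment
    else:
        return removedead(new_fragment, returned)

# myfun's body, encoded as (assigned_variable, [variables_read]):
fragment = [("a", []),          # a = 1;     dead: overwritten before any read
            ("b", []),          # b = 2;
            ("c", []),          # c = 3;
            ("d", []),          # d = 4;     dead: overwritten by d = c + b
            ("a", []),          # a = 5;
            ("d", ["c", "b"])]  # d = c + b;
print(removedead(fragment, ["a", "d"]))  # a = 1 and d = 4 are gone
```

The recursion matters: removing one dead assignment can expose another, so the procedure repeats until a pass changes nothing.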
Find all the Subsets of a Set:

def all_subsets(lst):
    pset = [[]]
    for elem in lst:
        pset += [x + [elem] for x in pset]
    return pset
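Each element doubles the set of subsets (every existing subset either gains the element or doesn't), giving 2^n subsets in total:

```python
def all_subsets(lst):
    pset = [[]]                               # start from the empty subset
    for elem in lst:
        pset += [x + [elem] for x in pset]    # copies of everything so far, plus elem
    return pset

print(all_subsets([1, 2]))  # -> [[], [1], [2], [1, 2]]
print(len(all_subsets([1, 2, 3, 4, 5])))  # -> 32, i.e., 2**5
```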
UNIT 7 – Wrap Up Review: A language is a set of strings. Regular Expressions – concise notation for specifying some sets of strings (regular languages). Finite State Machines – pictorial representation + way to implement regular expressions (deterministic or not). Context-Free Grammars – concise notation for specifying some sets of strings (context-free languages). Memoization (also called Dynamic Programming) – keep previous results in a chart to save computation. Lexing – break a big string up into a list of tokens (words) (specified using r.e.). Parsing – determine if a list of tokens is in the language of a CFG. If so, produce a Parse Tree. Type – a type is a set of values and associated safe operations. Semantics (Meaning) – a program may have type errors (or other exceptions) or it may produce a value.
Optimization – replace a program with another that has the same semantics (but uses fewer resources). Interpretation – a recursive walk over the (optimized) parse tree; the meaning of a program is computed from the meanings of its subexpressions. Web Browser – lex and parse HTML, treating JS as a special token. The HTML interpreter calls the JS interpreter, which returns a string; the HTML interpreter then calls the graphics library to display the result. Security: Computing in the presence of an adversary.
This file is not offered officially by Udacity.com. This material was created by a student as personal notes, while attending the lectures of the course CS262: Programming Languages. This material is offered freely. Lamprianidis Nick Last Edited: 07/27/2012