Gold Parser


Page 1: Gold Parser

GOLD: A Grammar Oriented Parsing System

Devin Cook and Du Zhang

Department of Computer Science

California State University

Sacramento, CA 95819-6021

SEKE 2004, 6/24/2004

Page 2: Gold Parser

Introduction

• What is a Parser?
– Software which breaks a source program into its various grammatical units w.r.t. a formal grammar
– Used to convert a source program into an internal representation

• Parsing Algorithms
– LL Parsers: top-down, predictive
– LR / LALR Parsers: bottom-up, shift-reduce

Page 3: Gold Parser

Motivation

• The common approach to creating parsers is to use a compiler-compiler, or parser generator

• Each parser generator is designed for a specific programming language. There is no consistent parser generator
– Different grammatical notations
– Features and interfaces of tools vary in both look and behavior

Page 4: Gold Parser

Goals

• Design and implement a generalized parsing system that supports development of multiple programming languages

• Offer a consistent development environment for the language developers

Page 5: Gold Parser

GOLD

• Grammar Oriented Language Developer.

• Separates the component that generates parse tables for a target grammar from the component that does the actual parsing.

• Supports the full Unicode character set.

• Includes a set of tools that aid the language development process.

Page 6: Gold Parser

System Structure

Builder
– Analyzes a target grammar and creates DFA and LALR parse tables
– These tables are saved to a Compiled Grammar Table file

Compiled Grammar Table file
– Intermediary between the Builder and the Engine
– The file format is platform independent
– Format is designed to be very easy to read and extend in future versions

Engine
– Reads the tables & parses the source text
– Can be implemented in different programming languages, as needed

Page 7: Gold Parser

Development Flow

1. Grammar is defined and loaded
– Any text editor can be used

2. Builder
– Grammar is analyzed and errors reported
– The parse tables are created and saved to a .cgt file

3. Engine
– Reads the tables, parses the source string, and produces parsing results
– Can be implemented in different programming languages, as needed

Page 8: Gold Parser

The Builder

• GOLD meta-language
• Compiled grammar table (.cgt) file
• Skeleton program creation for the Engine from program templates
• Interactive source string testing
• Display of various parse table information
• Export parse tables to a web page, XML file, or formatted text

Page 9: Gold Parser

GOLD Meta-Language

• The GOLD Meta-Language is used to define a target grammar

• It must not contain features that are programming language dependent

• Its notation is very close to the standard notations: regular expressions for terminals and Backus-Naur Form for rules

• It supports all language attributes (including those which cannot be specified using BNF or regular expressions)

Page 10: Gold Parser

GOLD Meta-Language (contd.)

• Format
– Parameters are used to specify attributes about the grammar
– Character Sets are used to define the character domain for terminals
– Terminals are defined using regular expressions
– Rules are defined using Backus-Naur Form

Page 11: Gold Parser

Defining Parameters

• Used for Name, Author, Case Sensitive, Start Symbol, ....

• Parameter names are delimited by double quotes

• Parameters
– "Name", "Author", "Version", "About" are informative
– "Start Symbol" specifies the initial / start rule in the grammar

Page 12: Gold Parser

Parameters

"Name", "Version", "Author", "About": Informative fields. These have no effect on table generation.

"Case Sensitive": If set to True, the system will construct case sensitive tokenizer tables.

"Character Mapping": Some characters overlap ordinal values between ANSI and Unicode. If set to ANSI, the system will populate both.

"Auto Whitespace": If not set to False, the system will automatically define a terminal to accept whitespace.

"Start Symbol": The initial/start rule of the grammar. This parameter is required.

Page 13: Gold Parser

Example Parameters

"Name"    = 'My Programming Language'

"Version" = '1.0 beta'

"Author"  = 'John Q. Public'

"About"   = 'This is a test declaration.'

| 'Multiple lines are available'

| 'by using the "pipe" symbol'

"Case Sensitive" = 'False'

"Start Symbol" = <Statement>

Page 14: Gold Parser

Defining Sets

• Character sets are used to aid the construction of regular expressions used to define terminals

• Literal sets of characters are delimited using ‘[’ and ‘]’

• Names of user-defined sets are delimited by ‘{’ and ‘}’

• Sets can be defined by adding and subtracting previously declared sets

Page 15: Gold Parser

Example Sets

Declaration                          Resulting characters
-----------------------------------  --------------------
{Bracket} = [']']                    ]
{Quote} = ['']                       '
{Vowels} = [aeiou]                   aeiou
{Vowels 2} = {Vowels} + [y]          aeiouy
{Set 1} = [abc]                      abc
{Set 2} = {Set 1} + [12] - [c]       ab12
{Set 3} = {Set 2} + {Digit}          ab0123456789
{Hex Char} = {Digit} + [ABCDEF]      0123456789ABCDEF
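The set arithmetic in these examples can be mimicked with ordinary Python sets. The sketch below is only an illustration, not the Builder's implementation: `resolve` and `build_set` are hypothetical helpers, the `+` and `-` operators are assumed to apply left to right (which is what produces `ab12` for {Set 2}), and quoting inside set literals such as `[']']` is not handled.

```python
def resolve(operand, defined):
    """Turn '{Name}' into a previously declared set and '[abc]' into a literal set."""
    if operand.startswith("{") and operand.endswith("}"):
        return defined[operand[1:-1]]
    if operand.startswith("[") and operand.endswith("]"):
        return set(operand[1:-1])
    raise ValueError(f"not a set reference or literal: {operand!r}")


def build_set(first, *ops, defined):
    """Evaluate first (+|-) operand ... strictly left to right (an assumption)."""
    result = set(resolve(first, defined))
    for sign, operand in ops:
        chars = resolve(operand, defined)
        result = result | chars if sign == "+" else result - chars
    return result


# Only the pre-defined set needed by the examples is modeled here.
sets = {"Digit": set("0123456789")}
sets["Set 1"] = build_set("[abc]", defined=sets)
sets["Set 2"] = build_set("{Set 1}", ("+", "[12]"), ("-", "[c]"), defined=sets)
sets["Set 3"] = build_set("{Set 2}", ("+", "{Digit}"), defined=sets)
sets["Hex Char"] = build_set("{Digit}", ("+", "[ABCDEF]"), defined=sets)

print("".join(sorted(sets["Set 2"])))
```

Evaluating right to left instead would leave `c` in {Set 2}, so the slide's result is at least consistent with the left-to-right assumption.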

Page 16: Gold Parser

Pre-defined Character Sets

• There are many sets of characters which are not accessible via keyboard, or so commonly used that it would be repetitive and time-consuming to redefine in each grammar

• GOLD meta-language contains a collection of useful pre-defined sets

• These include sets often used for defining terminals as well as characters not accessible via keyboard

Page 17: Gold Parser

Individual Characters

• Some control characters that cannot be specified on a standard keyboard

Page 18: Gold Parser

Commonly used Character Sets

{Digit}

{Letter}

{Alphanumeric}

{Printable}

{Whitespace}

{Letter Extended}

{Printable Extended}

{ANSI Mapped}

{ANSI Printable}

Page 19: Gold Parser

Unicode Character Sets

• GOLD meta-language contains 43 pre-defined Unicode character sets

• The names of those sets are based on standard names of the Unicode Consortium

Page 20: Gold Parser

Comments

• GOLD meta-language allows both line comments and block comments

Page 21: Gold Parser

Defining Terminals

• Terminals are used to define reserved words, symbols, and recognized patterns (identifiers) in a grammar

• Each terminal is defined using a regular expression, which is used to construct the Deterministic Finite Automaton (DFA) used by the tokenizer

• Implicit declaration of frequently used reserved words and symbols

Page 22: Gold Parser

Example Terminals

Definition                   Matches
---------------------------  ------------------------------
Example1 = a b c*            ab, abc, abcc, abccc, ...
Example2 = a b? c            abc, ac
Example3 = a|b|c             a, b, c
Example4 = a[12]*b           ab, a1b, a2b, a12b, a21b, ...
Example5 = {Letter}+         cat, dog, Sacramento, ...
ListFunction = c[ad]+r       car, cdr, caar, cadr, ...
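To check what each example terminal accepts, the definitions can be approximated with Python's `re` module. This is a rough sketch: the translations in `TERMINALS` are hand-written assumptions rather than output of the Builder, GOLD's own regular expression syntax differs from Python's, and `{Letter}` is approximated by ASCII letters only.

```python
import re

# GOLD terminal name -> assumed Python regular expression equivalent
TERMINALS = {
    "Example1": r"abc*",
    "Example2": r"ab?c",
    "Example3": r"a|b|c",
    "Example4": r"a[12]*b",
    "Example5": r"[A-Za-z]+",     # {Letter}+ approximated with ASCII letters
    "ListFunction": r"c[ad]+r",
}


def accepts(name, text):
    """True if the whole text matches the named terminal's pattern."""
    return re.fullmatch(TERMINALS[name], text) is not None


for name, sample in [("Example1", "abccc"), ("Example2", "ac"),
                     ("Example4", "a21b"), ("ListFunction", "cadr")]:
    print(name, sample, accepts(name, sample))
```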

Page 23: Gold Parser

Defining Rules

• Use Backus-Naur Form
• Nonterminals are delimited by angle brackets < and >
• Terminals are delimited by single quotes or not delimited at all

Page 24: Gold Parser

Example: Lists

• Lists are specified using recursive rules

Identifier = {Letter}{Alphanumeric}*

<List> ::= <List Item> ',' <List>
         | <List Item>

<List Item> ::= Identifier

Recursion
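The recursive shape of the <List> rule can be mirrored by a small recursive function. The sketch below is a hand-written recognizer for illustration only: GOLD's Engine is table-driven (LALR), `parse_list` is a hypothetical helper, the input is assumed to be already split into tokens, and Identifier is approximated with ASCII letters and digits.

```python
import re

# Identifier = {Letter}{Alphanumeric}*, approximated with ASCII characters
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def parse_list(tokens, pos=0):
    """<List> ::= <List Item> ',' <List> | <List Item>

    Returns (items, next_position) or raises SyntaxError.
    """
    if pos >= len(tokens) or not IDENTIFIER.fullmatch(tokens[pos]):
        raise SyntaxError(f"identifier expected at token {pos}")
    item = tokens[pos]
    if pos + 1 < len(tokens) and tokens[pos + 1] == ",":
        rest, end = parse_list(tokens, pos + 2)   # first alternative: recurse
        return [item] + rest, end
    return [item], pos + 1                         # second alternative: one item


items, end = parse_list(["a", ",", "b2", ",", "c"])
print(items, end)
```

Each comma triggers one more recursive call, which is how a rule with only two alternatives describes lists of any length.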

Page 25: Gold Parser

Example: Optional Rules

• Optional rules are specified with an empty production (one containing no symbols)

• This allows the developer to specify optional elements as well as lists containing 0 or more members

Zero or more:

<Series> ::= <s-Expr> <Series>
           |

Optional rule:

<Quote> ::= ''
          |

Page 26: Gold Parser

"Name"    = 'LISP'

"Author"  = 'John McCarthy'

"Version" = 'Minimal'

"About"   = 'LISP organizes ALL data around "lists".'

"Start Symbol" = <s-Expr>

{Atom Char}   = {Printable} - {Whitespace} - [()"\'']

Atom = ( {Atom Char} | '\'{Printable} )+

<s-Expr> ::= <Quote> Atom

           | <Quote> '(' <Series> ')'

           | <Quote> '(' <s-Expr> '.' <s-Expr> ')'

<Series> ::= <s-Expr> <Series>

           |

<Quote> ::= ''

          |

Example: LISP Grammar

Page 27: Gold Parser

Example: LISP Grammar

Same grammar as on the previous slide, with these parts highlighted:

• Parameters: the "Name", "Author", "Version", and "About" declarations

• Initial Rule: "Start Symbol" = <s-Expr>

Page 28: Gold Parser

Example: LISP Grammar

Same grammar, with the set declaration highlighted:

• Set Definition: {Atom Char}   = {Printable} - {Whitespace} - [()"\'']

• Set Literal: [()"\'']

Page 29: Gold Parser

Example: LISP Grammar

Same grammar, with the terminal declaration highlighted:

• Terminal Definition: Atom = ( {Atom Char} | '\'{Printable} )+

Page 30: Gold Parser

Example: LISP Grammar

Same grammar, with the rules highlighted:

• Rules: the <s-Expr>, <Series>, and <Quote> definitions

• Recursive Rule: <Series>

• Optional Rule: <Quote>

Page 31: Gold Parser

Compiled Grammar Table File

• A file format designed to store parse tables and other information generated by the Builder

• Design considerations
– Easy to implement on different platforms
– Flexibility for data structures to be added or expanded
– Room for future growth (additional new types of data)

Page 32: Gold Parser

.cgt File Structure

• The file consists of a number of records
• Each record contains a number of entries

Page 33: Gold Parser

.cgt Record

• The header contains name and version info
• A record has the following format

Page 34: Gold Parser

Parameter Record

• Parameter record: occurs only once in the .cgt file. It contains information about the grammar as well as attributes that affect how the grammar functions. The record is preceded by a byte field containing the value 80, the ASCII code for the letter 'P'.

Page 35: Gold Parser

Table Size Record

• Table size record: appears before any records containing information about symbols, sets, rules, or state tables. The first field of the record contains a byte with the value 84, the ASCII code for the letter 'T'. Each remaining value contains the total number of objects in one of the listed tables.

Page 36: Gold Parser

Other Types of Records

• Character set table member
• Symbol table member
• Initial states (for both DFA and LALR)
• Rule table member
• DFA state table member
• LALR state table member

Page 37: Gold Parser

An Example cgt File

• An example grammar

"Name" = 'Example'

"Version" = '1.0'

"Author" = 'Devin Cook'

"About" = 'N/A'

"Start Symbol" = <Stms>

<Stms> ::= <Stm> <Stms>

| <Stm>

<Stm> ::= if <Exp> then <Stms> end

| Read Id

| Write <Exp>

<Exp> ::= Id '+' <Exp>

| Id '-' <Exp>

| Id

Page 38: Gold Parser

Table Content

• Symbol Table

========================================
Symbol Table
========================================
Index Name
----- ------------
0     (EOF)
1     (Error)
2     (Whitespace)
3     '-'
4     '+'
5     end
6     Id
7     if
8     Read
9     then
10    Write
11    <Exp>
12    <Stm>
13    <Stms>

Page 39: Gold Parser

Table Content (2)

• Rules

========================================

Rules

========================================

Index Name ::= Definition

----- ------ --- ------------------------

0 <Stms> ::= <Stm> <Stms>

1 <Stms> ::= <Stm>

2 <Stm> ::= if <Exp> then <Stms> end

3 <Stm> ::= Read Id

4 <Stm> ::= Write <Exp>

5 <Exp> ::= Id '+' <Exp>

6 <Exp> ::= Id '-' <Exp>

7 <Exp> ::= Id
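Since every symbol has an index in the symbol table, each rule can be stored as a nonterminal index plus a list of symbol indices. The sketch below shows that encoding for the example grammar; `SYMBOLS`, `RULES`, and `encode_rule` are illustrative names, and the in-memory layout is an assumption rather than the Builder's actual data structure.

```python
# Symbol table of the example grammar, listed in index order.
SYMBOLS = ["(EOF)", "(Error)", "(Whitespace)", "'-'", "'+'", "end", "Id",
           "if", "Read", "then", "Write", "<Exp>", "<Stm>", "<Stms>"]

# Rule table of the example grammar, listed in index order.
RULES = [
    ("<Stms>", ["<Stm>", "<Stms>"]),
    ("<Stms>", ["<Stm>"]),
    ("<Stm>", ["if", "<Exp>", "then", "<Stms>", "end"]),
    ("<Stm>", ["Read", "Id"]),
    ("<Stm>", ["Write", "<Exp>"]),
    ("<Exp>", ["Id", "'+'", "<Exp>"]),
    ("<Exp>", ["Id", "'-'", "<Exp>"]),
    ("<Exp>", ["Id"]),
]


def encode_rule(rule):
    """Replace symbol names by their indices in the symbol table."""
    head, body = rule
    return SYMBOLS.index(head), [SYMBOLS.index(name) for name in body]


encoded = [encode_rule(rule) for rule in RULES]
print(encoded[0])
```

Rule 0 encodes as nonterminal 13 with symbols 12 and 13, which are the values that appear in the Rule record of the .cgt illustration later in the deck.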

Page 40: Gold Parser

Table Content (3)

• Character Set Table

========================================

Character Set Table

========================================

Index Characters

----- ---------------------------------

0 {HT}{LF}{VT}{FF}{CR}{Space}{NBSP}

1 +

2 -

3 Ee

4 Ii

5 Rr

6 Tt

7 Ww

8 Nn

9 Dd

10 Ff

11 Aa

12 Hh

Page 41: Gold Parser

Table Content (3)

• DFA states

========================================
DFA States
========================================
Index    Description          Character Set
-------- -------------------  -------------
0        Goto 1               0
         Goto 2               1
         Goto 3               2
         Goto 4               3
         Goto 7               4
         Goto 10              5
         Goto 14              6
         Goto 18              7
1        Goto 1               0
         Accept (Whitespace)
…………
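A table like this is typically walked one character at a time, remembering the last accepting state reached. The sketch below models only what the slide shows for whitespace: state 0 moves to state 1 on character set 0, and state 1 loops on that set and accepts (Whitespace). The names `EDGES`, `ACCEPT`, and `longest_match` are illustrative, and the longest-match policy is an assumption about how an Engine might use the table.

```python
# Character set 0 from the Character Set Table:
# {HT}{LF}{VT}{FF}{CR}{Space}{NBSP}
WHITESPACE = set("\t\n\x0b\x0c\r \xa0")

# state -> list of (character set, target state); only the whitespace edges
EDGES = {
    0: [(WHITESPACE, 1)],
    1: [(WHITESPACE, 1)],
}
ACCEPT = {1: "(Whitespace)"}


def longest_match(text, start=0):
    """Return (symbol, end) for the longest accepted prefix, or (None, start)."""
    state, pos = 0, start
    best = (None, start)
    while pos < len(text):
        for chars, target in EDGES.get(state, []):
            if text[pos] in chars:
                state = target
                break
        else:
            break                      # no edge for this character: stop
        pos += 1
        if state in ACCEPT:
            best = (ACCEPT[state], pos)
    return best


print(longest_match("  \tif"))
```

Adding the remaining rows of the table to `EDGES` and `ACCEPT` would extend the same loop to keywords and identifiers.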

Page 42: Gold Parser

Table Content (4)

• LALR states

========================================

LALR States

========================================

Index Configuration/Action

-------- ------------------------------------

0 if Shift 1

Read Shift 9

Write Shift 11

<Stm> Goto 13

<Stms> Goto 17

1 <Stm> ::= if · <Exp> then <Stms> end

Id Shift 2

<Exp> Goto 7

…………

Page 43: Gold Parser

cgt File for the grammar

• To illustrate, only one of each record type is included

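Each record starts with the byte 'M' (1 byte) followed by the number of entries (2 bytes). Each entry is a 1-byte type code followed by its value. The type codes that appear are 'b' (byte), 'B' (boolean), 'I' (integer), 'S' (string), and 'E' (empty). Sizes are in bytes, given as type code + value.

Table Counts record (6 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'T'          Record type              1 + 1
'I'   14           Symbol Table             1 + 2
'I'   13           Character Set Table      1 + 2
'I'   8            Rule Table               1 + 2
'I'   23           DFA Table                1 + 2
'I'   18           LALR Table               1 + 2

Symbol record (4 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'S'          Record type              1 + 1
'I'   0            Index                    1 + 2
'S'   EOF          Name                     1 + 4
'I'   3            Kind                     1 + 2

Character Set record (3 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'C'          Record type              1 + 1
'I'   4            Index                    1 + 2
'S'   Ii           Characters               1 + 6

Rule record (6 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'R'          Record type              1 + 1
'I'   0            Index                    1 + 2
'I'   13           Nonterminal              1 + 2
'E'                (Empty)                  1
'I'   12           Symbol 0                 1 + 2
'I'   13           Symbol 1                 1 + 2

DFA State record (8 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'D'          Record type              1 + 1
'I'   1            Index                    1 + 2
'B'   1            Accept State             1 + 1
'I'   2            Accept Index             1 + 2
'E'                (Empty)                  1
'I'   0            Character Set Index 0    1 + 2
'I'   1            Target Index 0           1 + 2
'E'                (Empty) 0                1

LALR State record (7 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'L'          Record type              1 + 1
'I'   7            Index                    1 + 2
'E'                (Empty)                  1
'I'   9            Symbol Index 0           1 + 2
'I'   1            Action 0                 1 + 2
'I'   8            Target 0                 1 + 2
'E'                (Empty) 0                1

Parameters record (7 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'P'          Record type              1 + 1
'S'   Example      Name                     1 + 16
'S'   1.0          Version                  1 + 8
'S'   Devin Cook   Author                   1 + 22
'S'   N/A          About                    1 + 8
'B'   0            Case Sensitive           1 + 1
'I'   13           Start Symbol             1 + 2

Initial States record (3 entries)
Type  Value        Field                    Size
----  -----------  -----------------------  ------
'b'   'I'          Record type              1 + 1
'I'   0            DFA                      1 + 2
'I'   0            LALR                     1 + 2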
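A record laid out this way can be decoded with a short loop over its entries. The sketch below is an illustration under stated assumptions, not the official file reader: `read_record` is a hypothetical helper, integers are assumed to be 2-byte little-endian values, and strings are assumed to be null-terminated with 2 bytes per character (which fits most of the string sizes in the illustration, for example 16 bytes for "Example"). The file header is not handled.

```python
import struct


def read_record(data, pos=0):
    """Decode one record starting at data[pos]; return (entries, next_pos)."""
    if data[pos:pos + 1] != b"M":
        raise ValueError("expected record marker 'M'")
    (count,) = struct.unpack_from("<H", data, pos + 1)   # 2-byte entry count
    pos += 3
    entries = []
    for _ in range(count):
        kind = data[pos:pos + 1]
        pos += 1
        if kind == b"E":                # empty entry: type code only
            entries.append(None)
        elif kind == b"b":              # single byte
            entries.append(data[pos])
            pos += 1
        elif kind == b"B":              # boolean stored in one byte
            entries.append(data[pos] == 1)
            pos += 1
        elif kind == b"I":              # 2-byte integer (assumed little-endian)
            entries.append(struct.unpack_from("<H", data, pos)[0])
            pos += 2
        elif kind == b"S":              # string (assumed UTF-16LE, null-terminated)
            end = pos
            while end + 2 <= len(data) and data[end:end + 2] != b"\x00\x00":
                end += 2
            entries.append(data[pos:end].decode("utf-16-le"))
            pos = end + 2
        else:
            raise ValueError(f"unknown entry type {kind!r}")
    return entries, pos


# The Character Set record from the illustration: 'C', index 4, characters "Ii"
raw = (b"M" + struct.pack("<H", 3) + b"bC" + b"I" + struct.pack("<H", 4)
       + b"S" + "Ii".encode("utf-16-le") + b"\x00\x00")
entries, next_pos = read_record(raw)
print(entries, next_pos)
```

The first entry comes back as the numeric byte value, so a caller would compare it with `ord("C")`, `ord("R")`, and so on to decide which table the record belongs to.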

Page 44: Gold Parser

The Remaining Builder Features

• Besides the meta-language and the .cgt file, the Builder offers:
– Skeleton program creation for the Engine from program templates
– Interactive source string testing
– Display of various parse table information
– Export of parse tables to a web page, XML file, or formatted text

Page 45: Gold Parser

Application Layout

[Screenshot of the Builder application. Labeled parts: Online Help, Toolbar, Grammar Editor, Status Message, Next Button]

Page 46: Gold Parser

Program Templates

• When developing an Engine that interacts with the tables of rules and symbols in the .cgt file, manually typing constant definitions can be tedious and error-prone

• Program templates are designed to help automate the Engine development

• The Builder can use a program template to create a “skeleton program” for an implementation of the Engine

Page 47: Gold Parser

Program Templates (contd.)

• Skeleton program contains
– Necessary declarations of constants and variables
– Function calls
– Case statements, pre-processor statements
– Ready-to-use programs

• Notation designed to not conflict with known languages

• Program templates are saved in a subfolder

Page 48: Gold Parser

Display of Symbol Table

• Symbol table display

Page 49: Gold Parser

Display of Rule Table

• Rule table display

Page 50: Gold Parser

Display of Log Information

• Log info: general information about the number of symbols, which ones were defined implicitly, table counts, and any errors that occur

Page 51: Gold Parser

Display of DFA State Table

• DFA state table

Page 52: Gold Parser

Display of LALR State Table

• LALR state table

Page 53: Gold Parser

Export Parse Tables

• Parse tables can be exported to a web page, formatted text, or an XML file

Page 54: Gold Parser

Web Page Export

• An example of webpage export

Page 55: Gold Parser

A Short Demo

• A simple grammar
• ANSI C

Page 56: Gold Parser

The Engine

• Different implementations of the Engine
• Object-oriented approach
• Its design is centered on the “GOLDParser” object, which performs all the parsing logic

• The remaining objects are used for storage or to support the GOLDParser object

• Available in: Visual Basic .NET, ANSI C, C#, C++ (MFC), Delphi 5 & 6, Java, Python, Visual Basic 6

Page 57: Gold Parser

Testing and Development

• Extensive tests on the Builder’s algorithms to generate the LALR and DFA tables
– Small grammars
– Grammars for real-world programming languages (e.g., ANSI C, BASIC, COBOL, LISP, Smalltalk, SQL, Visual Basic .NET, HTML, XML)

• A Visual Basic 6 version of the Engine was developed as an integral part of the GOLD system and was tested

Page 58: Gold Parser

Comparison

• Yacc: for C or C++ on the UNIX platform
• ANTLR: OO parser generator that works for C++, C#, and Java
• Bison: Yacc compatible
• Elkhound: parser generator that is based on the generalized LR algorithm
• GENOA: framework for code analysis tools that has a parsing front end

Page 59: Gold Parser

Free Parsing Systems

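[Comparison table; the per-cell support marks are not preserved in this transcript]

Columns: GOLD, YACC, ANTLR, Grammatica

Rows (languages): ANSI C, C++, C#, Delphi 5 & 6, Java, Python, Visual Basic 5 & 6, Visual Basic .NET, All .NET Languages, All ActiveX Languages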

Page 60: Gold Parser

Benefits of GOLD

• It supports development of multiple programming languages and the full Unicode character set

• It has a set of development tools
• Its meta-language is easy to understand and its Builder GUI is easy to use

Page 61: Gold Parser

Contributors to Different Engines

Contributor            Contribution
---------------------  --------------------------------------
Manuel Astudillo       C++ implementation
Max Battcher           Recompiled .NET Source to a DLL
Matthew Hawkins        Java implementation
Justin Holmes          ANSI C implementation
Ibrahim Khachab        Modified Delphi version
Marcus Klimstra        C# implementation
Milosz A. Krajewski    Python implementation
Alexandre Rai          Delphi implementation
Eylem Ugurel           C++ implementation
Martin van der Geer    Delphi implementation
Robert van Loenhout    C# implementation
Reggie Wilbanks        Ported the Engine to Visual Basic .NET

Page 62: Gold Parser

Website

• The URL for the GOLD website

http://www.devincook.com/goldparser

• On average, approximately 3000 copies of the Builder application are downloaded per month

• Latest news: known bugs, workarounds, new releases

• Contributor section
• Online documentation

Page 63: Gold Parser

Future Work

• Port the Builder to UNIX and Linux

• Enhancement to the meta grammar