A New Parser - Neil Mitchell© Neil Mitchell 2004 4 Parsing is a function From Text to Abstract...

28
© Neil Mitchell 2004 1 A New Parser Neil Mitchell

Transcript of A New Parser - Neil Mitchell© Neil Mitchell 2004 4 Parsing is a function From Text to Abstract...

  • © Neil Mitchell 2004 1

    A New Parser

    Neil Mitchell

  • © Neil Mitchell 2004 2

    Disclaimer

    This is not something I have done as part of my PhDI have done it on my ownI haven’t researched other systemsIts not finishedAny claims may turn out to be wrong

  • © Neil Mitchell 2004 3

    Overview

    What is parsing?What systems exist?What do you want to parse?What is my system?Why is mine better (or not)?How do you interact with a parser?

  • © Neil Mitchell 2004 4

    Parsing is a function

    From Text to Abstract Syntax TreeParse :: String -> Tree(Token)How?

    Hand codedUsing Lex/Yacc (or Flex/Bison)Parser combinatiorsEtc…

  • © Neil Mitchell 2004 5

    What Systems Exist?

    Lex/Yacc – the classicCreated to write parsers for CSteps

    Write a grammar file, including C codeGenerate a C fileCompile C fileLink to your code

  • © Neil Mitchell 2004 6

    Lex

    Lex :: String -> List(Token)

    Uses Regular Expressions to split up a string into various lexemes.

    Runs in O(n), using Finite State Automata.

  • © Neil Mitchell 2004 7

    Yacc

    Yacc :: List(Token) -> Tree(Token)

    Based on a BNF grammar.Runs in just over O(n), using an LALR(1)

    stack automaton.Often fails unpredictably…

  • © Neil Mitchell 2004 8

    Early

    Early :: List(Token) -> Tree(Token)

    Almost identical to Yacc, but removes the unpredictable failures, requiring less knowledge of LALR(1)

    A fair bit slower, worst case of O(n3) or O(n2.6) depending on implementation.

  • © Neil Mitchell 2004 9

    Parser

    ParserC = Yacc . LexParserHaskell = Happy . AlexParserJava = …

    Very language dependant, Yacc/Lex both tied to C

  • © Neil Mitchell 2004 10

    Bad points

    Language dependantYacc – shift/reduce conflictNot CFGNot very intuitive to write Yacc

    Summary: Lex good, Yacc bad

  • © Neil Mitchell 2004 11

    What do you want to parse?

    Languages: Haskell, C#, JavaConfigurations: INI, XMLGrammar files for this toolNOT: Perl, Latex, HTML, C++

    Insane syntaxHorrid historyTwisted parody of languages

  • © Neil Mitchell 2004 12

    Brackets, Strings, Escapes

    Brackets () [] {} - YaccStrings “” ‘’ – LexAre strings not brackets, just which disallow nesting?What about escape characters?

    Parse them in Lex: “((\.)|.)*”Re-parse them later

  • © Neil Mitchell 2004 13

    My System

    Bracket :: String -> Tree(Token)Lex :: String -> List(Token)Group :: Tree(Token) -> Tree(Token)

    Parser :: String -> Tree(Token)Parser = Group . map Lex . Bracket

  • © Neil Mitchell 2004 14

    Bracket

    Match brackets, strings, escape charsDefine nesting

    main = 0 * all [lexer]all : round stringround = "(" ")" allstring = "\"" "\"" escape [raw]escape = "\\" 1

  • © Neil Mitchell 2004 15

    Lex

    Same as traditional Lex, but…Easier – no need to do string escapingCan be different for different parts

    In comments use [none]In strings use [raw]Can have many lexers for different parts

  • © Neil Mitchell 2004 16

    Lex (2)

    keyword = `[a-zA-Z][a-zA-Z0-9_]*`number = `[0-9]+`white = `[ \t]`star = "*"eq = "="for.while.

  • © Neil Mitchell 2004 17

    Group

    Group :: Tree(Token) -> Tree(Token)Id :: a -> a

    Therefore "Group = Id" works

    Sometimes you need a higher level of structure, what the brackets meanThe most complex element (unfortunately)

  • © Neil Mitchell 2004 18

    Group (2)

    root = main[*{rule literal}]

    rule = line[ keyword eq {regexp string} ]

    literal = line[ keyword dot ]

  • © Neil Mitchell 2004 19

    Summary of BLG

    Complete lack of embedded C/HaskellData format defined generically

    Can be Haskell linked listCan be C arrayThere is an XML format defined

    Similar in style to each otherAll "simple" langauges

  • © Neil Mitchell 2004 20

    Implementation

    BracketDeterministic Push down stack automata

    LexSteal existing lex, FSA

    GroupFSA? Maybe…Have a sketched automaton

  • © Neil Mitchell 2004 21

    Implementation (2)

    I have implemented most of it in C#Slow, but very useableBracket seems pretty perfectLex uses Regex objects, but worksGroup is less complete, uses backtracking, doesn't have maximal munch semantics, NP, etc.

  • © Neil Mitchell 2004 22

    Implementation (3)

    BLG is self-parsing ☺1 Lex file for all 31 Bracket file for all 33 Group files, one each

    Reuse is good

  • © Neil Mitchell 2004 23

    Interaction

    How do you interact with a parser?Yacc/Lex

    Translate, Compile, Link, Execute

    BLGTranslate, Compile, Link, ExecuteCompile into resource fileLoad at runtime (Text Editors)

  • © Neil Mitchell 2004 24

    Exclamations!

    BLG defines a complete set of exclamations which allow for code hoisting and deletingRemove tokens from the output (white space/comments)Promote tokens, i.e. line![ x ] returns xSimple, but ignored here

  • © Neil Mitchell 2004 25

    $Directives

    In the Bracket, before any processingStream processing directives$text (remove '\r', append '\n')$tab-indent (for Haskell/Python)$upper-caseEasy, simple, generic, reusable

  • © Neil Mitchell 2004 26

    Advantages

    Language neutralHaskell parsing

    GHC in HaskellHugs in CCould now use the same grammar

    Can reuse elements, i.e. Lex and Bracket are almost identical for C#/Java

  • © Neil Mitchell 2004 27

    But best of all

    The grammars are really easy to specifyA bit of a leapWould need years of hypothesis testingAnd maybe even a working implementation

    FasterAlmost irrelevant, thanks to faster computers

  • © Neil Mitchell 2004 28

    Questions?

    What did I explain badly?I would really appreciate any feedback!Should I ditch the entire idea?Should I implement it?Should I give up my PhD to sell this

    system?

    A New ParserDisclaimerOverviewParsing is a functionWhat Systems Exist?LexYaccEarlyParserBad pointsWhat do you want to parse?Brackets, Strings, EscapesMy SystemBracketLexLex (2)GroupGroup (2)Summary of BLGImplementationImplementation (2)Implementation (3)InteractionExclamations!$DirectivesAdvantagesBut best of allQuestions?