Essays/Recursive Descent Parser

From J Wiki
Jump to navigation Jump to search

Wm yes check.png


Parser

Parser applies grammar rules of a formal language to an input stream of tokens. If successful, the parsing process signifies that the input indeed represents correct sequence from the language; the output of parsing can be an AST (abstract syntax tree) of the grammatical structure of language elements or the result of structured evaluation described by the sequence.

Recursive Descent

Recursive descent is a top-down algorithm consisting of mutually recursive operations, such as traversal of a tree structure or parsing. Download script: File:Recdesc.ijs

NB. recdesc - recursive descent parser
NB. 10/30/06 Oleg Kobchenko
NB. 04/05/07 Oleg Kobchenko literate
NB. 10/09/07 Oleg Kobchenko consistent with recdesc2

«grammar»

«lexer utilities»

NB. =========================================================
NB. common single class
coclass 'pcalc'

NB. tokens
'S_SPACE S_NUM S_ADD S_SUB S_MUL S_DIV S_LPAR S_RPAR S_ERR'=: i. 9
S_EOF=: _1

destroy=: codestroy

NB. =========================================================
NB. Regex Lexer, Token Pump and Accessors

«lexer»

«parser class»

«token pump»

«grammar rules»

«runner»

«public interface»

«test»

Simple Calculator

As a simple example, we will describe the language of an arithmetic calculator with traditional precedence rules: [{{#file: "grammar"}} Download script: grammar ]

NB. The following grammar rules are used:
NB.
NB. factor ::= ('+'|'-')* ( num | '(' expr ')' )
NB. term   ::= factor ( ('*'|'%') factor )*
NB. expr   ::= term ( ('+'|'-') term )*
NB. line   ::= expr EOF

such that it will be able to evaluate the following expressions [{{#file: "accept test"}} Download script: accept test ]

NB. accept
calc '11+22'
calc '5+2*10'
calc '(5+2)*10'
calc '(11+22)/-(3.0*2/2)'
calc '(11+22)*+(-1-2)'

and raise errors on such expressions [{{#file: "reject test"}} Download script: reject test ]

NB. reject
calc '22+3/'
calc '22+3/(1+)'
calc '1+abc/2'

Lexer

Using Regex Lexer. [{{#file: "lexer utilities"}} Download script: lexer utilities ]

NB. =========================================================
NB. regex lexer utilitites
require 'regex'
coclass 'z'

lxdefine=: 1 : '''(?x)'' , m : 0'
lxresult=: ((1 i.~[:*@,(1 1,:_ 1)&(];.0)) , {.)"2
lxmatches=: lxresult@rxmatches
lxfrom=: >@(([:' ['&,,&']')&.>)@rxfrom
lxview=: lxmatches (":@[ ,. }."1@[ lxfrom ]) ]

Lexer will reside in the same class as parser, pcalc. We define the following tokens [{{#file: "lexer"}} Download script: lexer ]

NB. =========================================================
NB. lexer part

LEX1=: noun lxdefine
( \s+     )|# 0 space
( [0-9.]+ )|# 1 number
( \+      )|# 2 add
( -       )|# 3 sub
( \*      )|# 4 mul
( /       )|# 5 div
( \(      )|# 6 lpar
( \)      )|# 7 rpar
( .+      ) # 8 error
)

The following test cases [{{#file: "lexer test"}} Download script: lexer test ]

LEX1_pcalc_ lxview '5+2*10'
LEX1_pcalc_ lxview '(3.0*2/2)'

will produce such scanning results

   LEX1_pcalc_ lxview '5+2*10'
1 0 1 [5]
2 1 1 [+]
1 2 1 [2]
4 3 1 [*]
1 4 2 [10]
   LEX1_pcalc_ lxview '(3.0*2/2)'
6 0 1 [(]
1 1 3 [3.0]
4 4 1 [*]
1 5 1 [2]
5 6 1 [/]
1 7 1 [2]
7 8 1 [)]

Parser Class

The class will also contain creation, destruction and maintenance routines. Each instance will accept an input string, stored in STR. The list of tokens from scanner will be in TOK. [{{#file: "parser class"}} Download script: parser class ]

NB. =========================================================
NB. creation

create=: 3 : 0
  STR=: y
  TOK=: LEX1 lxmatches STR
  I=: _1
  if. (#TOK)>ie=. S_ERR i.~{."1 TOK do.
    'characters' unexpect ie
  end.
  next''
)

The static cover verb run, calls the top grammar rule line. [{{#file: "runner"}} Download script: runner ]

NB. =========================================================
NB. runner

run=: 3 : 0
  p=. ''
  try.
    p=. y conew 'pcalc'
    r=. line__p''
  catch.
    smoutput 13!:12 ''
    r=. i.0 0
  end.
  if. #p do. destroy__p'' end.
  r
)

Verb calc is the public interface for the parser. It accepts a string and returns the result. [{{#file: "public interface"}} Download script: public interface ]

NB. =========================================================
NB. public interface
calc_z_=: run_pcalc_

Token pump and accessor verbs will be defined as follows. Note, at the end of tokens, an implicit token EOF is inserted. [{{#file: "token pump"}} Download script: token pump ]

NB. =========================================================
NB. token pump and accessors

next=: 3 : 0
  I=: I+1
  if. I<#TOK do. SYM=: {.I{TOK else. SYM=: S_EOF end.
  1
)

str=: 3 : 0
  if. y>:#TOK do. 'EOF' return. end.
  's b n'=. y{TOK
  (b,:n)];.0 STR
)
pos=: 3 : 0
  if. y>:#TOK do. _1 return. end.
  1{y{TOK
)
val=: 3 : '0".str y'

unexpect=: 'symbol'&$: : (4 : 0)
  y=. {.y,I
  error=. 'Unexpected ',x,' "',(str y),'" at ',":pos y
  error assert 0
)

Grammar Rules

First pump interface verbs [{{#file: "grammar rules"}} Download script: grammar rules ]

NB. =========================================================
NB. parser part

accept=: 3 : '0:`next@.(SYM&=) y'
expect=: unexpect`1:@.accept

Finally the grammar rules [{{#file: "grammar rules"}} Download script: grammar rules ]

NB. =========================================================
NB. grammar rules

NB. factor ::= ('+'|'-')* ( num | '(' expr ')' )
factor=: 3 : 0
  s=. 1
  while. 1 do.
    if.     accept S_SUB do. s=. s*_1
    elseif. accept S_ADD do.
    elseif.              do. break. end.
  end.

  if.     accept S_NUM  do. r=. val I-1
  elseif. accept S_LPAR do. r=. expr ''
          expect S_RPAR
  elseif.               do. 'factor' unexpect '' end.

  s*r
)

NB. term ::= factor ( ('*'|'%') factor )*
term=: 3 : 0
  r=. factor ''
  while. 1 do.
    if.     accept S_MUL do. r=. r * factor''
    elseif. accept S_DIV do. r=. r % factor''
    elseif.              do. break. end.
  end.
  r
)

NB. expr ::= term ( ('+'|'-') term )*
expr=: 3 : 0
  r=. term ''
  while. 1 do.
    if.     accept S_ADD do. r=. r + term''
    elseif. accept S_SUB do. r=. r - term''
    elseif.              do. break. end.
  end.
  r
)

NB. line ::= expr EOF
line=: 3 : 0
  r=. expr ''
  expect S_EOF
  r
)

Testing

The following are some test results, [{{#file: "test"}} Download script: test ]

NB. =========================================================
Note 'Tests'
«lexer test»

«accept test»

«reject test»
)

which is what we expect.

   calc '11+22'
33
   calc '5+2*10'
25
   calc '(5+2)*10'
70
   calc '(11+22)/-(3.0*2/2)'
_11
   calc '(11+22)*+(-1-2)'
_99
   calc '22+3/'
Unexpected factor "EOF" at _1
   calc '22+3/(1+)'
Unexpected factor ")" at 8
   calc '1+abc/2'
Unexpected characters "abc/2" at 2

Further Enhancements

The resulting parser gets the job done, but it's not very flexible in terms of extensibility for different token sourses and output targets. So the following enhancements can be considered. (See File:Recdesc2.ijs and diff.)

  • provide a separate lexer interface that encapsulates the token pump logic, so that different sources including unterminated streams are possible
accept=: 3 : '0:`next__lex@.sym__lex y'
expect=: 3 : 'unexpect__lex`1:@.accept y'
  • provide a separate consumer class family with AST node creation interface that will be called by the parser rules
term=: 3 : 0
  r=. factor ''
  while. 1 do.
    if.     accept S_MUL do. r=. r '*' binop__cons factor''
    elseif. accept S_DIV do. r=. r '%' binop__cons factor''
    elseif.              do. break. end.
  end.
  r
)
  • implementation of consumers will decide how to treat the calls: calculate, build tree, etc. Here is a complete consumer which builds a boxed tree (See results below).
NB. =========================================================
NB. Tree Consumer, static verbs
coclass 'pCalcConsTree'

binop =: 1 : (':';'<x,m;y')
unop  =: 1 : '<m;y'
val   =: <
  • the cover public verbs will use the provider pattern
eval_z_=: 'pCalcConsEval'&run_pCalcParse_
tree_z_=: 'pCalcConsTree'&run_pCalcParse_
  • better error handling using asserts -- releasing objects and reporting last error -- via the lexer, because it contains the token positions and values
  • skip space tokens in the lexer transparently to the parser

Here are some results of the tree consumer and new error messages.

   tree '5 + 2*10'
+-+-+--------+
|5|+|+-+-+--+|
| | ||2|*|10||
| | |+-+-+--+|
+-+-+--------+

   tree '22+3/(1+)'
|Unexpected factor ")" at 8: assert
|   error     assert 0

JSON Parser Example

JSON (JavaScript Object Notation) is a lightweight data-interchange format based on declarative syntax for JavaScript structural constant definitions. Besides primitive string and numeric constants it has lists [elm1, elm2, ...] and hash tables {key1 : val1, key2 : val2, ...}, whose elements are recursively defined.

For lexer here we use J ;: Words primitive, which almost perfectly suits our needs, with a few limitations: strings must be enclosed in apostrophes '', colon must be preceded by space, etc.

Both examples have the following grammar. A notable feature is that the bracketed structures are split between two entries.

NB. Val  ::= '[' List
NB.        | '{' Hash
NB.        | token
NB. List ::= ']'
NB.        | Val (',' Val)* ']'
NB. Hash ::= '}'
NB.        | Pair (',' Pair)* '}'
NB. Pair ::= Val ':' Val

File:Json.ijs uses a simpler token pump based only on the verb token.

File:Json1.ijs uses the accept/expect pattern as in pcalc above.

Lists are stored as boxed lists and hash tables as boxed two-row tables of keys and values. For example, for input

T1=: 0 : 0
  [123, 'ab c', {qq : zz, asb :[12,34]},
    [3, 5, [], {a :[]}, zz]]
)

the result is

   p=. conew'ppar1'
   >parse__p T1
+---+------+------------+-------------+
|123|'ab c'|+--+-------+|+-+-++---+--+|
|   |      ||qq|asb    |||3|5||+-+|zz||
|   |      |+--+-------+|| | |||a||  ||
|   |      ||zz|+--+--+||| | ||+-+|  ||
|   |      ||  ||12|34|||| | ||| ||  ||
|   |      ||  |+--+--+||| | ||+-+|  ||
|   |      |+--+-------+|+-+-++---+--+|
+---+------+------------+-------------+
   codestroy__p''

See Also