Essays/Regex Lexer

Lexer (or scanner) is a program that performs lexical analysis, breaking a stream of characters into a list of consecutive groups, called tokens. Download File:Rxlex.ijs.

Lexer

Lexical analysis is typically performed by a sequential machine, such as dyad ;: Sequential Machine. However, it requires the state table, which is very hard to construct by hand for anything but very simple rules.

To simplify lexer construction, tools such as UNIX lex are used, which are based on regular expressions. And implementations of regular expressions matching, themselves employ state machines.

So having regular expressions facility, such as regex library in J, it is already possible to build a decent lexer in a fairly straightforward way.

Principles

We will start with a simple case. Tokens are alphabetical words separated with spaces:

TEST1a=: 'one two   three'              NB. normal test
TEST1b=: 'one two1 three'               NB. error
TEST1c=: ($~ 1000*#)'one two   three '  NB. 16000 chars

To represent a token, we will use a regex group, i.e. a subpattern within (...). An instance of match will be the disjunction of such groups, (...)|(...)|...|(...). Then the token ID will be the index of the group in the match array.

Thus we obtain for our case,

LEX1=: '(\s+)|([a-z]+)|(.+)'            NB. space | word | error

We are ready to roll. But first we apply some notation sugar Utilities at the end of this text.

   LEX1 lxview TEST1a
1  0 3 [one]
0  3 1 [ ]
1  4 3 [two]
0  7 3 [   ]
1 10 5 [three]

   LEX1 lxview TEST1b
1 0 3 [one]
0 3 1 [ ]
1 4 3 [two]
2 7 7 [1 three]

The format of lxview is character rows per token of (index,start,size,content).

It uses numeric matrix of lxmatches with rows of (index,start,size).

Performance is quite reasonable, considering the amount of effort to define the lexer.

   ts 'LEX1 lxmatches TEST1c'
0.903204 396032

Details

As with other lexer builders, there are certain factors that need to be considered to obtain a successful lexer.

Defining Groups

Groups in the regex correspond to token types, so it is important not to introduce any groups within or outside tokens. Instead, we use non-capturing subpatterns (?:...).

The groups are matched on found-first basis, so care should be taken about the order of groups.

Error Handling

Error in lexical analysis can be thought of as a non-match, or in other words, a gap between two matches.

While it could be possible to handle a non-match in the loop external to regex, it is simpler to delegate it to regex itself by providing the last all-absorbing group, e.g. (.+), whose maximum index will signify that all previous groups failed.

Extended Definition

For more comples lexers one line of regex is not convenient, so we will use extended syntax that will allow us to format the definition with one line per token and convenient comments.

Here is how the same case will look like,

LEX2=: noun lxdefine
  ( \s+    )|# 0 space
  ( [a-z]+ )|# 1 words
  ( .+     ) # 2 error
)

With identical result

   (LEX1 lxmatches TEST1a) -: LEX2 lxmatches TEST1a
1
   (LEX1 lxmatches TEST1b) -: LEX2 lxmatches TEST1b
1

Note the time is the same, since it is compiled into the same form,

   ts 'LEX2 lxmatches TEST1c'
0.882967 396032

More Complex Example

Here we will create a prototype of XML lexer to handle input, such as

TEST3a=: '<a>qq</b>'
TEST3b=: ' <? zz ?><a>qq</b>'
TEST3c=: '<a> <!-- cc --> </b>'
TEST3d=: ' <? zz ?><a>qq</b><error'
TEST3e=: ' <? zz ?><a>qq</b>< error>'
TEST3f=: ($~ 1000*#)' <? zz ?><a>qq</b>'    NB. 18000 chars

We will only care about discerning various tags and text, without going into the tag level.

LEX3=: noun lxdefine
  ( <[a-z][^>]*>                   )|# 0 beg tag
  ( </[a-z][^>]*>                  )|# 1 end tag
  ( <\?(?:[^?]|\?[^>])*\?>         )|# 2 pi
  ( <!--(?:[^-]|-[^-]|--[^-]|)*--> )|# 3 comment
  ( [^<>]+                         )|# 4 text
  ( .+                             ) # 5 error
)

Here's what we've got

   LEX3 lxview TEST3a
0 0 3 [<a>]
4 3 2 [qq]
1 5 4 [</b>]
   LEX3 lxview TEST3b
4  0 1 [ ]
2  1 8 [<? zz ?>]
0  9 3 [<a>]
4 12 2 [qq]
1 14 4 [</b>]
   LEX3 lxview TEST3c
0  0  3 [<a>]
4  3  1 [ ]
3  4 11 [<!-- cc -->]
4 15  1 [ ]
1 16  4 [</b>]
   LEX3 lxview TEST3d
4  0 1 [ ]
2  1 8 [<? zz ?>]
0  9 3 [<a>]
4 12 2 [qq]
1 14 4 [</b>]
5 18 6 [<error]
   LEX3 lxview TEST3e
4  0 1 [ ]
2  1 8 [<? zz ?>]
0  9 3 [<a>]
4 12 2 [qq]
1 14 4 [</b>]
5 18 8 [< error>]

A pleasant surprise is that a more complex construct with more tokens takes less time

   ts 'LEX3 lxmatches TEST3f'
0.784982 789248

Case Study: Word Formation on Lines

Roughly according to Word Formation on Lines we build a lexer. Because of intuitive nature of regex, it takes only about 30 min to incrementally build an observable lexer.

LEX4=: noun lxdefine
  ( [0-9_][0-9a-z_\t .]*[.:]*      )|# 0 numeric
  ( NB\..*                         )|# 1 comment
  ( [a-z][0-9a-z_]*[.:]*           )|# 2 alpha
  ( '(''|[^'])*'                   )|# 3 string
  ( \r?\n|\r                       )|# 4 line break
  ( [ \t]+                         )|# 5 space
  ( .[.:]*                         )|# 6 symbol
  ( .+                             ) # 7 error
)

The following test cases were used

...
TEST4g=: 'ab_c: -./ i.2 3 4'
TEST4h=: 'ab_c: -./ i.2 3 NB. 3 4'
TEST4i=: 'ab ; ''cd''''ef'' ,": 2 3 4'
TEST4j=: '1 2+3 2',LF,'4 2',CR,'5 2',CRLF,'6 2'

An example of output

   LEX4 lxview TEST4i
2  0 2 [ab]
6  2 1 [ ]
7  3 1 [;]
6  4 1 [ ]
3  5 8 ['cd''ef']
6 13 1 [ ]
7 14 1 [,]
7 15 2 [":]
6 17 1 [ ]
0 18 5 [2 3 4]

However, performance of large input somewhat degrades. It needs to be further investigated if it is in J code calling regex library or intrinsically nature of regex. Here's Performance Monitor output

   z=: 1e5 $ 'abcolumb+boustrophedonic-chthonic*'
   load 'jpm'
   start_jpm_ ''
357142
   $LEX4 lxmatches z
17647 3
   showtotal_jpm_''
 Time (seconds)
+--------+------+--------+--------+-----+----+----+
|name    |locale|all     |here    |here%|cum%|rep |
+--------+------+--------+--------+-----+----+----+
|rxmatch |jregex|1.750778|1.040360| 57.6| 58 |7763|
|jregexec|jregex|0.688747|0.688747| 38.1| 96 |7763|
|[rest]  |      |        |0.078568|  4.3|100 |    |
|[total] |      |        |1.807675|100.0|100 |    |
+--------+------+--------+--------+-----+----+----+
    0 1 showdetail_jpm_'rxmatch_jregex_'
 Time (seconds)
+--------+--------+----+-------------------------------------------------------------------+
|all     |here    |rep |rxmatch_jregex_                                                    |
+--------+--------+----+-------------------------------------------------------------------+
|0.054887|0.054887|7763|dyad                                                               |
|0.082529|0.082529|7763|[0] if. lb=.32=3!:0 x do. ph=.>0{x else. ph=.x end.                |
|0.048568|0.048568|7763|[1] if. cx=.2=3!:0 ph do. hx=.rxcomp ph                            |
|0.075010|0.075010|   0|[2] else. rxlastxrp=:>1{((hx=.ph)-1)({"1)rxpatterns end.           |
|0.060418|0.038747|7763|[3] nsub=.rxnsub rxlastxrp                                         |
|0.912908|0.224161|7763|[4] rxlastrc=:>0{rv=.jregexec rxlastxrp;(,y);rxms;((2*rxms)$_1 0);0|
|0.043624|0.043624|7763|[5] if. cx do. rxfree hx end.                                      |
|0.045776|0.045776|7763|[6] m=.(nsub,2)$>4{rv                                              |
|0.039382|0.039382|7763|[7] t=.(0{"1 m)                                                    |
|0.159160|0.159160|7763|[8] m=.t,.-~/"1 m                                                  |
|0.085159|0.085159|7763|[9] m=._1 0((t=_1)#i.#t)}m                                         |
|0.143358|0.143358|7763|[10] if. lb do. (>1{x){m else. m end.                              |
|1.750778|1.040360|7763|total dyad                                                         |
+--------+--------+----+-------------------------------------------------------------------+
 Space (bytes)
+-----------+-----------+----+-------------------------------------------------------------------+
|all        |here       |rep |rxmatch_jregex_                                                    |
+-----------+-----------+----+-------------------------------------------------------------------+
|259,553,664|259,553,664|7763|dyad                                                               |
| 10,433,472| 10,433,472|7763|[0] if. lb=.32=3!:0 x do. ph=.>0{x else. ph=.x end.                |
|  2,980,608|  2,980,608|7763|[1] if. cx=.2=3!:0 ph do. hx=.rxcomp ph                            |
|  4,471,488|  4,471,488|   0|[2] else. rxlastxrp=:>1{((hx=.ph)-1)({"1)rxpatterns end.           |
|  1,490,496|  1,490,496|7763|[3] nsub=.rxnsub rxlastxrp                                         |
|270,483,968|  1,490,496|7763|[4] rxlastrc=:>0{rv=.jregexec rxlastxrp;(,y);rxms;((2*rxms)$_1 0);0|
|    496,832|    496,832|7763|[5] if. cx do. rxfree hx end.                                      |
|          0|          0|7763|[6] m=.(nsub,2)$>4{rv                                              |
|  3,974,656|  3,974,656|7763|[7] t=.(0{"1 m)                                                    |
|  1,490,496|  1,490,496|7763|[8] m=.t,.-~/"1 m                                                  |
|  7,949,312|  7,949,312|7763|[9] m=._1 0((t=_1)#i.#t)}m                                         |
|  1,490,496|  1,490,496|7763|[10] if. lb do. (>1{x){m else. m end.                              |
|564,815,488|295,822,016|7763|total dyad                                                         |
+-----------+-----------+----+-------------------------------------------------------------------+

Further Enhancements

Some very simple enhancements will make it possible to have more powerful functionality preserving the simplicity of design.

Our lexer so far uses rxmatches, which contains a loop. If we make our own loop, we would have more control over processing tokens, such as non-match handling.
If we own the loop, we could make the contexts feature, i.e. have several sets of groups, which are switched based on occurence of a particular token, i.e. a simple high-level state machine.
The comment space in the definition can be used to host J expressions to process the tokens on the go, the evaluator stage.

Utilities

These definitions do not have any additional functionality, just facilities to present input and format results.

require 'regex'

lxdefine=: 1 : '''(?x)'' , m : 0'
lxresult=: ((1 i.~[:*@,(1 1,:_ 1)&(];.0)) , {.)"2
lxmatches=: lxresult@rxmatches
lxfrom=: >@(([:' ['&,,&']')&.>)@rxfrom
lxview=: lxmatches (":@[ ,. }."1@[ lxfrom ]) ]