User:Doug Mennella/RegularExpressions

From J Wiki
Jump to navigation Jump to search

Parsing regular expressions

The main motivation for this page is to provide a slick skin of an API around the code presented in this article on Nondeterministic Finite Automata. I.e. to take some text representation of a regular expression parse it into a tree and walk the tree to generate a state machine.

The first order of business is thus to parse a regular expression. To keep things concrete we'll be looking at this example for parsing a number:

r=:'[\+-]?[0-9]+(\.[0-9]+)?([Ee][\+-]?[0-9]+)?'

At a macro level, regular expressions are groups of alternatives each of which can contain alternatives. These will all end up as nodes of our tree. Thus if we have an expression (A|B|C) Then we'll have a node for the group as a whole and one for each subexpression A, B and C. Actually, A, B and C will be subexpressions which get broken down further so instead of making them nodes we'll have the node representing the subexpression be represented by |. We start by forcibly adding one for the first alternative in each group. In this case we'll have (|A|B|C) with the open paren representing the group and each successive vertical bar representing the subexpressions to be further parsed.

But before we do that we'll need to distinguish special characters like an open paren and a vertical bar from their use as pure text. In the string the pure text interpretation will be escaped with a backslash. We'll use this to mark which characters are pure text and avoid interpreting them as special characters.

Let's get started.

There is a classic trick for turning off every other 1 in a run of 1's in a Boolean vector. We'll use this for marking which characters are escaped by marking where all the escapes are and then unmarking every other one in a run of them (because escapes can be escaped). The trick is to use a left fold of > over the Boolean vector.

E.g.

   0]F:.> 1 1 0 1 1 1 0 0 1 0 1 1 1 1
1 0 0 1 0 1 0 0 1 0 1 0 1 0

One way to understand how this works is that for Boolean vectors > is just x AND -.y where y is the accumulator. Thus as long as x is 1 we keep toggling y, but if x is 0 then the value is 0.

Thus our mask function just uses this followed by a shift to mark what's being escaped rather than the escape characters.

msk =: _1|.!.0 0:]F:.>'\'=]

With that done we can massage our input as described above taking care to ignore characters which are escaped. There's one more tiny bit of massaging though. We surround the entire input with parens to represent the outermost group thus giving us a root node for the parsed expression.

   exp =: (]#!.'|'~1+0j1 * msk<'('=])&:('(',')',~])
   exp r
(|[\+-]?[0-9]+(|\.[0-9]+)?(|[Ee][\+-]?[0-9]+)?)

Now we have parens representing the grouped nodes and vertical bars representing a node for each sub-expression, though this is all still packed into a string. To break this down into a tree we use another classic trick of counting the levels of parentheses. In other words counting the depths of the parentheses. The idea being to up our count at each open parenthesis and decrease it at each close parenthesis. We use a function in which we pass the delimiters so that it can be used with square brackets or braces, etc.

   dp0 =: ({.-~+/\@:-/)@(msk@] <"1="0 _)
   '()' dp0 exp r
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 0

This is only slightly different than the classic trick of +/\@:-/@('()'="0 _ ]). For one, we consider only delimiters which are not masked. The other point is that the classical count necessarily has the count of each pair of opening and closing parentheses at different depths. To level things off we decrease the level of the opening parens by subtracting 1 from the calculated depth. This also ensures that parentheses are one depth lower than their contents.

Lining up the depths with content we can begin to see the shape of the tree taking place:

   (,:'()';@:(":"0)@:dp0]) exp r
(|[\+-]?[0-9]+(|\.[0-9]+)?(|[Ee][\+-]?[0-9]+)?)
01111111111111122222222211122222222222222222110

Every time the depth increases by 1 we're descending down the tree and when the depth increases, we're backing up and following siblings. In fact the order here is a depth-first preordering of the nodes of the tree we're preparing to create.

Now for my favorite trick. Currently the vertical bars are at the same depth as the expressions that follow them, but our goal is have them represent the root of the associated subtree. I.e. we'd like their depth to be one higher than the parsing of the expression. We'd like to effectively "push down" the expressions. If we double all the depths, that should make room for the vertical bars to have depths between its parent and its associated expression.

   (,:'()';@:(":"0)@:(('|'=])-~2*dp0)]) exp r
(|[\+-]?[0-9]+(|\.[0-9]+)?(|[Ee][\+-]?[0-9]+)?)
01222222222222234444444422234444444444444444220

Magic! This only really works because we know that every expression has an associated vertical bar. That is, descending a tree increases depth by at most one, so we must have a step between the original parent and the expression whose depth is doubled. Of course to be extra careful we should consider only unescaped vertical bars.

Let's visualize the depth a bit differently:

 
   |:@(#"0~'()'(('|'=])-~2*dp0)]) exp r
 |[\+-]?[0-9]+(|\.[0-9]+)?(|[Ee][\+-]?[0-9]+)? 
  [\+-]?[0-9]+(|\.[0-9]+)?(|[Ee][\+-]?[0-9]+)? 
               |\.[0-9]+   |[Ee][\+-]?[0-9]+   
                \.[0-9]+    [Ee][\+-]?[0-9]+  

This makes it a bit easier to visualize the tree. Before handling the remaining nodes, we'll briefly talk about our actual representation of the structure of this tree as a parent vector. This is covered more slowly here, but the basic idea is to keep track of all the edges of the tree as a graph. Since each edge is between a child and a parent and each child has at most one parent, it suffices to keep track of the parent of each child node. Each node but the root is a child node, but we make the parent a child node with a bit of sleight of hand by declaring it to be a child of itself.

In this depth rendering, each character represents a node which we can identify with its index. How do we find the index of the parent of a given node given a depth vector? Since the depth vector is listed in depth-first pre-order (I.e. from walking the tree in depth first order) we need to find the closest index to our left at depth one less that ourselves.

There are various ways to perform this calculation. We use one Marshall Lochbaum presented in the J forum. Using this and the tree rendering algorithm presented in the previous link we get a more traditional rendering of the tree we have so far:

 
   ((<"0@[)tree'()'pv_flat@:(('|'=])-~2*dp0)]) exp r
  ┬                                                                                                                          
  (                                                                                                                          
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐  
  |                                                                                                                       )  
  ├──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬───────────────────────┬──┬──┬───────────────────────────────────────────────┬──┐     
  [  \  +  -  ]  ?  [  0  -  9  ]  +  (                       )  ?  (                                               )  ?     
                                      │                             │                                                        
                                      |                             |                                                        
                                      ├──┬──┬──┬──┬──┬──┬──┐        ├──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐           
                                      \  .  [  0  -  9  ]  +        [  E  e  ]  [  \  +  -  ]  ?  [  0  -  9  ]  +  

We also make "character classes" be nodes of themselves with their content as children. This is simply the same as the depth calculation we used for grouping with parentheses. These character classes should add subtrees to the tree we've constructed so far which means we simply make their content deeper than it is currently. This amounts to simply adding the two depth vectors! Note that we aren't doing any validation of the input here and assume we don't have nested character classes or classes which span across grouped expressions. We could check for this, but we'll ignore it as an issue for now.

So far we have the following:

dp=:{{
  d0 =.((msk<'|'&=)-~2*'()'dp0])
  d1 =.'[]' dp0 ]
  (d0+d1) y
}}
   ((<"0@[)tree pv_flat@dp) exp r
  ┬                                                                                                        
  (                                                                                                        
  ├─────────────────────────────────────────────────────────────────────────────────────────────────────┐  
  |                                                                                                     )  
  ├────────┬──┬──┬────────┬──┬──┬────────────────────┬──┬──┬──────────────────────────────────────┬──┐     
  [        ]  ?  [        ]  +  (                    )  ?  (                                      )  ?     
  ├──┬──┐        ├──┬──┐        │                          │                                               
  \  +  -        0  -  9        |                          |                                               
                                ├──┬──┬────────┬──┐        ├─────┬──┬────────┬──┬──┬────────┬──┐           
                                \  .  [        ]  +        [     ]  [        ]  ?  [        ]  +           
                                      ├──┬──┐              ├──┐     ├──┬──┐        ├──┬──┐                 
                                      0  -  9              E  e     \  +  -        0  -  9      

At this point, we really have no more use for the closed parentheses or closed brackets or even the escape character as long as we keep track of which characters are masked. We remove these characters in parallel with their associated depths and the mask vector. Perhaps overkill here, but our strategy is to shuffle them to the back of each of these lists and then chop of the tail.

prune =: (-@(+/)@]}./:)&.:>"0 1

Note that as long as you're deleting leaf nodes this doesn't disturb the structure of the depth vector. I.e. we're not getting in the way of any paths using a depth-first search.

The final step of our tree construction is to make "modifiers" like + or * or ? be parents of what they're modifying. This is going to be tricky to do using the depth vector because our doubling trick isn't going to work per the caveat above. Instead we generate a parent vector and "reparent" the modified "fragments" with their associated modifiers. The modifiers will modify their sibling to the left. That is, we'll first group our nodes by parent using the parent vector, look for modifiers and collect their indices and those of their siblings to the left. Finally we use amend to do the reparenting.

prs=:{{
  d =. dp e=.exp y
  msk =. msk e
  'e d msk'=.(e;d;msk) prune (msk < '])\' e.~ ])e
  p =. pv_flat d
  'frg mod' =. |:@;@(<@:|:@:(([#~"1(],:1|.!.0]))(msk{~[)<'?+*'e.~e{~])/. i.@#)p
  p =. frg mod} p
  (e;p)
}}
   {{(<"0 x) tree y}}&>/prs r
  ┬                                               
  (                                               
  │                                               
  |                                               
  ├─────┬────────┬───────────┐                    
  ?     +        ?           ?                    
  │     │        │           │                    
  [     [        (           (                    
  ├──┐  ├──┬──┐  │           │                    
  +  -  0  -  9  |           |                    
                 ├──┐        ├─────┬─────┐        
                 .  +        [     ?     +        
                    │        ├──┐  │     │        
                    [        E  e  [     [        
                    ├──┬──┐        ├──┐  ├──┬──┐  
                    0  -  9        +  -  0  -  9  

The code

NB. http://www.jsoftware.com/pipermail/programming/2021-May/058263.html

pv_flat =: 3 : 0
  p =. 2| g2 =. /: ,y+/i.2
  g =. <.g2%2
  g /:~&:((-.p)&#) (>./\p*i.#p){g
)

prune =: (-@(+/)@]}./:)&.:>"0 1

msk =: _1|.!.0 0:]F:.>'\'=]
exp =: (]#!.'|'~1+0j1 * msk<'('=])&:('(',')',~])
dp0 =: ({.-~+/\@:-/)@(msk@] <"1="0 _)

dp=:{{
  d0 =.((msk<'|'&=)-~2*'()'dp0])
  d1 =.'[]' dp0 ]
  (d0+d1) y
}}

prs=:{{
  d =. dp e=.exp y
  msk =. msk e
  'e d msk'=.(e;d;msk) prune (msk < '])\' e.~ ])e
  p =. pv_flat d
  'frg mod' =. |:@;@(<@:|:@:(([#~"1(],:1|.!.0]))(msk{~[)<'?+*'e.~e{~])/. i.@#)p
  p =. frg mod} p
  (e;p)
}}

Generating state machines

The next step is to walk this tree and generate a state machine for parsing text. Now's a good time to review the wiki on Nondeterministic Finite Automata linked above. Basically every interior node of our tree corresponds to one of the constructions above. Groups are represented in the tree by left parentheses which correspond to the construction orf applied to the clauses of group. Each clause, represented in the tree by a vertical bar, corresponds to the construction seqf on the constituents of that clause, etc. The leaves of our tree correspond to raw character text which we assemble into a state machine component using chf.

There is one tiny hitch here. A plain . as leaf might be plain text if it's escaped or represent any character if it is not. We'll employ a trick here to ensure that the former ends up as a leaf of our node and the latter as an interior node with a single dummy child. When walking our tree we won't descend into these dummy nodes but rather simply return the construction for any character on their parents. The adjustment to prs is pretty trivial. We simply look for unmasked . nodes and append their indices to the parent vector and an arbitrary character to the values vector.

   d =. I. msk AND '.'=e
   (e,'.'#~#d);p,d

Now for the walking! We'll be walking down the tree from the root recursively calling our function on the children of our node and then applying the appropriate construction for our node to the results. The parent vector naturally lends itself to walking up a tree. To walk down the tree we'll need a parent to child mapping (stripping identical children for self-parenting nodes).

   ]k=. ~.p
0 5 2 1 10 6 19 11 12 18 14 34 20 21 22 28 25 33 29
   ]v=.a:,~((<@-.~)/.. i.@#) p
┌─┬─┬───┬──────────┬─┬─────┬──┬──┬─────┬──┬────────┬──┬──┬────────┬─────┬──┬─────┬──┬────────┬┐
│1│2│3 4│5 10 19 34│6│7 8 9│11│12│13 18│14│15 16 17│20│21│22 28 33│23 24│25│26 27│29│30 31 32││
└─┴─┴───┴──────────┴─┴─────┴──┴──┴─────┴──┴────────┴──┴──┴────────┴─────┴──┴─────┴──┴────────┴┘
   #k
19
   #v
20

Here k contains the indices of parent vectors and v the indices of their respective children. We add an empty box at the end v since we'll be looking up keys in k and finding their children. in v this final value acts as a default when we can't find a key in k.

Our tree walk will be a conjunction which uses these as operands similar to dfo. The code should be pretty easy to follow.

   disp e cmp p
┌───┬───────────────────────────────────────────────────────────────────────┐
│0 1│┌┬──────────┬──────────┬──────────┬──────────┬──┬──────────┬──────────┐│
│   │││+-        │          │          │          │  │          │          ││
│   │├┼──────────┼──────────┼──────────┼──────────┼──┼──────────┼──────────┤│
│   │││0123456789│0123456789│          │0123456789│  │          │0123456789││
│   │├┼──────────┼──────────┼──────────┼──────────┼──┼──────────┼──────────┤│
│   │││          │          │.         │          │  │          │          ││
│   │├┼──────────┼──────────┼──────────┼──────────┼──┼──────────┼──────────┤│
│   │││          │          │0123456789│0123456789│  │          │0123456789││
│   │├┼──────────┼──────────┼──────────┼──────────┼──┼──────────┼──────────┤│
│   │││          │          │          │          │Ee│Ee        │          ││
│   │├┼──────────┼──────────┼──────────┼──────────┼──┼──────────┼──────────┤│
│   │││          │          │          │          │  │+-        │          ││
│   │├┼──────────┼──────────┼──────────┼──────────┼──┼──────────┼──────────┤│
│   │││          │          │          │          │  │0123456789│0123456789││
│   │└┴──────────┴──────────┴──────────┴──────────┴──┴──────────┴──────────┘│
└───┴───────────────────────────────────────────────────────────────────────┘

the code

NB. No conformance check
clsf =: {{
  r=.a.{~;({.r)+&.><@i."0>:-~/r=.a.i.y{~_1 1+"0 1 i=.(0,<:#y)-.~'-'I.@:=y
  chf r,y([#~]-.@e.~i.@#@[)i
}}

cmpstp=:{{
  chld =. > n {~ m i. y
  NB. leaves are just raw character data
  if. -. * # chld do.
    chf y { x return.
  end.

  NB. Inner nodes
  select. y { x
  case. '(' do. ]F.: orf x (m cmpstp n)"_ _1 chld
  case. '|' do. ]F.: seqf x (m cmpstp n)"_ _1 chld
  case. '[' do. clsf x {~ chld
  case. '.' do. chf a.
  case. '+' do. repf x m cmpstp n {. chld
  case. '*' do. kleenestar x m cmpstp n {. chld
  case. '?' do. optf x m cmpstp n {. chld
  end.
}}

cmp=: {{ x (~.y) cmpstp (a:,~(<@-.~/.. i.@#) y) 0 }}