Essays/RegularExpressions/NondeterministicFiniteAutomata

From J Wiki
Jump to navigation Jump to search

(Based loosely on material originally presented in the j programming forum and organized vaguely in the style of a J session, where code with line numbers represents content from a "scripting window" and other code represents content from a "session window".)

Chapter 1: Matching

Overview

A regular expression is an expression which identifies a regular language.

For example (a(b|cd))+ matches the strings "ab", "acd" and "abacd" and "abababab" but does not match the string "abcd". The corresponding "regular language" here would be the infinite set of strings that matched by this expression.

If we diagramed a state machine for (a(b|cd))+ with 0 for our starting state and 4 for our "matched" state, it might look something like Abcd.png

This is a bit different from a classic diagram of a state machine, because some of our letters lead to multiple states. So how do we choose?

One approach, when matching, is to recursively try each possibility on the remaining string - if the recursive match fails we try a different option. If it succeeds we have a match. (This can lead to exponentially slow processing when the possibilities are many.)

Another approach, which J supports nicely, is to try all of the possibilities on the remaining string. Instead of a single state index, we use a (possibly empty) list of state indices. Here, when matching 'abacd' we start out in state 0. After encountering an 'a' we are in states 1 2, after encountering the 'b' we are in states 0 4 (state 1 leads to states 0 and 4 for b, state 2 leads nowhere for b), and the next 'a' puts us back in states 1 2, then the 'c' puts us in state 3 and the 'd' puts us in states 0 4 again. And that 4 means we have a match. In other words, if our list of states contains a 4 when we reach the end of the string that means that we have a match.

To represent a regular expression like this, we might use a bit array with three dimensions, representing three different kinds of sets in a regular expression recognizer:

  1. The first dimension represents "current state"
  2. The second dimension represents "ascii character"
  3. The third dimension represents "next state"

This is not quite complete -- we also need to know which state(s) to start in, so a regular expression will actually be a boxed pair, with the first pair being a bit list with 1 for current state, and the second box will contain the transition array.

This representation is a bit bulky, so for display purposes it's more meaningful to collapse the second dimension, replacing it with a box indicating the matching letters.

But it's also convenient for evaluation purposes to instead collapse the third dimension, replacing it with a box of next state indices.

Foreshadowing

We can convert between these two "collapsed representations" of a regular expression, first converting to the bit array then collapsing the other dimension:

19NB. relatively compact display of regular expression state transitions
20NB. rows: initial state, columns: subsequent state
21NB. state is represented as a list of indices into the transition array
22disp=: <@(#&a.)"1@(1&|:)@:(((e.~i.)1+0>.>./)@>)^:L.L:1

We can write a routine to translate a string into a regular fragment which would match that string (note that this needs to be viewed in a fixed width font, and note that we're building up to the implementation of stringf - its definition is further down):

   disp stringf 'THIS'
┌─┬──────────┐
│0│┌┬─┬─┬─┬─┐│
│ │││T│ │ │ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │H│ │ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │I│ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │ │S││
│ │└┴─┴─┴─┴─┘│
└─┴──────────┘

Or, if I express this in the pcre syntax for regular expressions, it would be simply: 'THIS'

A "regular fragment" is just like the "regular expression" data structure I described, except that the details of the final state are not specified.

In this display the 0 in the leftmost box means that this fragment has only state 0 for its initial state. And in the second box we see the state definitions (from top to bottom) and the states we transition to (from left to right) and the character which performs that transition (in the inner boxes).

But also notice that we have more columns than rows. Row indices are for our "current state", and column indices are for our "next state". But we also need to represent the "matched" state and it's simple if we keep this matched "pseudo-state" independent of our states with potential future transitions. So we'll always have one more column than rows when presenting the transitions in this fashion.

Character Expressions

So... how do we build a machine representation of a regular expression?

First, I needed something that would build a transition for a set of characters:

1ch=: a.&i.
2NB. regular expression matching a character (or set of characters)
3chf=: (,0) ,&< {{ ,: (<,1) (ch y) }256#a: }}
4
5NB. regular expression matching a range of characters (like '0' through '9')
6rangef=: [: chf (<. + {{i.1+|y-x}})&.ch

Here's how a raw display of the regular expression data structure for a set of characters looks (which is difficult to read, because it occupies so much space on the screen and because we do not have any column headers), along with a more compact "display form":

   chf 'THIS'
┌─┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│0│┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬─┬─┬┬┬┬┬┬┬┬┬┬─┬─┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐│
│ ││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││1│1││││││││││1│1│││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││
│ │└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴─┴─┴┴┴┴┴┴┴┴┴┴─┴─┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘│
└─┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
   disp chf 'THIS'
┌─┬───────┐
│0│┌┬────┐│
│ │││HIST││
│ │└┴────┘│
└─┴───────┘

In other words, any of the characters in the argument to chf will give us a transition from the first state to the final state of this fragment. In the pcre syntax for regular expressions, this example would be '[THIS]'.

As an aside, later on, we'll want to be working with ranges of characters 0 through 9, a through z, etc. So rangef is a convience routine for repersenting these kinds of expressions:

   rangef/'AZ'
┌─┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│0│┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐│
│ │││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│1│││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││
│ │└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘│
└─┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
   disp rangef/'AZ'
┌─┬─────────────────────────────┐
│0│┌┬──────────────────────────┐│
│ │││ABCDEFGHIJKLMNOPQRSTUVWXYZ││
│ │└┴──────────────────────────┘│
└─┴─────────────────────────────┘

Next, we need a routine to let us combine fragments sequentially. This is a bit more complicated:

 7NB. regular expression y follows regular expression x
 8seqf=: {{
 9  NB. y starts match up with x transitions
10  NB. x starts remain 
11  NB. (and incorporate [adjusted] y starts for optional x)
12  'Ys Yt'=. y+L:0 Xn=. #Xt['Xs Xt'=. x
13  Rs=. ~.Xs,(Xn e. Xs)#Ys
14  Rs;<(~.@,&Ys^:(Xn&e.)L:0 Xt),Yt
15}}
16
17NB. string to regular expression
18stringf=: seqf/@:(chf"0)

When I combine my initial states, the final state of the first fragment becomes any of the initial states in the second fragment. Also, any transition to the final state of the first fragment becomes a transition to any of the initial states of the second fragment.

   stringf 'THIS'
┌─┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│0│┌┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬─┬─┬┬┬┬┬┬┬┬┬┬─┬─┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┬┐│
│ ││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││ │ ││││││││││ │1│││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││
│ │├┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼─┼─┼┼┼┼┼┼┼┼┼┼─┼─┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┤│
│ ││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││2│ ││││││││││ │ │││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││
│ │├┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼─┼─┼┼┼┼┼┼┼┼┼┼─┼─┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┤│
│ ││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││ │3││││││││││ │ │││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││
│ │├┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼─┼─┼┼┼┼┼┼┼┼┼┼─┼─┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┼┤│
│ ││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││ │ ││││││││││4│ │││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││││
│ │└┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴─┴─┴┴┴┴┴┴┴┴┴┴─┴─┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┴┘│
└─┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Revisiting restructuring

Here's our routine to turn the regular expression into a hopefully more readable display form:

19NB. relatively compact display of regular expression state transitions
20NB. rows: initial state, columns: subsequent state
21NB. state is represented as a list of indices into the transition array
22disp=: <@(#&a.)"1@(1&|:)@:(((e.~i.)1+0>.>./)@>)^:L.L:1

The raw regular expression data structure has two boxes - the starting state and the nested transition table. Here, ^:L.L:1 operates only in the transition table box, leaving the starting state untouched.

Next, the expression (((e.~i.)1+0>.>./)@>) turns the transition table into a bit array. It opens each box, finds the largest state index in each box and then forms a list of integers containing at those state indices and uses e. to set the needed bits. A little more work is done here to deal with empty boxes (>./ is __ so 0>.>./ is 0). This gives us a three dimensional bit array, with dimension indices: starting state, ascii character, ending state.

This three dimensional bit array would perhaps be best represented as a sparse array. J currently does not support mixing boxes with sparse arrays, but we can remove the array from its box:

   $.1{::(((e.~i.)1+0>.>./)@>)^:L.L:1 stringf 'THIS'
0 84 1  1
1 72 2  1
2 73 3  1
3 83 4  1

Finally, (1&|:) moves the character index dimension to become the last dimension, and <@(#&a.)"1 turns the character dimension into boxes containing the selected characters.

   disp stringf 'THIS'
┌─┬──────────┐
│0│┌┬─┬─┬─┬─┐│
│ │││T│ │ │ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │H│ │ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │I│ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │ │S││
│ │└┴─┴─┴─┴─┘│
└─┴──────────┘

Conditional Expressions

Another thing we might want to do involves alternation -- we have two regular expression fragments and we want to have a regular expression that matches if either of our alternatives match. That's something like this:

23NB. if either regular expression x or y would match,
24NB. then result would match
25orf=: {{
26  'Xs Xt'=. x ['Ys Yt'=. y
27   adjX=: (+(#Yt)*(#Xt)&=)L:0
28   adjY=: (#Xt)&+L:0
29   (~. (adjX Xs),adjY Ys) ;< (adjX Xt),adjY Yt
30}}

Here, our new start state is almost the start states of our two options strung together. We have to adjust the index number of the final state from our x expression and the index number of all non-initial states of our y expression.

So now we can do something like this:

   disp (chf 'b') orf (chf 'c') seqf chf 'd'
┌───┬───────┐
│0 1│┌┬┬─┬─┐│
│   ││││ │b││
│   │├┼┼─┼─┤│
│   ││││c│ ││
│   │├┼┼─┼─┤│
│   ││││ │d││
│   │└┴┴─┴─┘│
└───┴───────┘

This corresponds to the pcre expression 'b|cd'. It matches the string ,'b' and the string 'cd', but no other strings. We have three rows here, so 3 is our matching state. When matching the string 'b' row 0 takes us to row 4. When matching the string 'cd', row 1 takes us to row 2 when we encounter the 'c', and then row 2 takes us to state 3 when we encounter the 'd'. Note that we now start with two states (row 0 and row 1 of our transition matrix). We can't end our 'b' and 'c' transitions in the same state because 'd' is valid after 'c' but not after 'b' - technically we could have started them from the same state, but the code is simpler if it doesn't have to account for that variable. (We could write an optimizer to merge states if their non-empty boxes would not overlap and all transitions to those states always lead to both states. But when we're constructing the data structure, knowing what states will lead to our current "starting state(s)" replaces some simple concepts with more complicated concepts.)

Now watch what happens if I add an 'a' in front of this last expression (something like this pcre expression: 'a(b|cd)'):

   disp (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
┌─┬──────────┐
│0│┌┬─┬─┬─┬─┐│
│ │││a│a│ │ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │ │b││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │c│ ││
│ │├┼─┼─┼─┼─┤│
│ │││ │ │ │d││
│ │└┴─┴─┴─┴─┘│
└─┴──────────┘

Now we only have one starting state. But an 'a' in that starting state gives us a transition to two different states - this conceptually matches the 0 1 state we had previously, but here it's 1 2 instead.

Repeating Expressions

Another thing we might want to do with a regular expression fragment is to say that it can be repeated an arbitrary number of times. This is equivalent to saying that any transition to the end state must also be a transition to the start states.

31NB. regular expression which may be repeated indefinitely
32repf=: {{ Ys;<(~.@(0&,)^:((#Yt)&e.))L:0 Yt ['Ys Yt'=. y}}
   NB. pcre:  '(a(b|cd))+'
   disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf chf 'd'
┌─┬───────────┐
│0│┌─┬─┬─┬─┬─┐│
│ ││ │a│a│ │ ││
│ │├─┼─┼─┼─┼─┤│
│ ││b│ │ │ │b││
│ │├─┼─┼─┼─┼─┤│
│ ││ │ │ │c│ ││
│ │├─┼─┼─┼─┼─┤│
│ ││d│ │ │ │d││
│ │└─┴─┴─┴─┴─┘│
└─┴───────────┘

(Note that non-empty boxes in the left-most column of the transition table are for transitions which "restart" the regular expression - 0 is always a part of the starting state. And, similarly, non-empty boxes in the right-most column of the transition table are for transitions which would match - if that transition happens for the last character of the string.)

We might also what to implement an optional regular expression. This is equivalent to saying that the end state is an initial state:

33NB. regular expression which also matches the empty string
34optf=: {{ (~.Ys,#Yt);<Yt ['Ys Yt'=.y }}
   NB. pcre:  'a?'
   disp optf chf 'a'
┌───┬────┐
│0 1│┌┬─┐│
│   │││a││
│   │└┴─┘│
└───┴────┘

Or, for example

   NB. pcre: '(a|(b|cd?))+'
   disp repf (chf 'a') seqf (chf 'b') orf (chf 'c') seqf optf chf 'd'
┌─┬───────────┐
│0│┌─┬─┬─┬─┬─┐│
│ ││ │a│a│ │ ││
│ │├─┼─┼─┼─┼─┤│
│ ││b│ │ │ │b││
│ │├─┼─┼─┼─┼─┤│
│ ││c│ │ │c│c││
│ │├─┼─┼─┼─┼─┤│
│ ││d│ │ │ │d││
│ │└─┴─┴─┴─┴─┘│
└─┴───────────┘

(Notice that our rows here correspond to individual letters in the pcre presentation of the expression. If the same letter appeared multiple times in the expression, this would show up as the same letter appearing in separate rows.)

Matching, testing, ...

Here are a few other routines, and some tests:

35match=: {{
36  i=. i.#t ['s t'=. y
37  for_c. a.i.x do.
38    s=. ~.;c{"1 t#~ i e. s
39  end.
40  (#t)e.s
41}}
42
43assert (disp chf'a')-:,L:0]0;<,:'';'a'
44assert (disp 'a' seqf&chf 'b')-:,L:0]0;<('';'a'),:'';'';'b'
45assert (disp 'a' orf&chf 'bc')-:0 1;<a:,.a:,.;:'a bc'
46assert (disp optf chf 'a')-:,L:0]0 1;<,:'';'a'
47assert (disp repf chf 'a')-:,L:0]0;<,:'a';'a'
48assert 'abc' match stringf 'abc'
49assert 'aaab' match (repf chf 'a') seqf chf 'b'
50assert 'aaab' match (repf optf chf 'a') seqf chf 'b'
51assert 'b' match (repf optf chf 'a') seqf chf 'b'
52assert -.'aaabb' match (repf optf chf 'a') seqf chf 'b'

Note that pcre has an irregular "reference" feature associated with its parenthesis. It also offers a more complex syntax that provides groups that cannot be referenced.

Further notes

Also, note that there are a variety of optimizations that could be implemented, which I did not implement. For example

   NB. pcre 'a|b'
   disp 'a' orf&chf 'b'
   NB. pcre '[ab]'
   disp chf 'ab'

These two fragments match exactly the same way, but the data structures are different. We could build an code which translates the larger representation to the more compact representation.

Another kind of optimization involves a scanner that instead of scanning sequentially scans every nth character to see if the match would be even possible. This could be difficult for the general case, but for a simple string match it's very easy to implement.

It's also perhaps worth implementing reference-able sub-matches. But that makes the matching algorithm considerably more complicated. The pcre code handles this by using an algorithm with exponential (slow) performance in some cases.

Note that I did not show any example like 'a*' in pcre. Here are two ways of doing that:

   disp optf repf chf 'a'
   
   disp repf optf chf 'a'

We could even define:

53kleenestar=: optf@repf

(named after Stephen Cole Kleene, who basically invented regular expressions)

and with the code presented here, this should always produce the same result as repf@optf

Application Note

Let's build a regular expression to match a ip address (ipv4):

54rseqf=: [:orf/ ([:seqf/ ({.rangef{:)@>@cut)@>
55byte=: rseqf '09';'19 09';'1 09 09';'2 04 09';'2 5 09'
56ipv4=: seqf/7$byte,:chf'.'

In other words, to match a byte (here an unsigned number representable with 8 binary digits, but actually represented in "base 10" where "10" is the traditional 10 which counts the typical number of digits which we have on our hands) we want to build sequences of digit ranges and if any of the defined sequences match, our string represents a byte (put differently: an integer in the range 0 through 255). And, an ipv4 address consists of four bytes separated by dots.

This corresponds to the pcre regular expression ([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-9])[.]([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-9])[.]([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-9])[.]([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-9]).

Testing this out:

   '127.0.0.1' match ipv4
1
   '8.8.8.8' match ipv4
1
   '8.8.8' match ipv4
0

But what if we wanted to test whether a string contained an ip address?

57anything=: kleenestar chf a.
58contained=: anything seqf anything seqf~ ]

This would allow matching like:

   'Any address from 127.0.0.1 through 127.255.255.255 should refer to the local machine' match contained ipv4
1

In other words the anything modifier allows matching in the style of grep where lines containing a match are considered to match the expression. But displaying such regular expressions legibly and concisely would require an update to disp. Perhaps:

legibly=: [^:(1 e. ((32{.a.),127}.a.) e.~ ])
concisely=: {{
  v1=. 5!:5<'y'
  v2=. 'a.-.',quote a.-.y
  v3=. 'a.-.',{{5!:5<'y'}}a.-.y
  ;{.(/:#@>)(v1 legibly y);v1;(v3 legibly v2);v3
}}

disp=: <@concisely@(#&a.)"1@(1&|:)@:(((e.~i.)1+0>.>./)@>)^:L.L:1

Chapter 2: Tokenizing

Regular expressions were originally designed as a tool for parsing computer languages. The theory of regular expressions supported the development of tools like lex and yacc. The process proposed for tokenization was: given a regular expression which matches any token, find the longest prefix of the code which matches - that becomes a token, then repeat the process on the remaining suffix of the code.

In other words, something like this:

split=: {{
  S=. s [I =. i.M=.#t ['s t'=. y   NB. unpack expression
  r=.'' [i=.j=.k=. 0               NB. initialize for looping
  while. j<#x do.
    s=. ~.;(a.i.j{x){"1 t#~ I e.s  NB. next state
    j=. j+1                        NB. next character (tentatively)
    if. M e. s do. k=. j end.      NB. remember if matched
    if. (0=#s)+.j=#x do.           NB. state has dried up?
      assert. i~:k                 NB. unparsable x if found no match
      s=. S                        NB. restart state
      r=. r,<i}.k{. x              NB. remember matched text
      i=. j=. k                    NB. restart x after last match
    end.
  end.
  assert. k=#x                     NB. unparsable x if end didn't match
  r                                NB. result
}}

We can do something like this for J, though the result is not quite the same as what ;: produces.

NB. name some character sets
through=: (<. + 4 : 'i.1+|y-x')&.ch
WS=: ' ',TAB             NB. whitespace
DIGIT=: '_','0' through '9'
ALPHA=: 'AZ',&(through/)'az'
ALNUM=: DIGIT,ALPHA
QUOT=: ''''
MARK=: '.:'

NB. expressions for some "tokens"
string=: repf (chf QUOT) seqf (repf chf a.-.QUOT) seqf chf QUOT
whitespace=: repf chf WS

 mark=: chf MARK
 number=: (chf DIGIT) seqf repf chf ALNUM,'.'
 name=: (chf ALPHA) seqf repf chf ALNUM
token1=: (number orf name orf chf a.-.QUOT,WS) seqf optf mark
nuvoc=: '{{' orf&stringf '}}'

tokens=: token1 orf string orf whitespace orf nuvoc

This is close to what we get from the ;: monad, but with some noticable differences.

This tokenizer includes whitespace as a token (which would typically need to be treated like a comment token)

   'avg=: +/ % #' split tokens
┌───┬──┬─┬─┬─┬─┬─┬─┬─┐
avg=: +/ % #
└───┴──┴─┴─┴─┴─┴─┴─┴─┘
   ;:'avg=: +/ % #'
┌───┬──┬─┬─┬─┬─┐
avg=:+/%#
└───┴──┴─┴─┴─┴─┘

Whether this is an advantage or disadvantage depends on the intended use of the result. The result of ;: is perfect for J's parse, but perhaps less desirable when trying to match position in a token stream back to character position in the original sentence.

This tokenizer cannot recognize numeric vectors as a token:

   '9 8 7 6 5 4 3:"2 1 0' split tokens
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬──┬─┬─┬─┬─┬─┬─┐
9 8 7 6 5 4 3:"2 1 0
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴──┴─┴─┴─┴─┴─┴─┘
   ;:'9 8 7 6 5 4 3:"2 1 0'
┌───────────┬──┬─┬─────┐
9 8 7 6 5 43:"2 1 0
└───────────┴──┴─┴─────┘

Numeric vectors would have to be assembled at a later stage, or the tokenizer would have to adopt an "instruction" concept similar to ;: which has an "emit vector" instruction.

This tokenizer cannot support backwards compatible handling of the nuvoc {{ and }} delimiters.

   '3{{ x{{.y }}i.4 5 6' split tokens
┌─┬──┬─┬─┬──┬─┬─┬─┬──┬──┬─┬─┬─┬─┬─┐
3{{ x{{.y }}i.4 5 6
└─┴──┴─┴─┴──┴─┴─┴─┴──┴──┴─┴─┴─┴─┴─┘
   ;:'3{{ x{{.y }}i.4 5 6'
┌─┬──┬─┬─┬──┬─┬──┬──┬─────┐
3{{x{{.y}}i.4 5 6
└─┴──┴─┴─┴──┴─┴──┴──┴─────┘
   3{{ x{{.y }}i.4 5 6
18 19 20 21 22 23

This sort of regular expression based tokenizer could use "token construction instructions" to support J's design for handling of both {{ and }} as well as backwards compatability with existing code which used {{..

But there's another lurking issue which is that the entire block of code wrapped by these delimiters should be treated as a token or "pseudo-token". We might want to put the parsed sequence in a box, or it might be more convenient to expect the token to have its delimiters removed and then re-processed and re-tokenized. The issues here are beyond the current scope of this essay.