Alex Schroeder/Sequential State Machine Introduction

From J Wiki
Jump to navigation Jump to search

I wanted to read integers from files, no matter whether they were separated by newlines, commas, or anything else. I needed a parser that extracted numbers from a string and I didn't understand the Sequential Machine example.

First, create an array mapping every input byte to a code. In this case, we create an array with 256 zeroes, and then we set the code to 1 for every digit.

   m=: 256$0
   m=: 1 (a.i.'0123456789')}m

The result looks something like this: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... – as you can see there are a bunch of ones in there. :)

Next we need a state transition table. It has 3 dimensions.

  1. the first axis is the current state we're in: 0 means we're skipping stuff, 1 means we're reading a number – we don't need any more states
  2. the second axis is the current code we're looking at: 0 means we're looking at some byte that's not a digit, 1 means we're looking at a digit (this is where m is used)
  3. the third axis is simply two integers: the first one is the new state (we only have 0 or 1 to choose from), and the second one is what to do (the output code: 0 means doing nothing, 1 means start a word, 3 means emit a word and reset)

OK, so here's how I built it:

   NB.        state 0  state 1
   s=: 2 2 2$ 0 0 1 1  0 3 1 0

Or graphically:

   <"1 s
┌───┬───┐
│0 0│1 1│
├───┼───┤
│0 3│1 0│
└───┴───┘

Rows are the current state, columns are the input code we're looking at.

Can you see it? I started thinking about it like this:

  1. given a string such as '42 16', starting at index 0, in state 0, with j being the beginning of a word pointing at -1 (in other words, no word is started) ...
  2. we look at '4' and get the new state from our m: 1 (as we're looking at a digit)
  3. given this information, we need to look at the cell [1 1] (current state 0, input code 1)
  4. the new state is set to 1, and output code 1 means we begin a new word at the current index 0
  5. we look at '2' and get the new state from our m: still 1
  6. now we're looking at the cell [1 0] (current state 1, input code 1)
  7. the new state continues set to 1, and output code 0 means nothing happens
  8. we look at the space character and get the new state from our m: 0
  9. now we're looking at the cell [0 3] (current state 1, input code 0)
  10. the new state changes to 0, and output code 3 means we emit a word, starting from index 0 (that's when we started it) up to the current position ("42") and then we set j to -1 again

And so on.

The output code 3 means we emit a word and we do *not* begin a new word. This is important at the end: we could have used the output code 2 but that means we emit a word and begin a new one (j remains set), and the result is that any trailing garbage at the end is turned into a word.

And now the parser works for everything:

   (0;s;m) ;: '42x16xx1'
┌──┬──┬─┐
│42│16│1│
└──┴──┴─┘

You might be wondering about the 0 at the beginning of that parser definition. The answer is that this tells the parser what to do with the words it finds. 0 means that the words end up in boxes. 2 means the word index and length, for example:

   (2;s;m) ;: '42x16xx1'
0 2
3 2
7 1