Vocabulary/semico

From J Wiki
Jump to: navigation, search

>> <<   Down to: Dyad   Back to: Vocabulary Thru to: Dictionary

;: y Words

Rank 1 -- operates on lists of y, producing a list of variable length for each one -- WHY IS THIS IMPORTANT?



Partitions a string into boxed words according to J's rules for word formation.

In other programming languages, words are called tokens.

   ;: '<;._1'
+-+--+--+
|<|;.|_1|
+-+--+--+

Note that an entire numeric list or a comment is a single word.

This matters if you use (;:) to tokenize some syntax resembling J but not J

   y=: 'Fine, easy as 1 2 3?'
   ;: y
+----+-+----+--+-----+-+
|Fine|,|easy|as|1 2 3|?|
+----+-+----+--+-----+-+

   z=: ' ''s t''=: 3 5   NB. multiple assignment'
   ;: z
+-----+--+---+-----------------------+
|'s t'|=:|3 5|NB. multiple assignment|
+-----+--+---+-----------------------+

Compare this with the action of verb cutopen

cutopen is a

  • Standard Library word(s) residing in the 'z'-locale.
  • Defined in the factory script stdlib.ijs.
  • View definition(s) by entering in the J session:  open'stdlib'

   cutopen y
+-----+----+--+-+-+--+
|Fine,|easy|as|1|2|3?|
+-----+----+--+-+-+--+

Common uses

1. Partition an open list into boxed strings

   ;: 'alpha bravo charlie'
+-----+-----+-------+
|alpha|bravo|charlie|
+-----+-----+-------+

Useful if you are sure J's word rules will work for your y.

Not recommended for general parsing, because ;:y signals a J error if the string is ill-formed according to J's rules

   ]z =: 'Eugene O''Neill'
Eugene O'Neill
   ;: z
|open quote
|   Eugene O'Neill


x ;: y Sequential Machine

Rank Infinity -- operates on x and y as a whole -- WHY IS THIS IMPORTANT?



Partitions a string (or a more general kind of list) y according to the rules of a given finite-state (Mealy) machine defined by noun x.


Common uses

1. Tokenize strings of code following some other syntax than J. The x-argument defines the required syntax.

Processing a string using (x ;: y) is typically many times faster than using other primitives.

The entry in the J Dictionary for ;: has a fully-worked example.


Details

Overview

A brief overview is as follows, indicating the key terms

  • The x argument provides the machine description, which comprises
    • the output style which controls the information added to the output array
    • the row/action table, also called the state table or transition table, which indicates, for each state number and input, what the machine should output and what state it should go to next
    • the input classes which are used to convert each item of y to an input code
    • the initial values of variables used for each iteration
      • input pointer i
      • word pointer j
      • row number (also called a state number) r
      • final input d
  • The output array is initialized to empty.
  • Each item of input is converted into a column number.
  • The column numbers are processed, one per iteration:
    • the next column number is supplied to the row/action table: the row number and column number specify an item of the table, which is a 2-element vector giving the new row number and an action code.
    • The action is performed and the row number is updated. Some actions add rows to the output array
  • Execution continues until a 'quit' action is encountered or all the column numbers have been supplied.
  • the output array, which contains the accumulated results of the actions performed, becomes the result of the verb.

The y argument: sequence of inputs

The y argument is the sequence of inputs to the machine. y will be converted to a list of numbers using the m portion of the x argument.

Each item of y produces one input to the machine. The inputs become column indexes to the state table.

The x argument: description of the machine

The machine description, the x argument to (;: y), is a list of boxes f;s;m;ijrd . m and ijrd may be omitted.

m, controlling conversion of the y argument

m controls the conversion of the items of y to column numbers. m may be:

  • a list of boxes. The column number produced for an item of y is the index of the first box of m whose contents have an item equal to the item of y, or #m if there is no such box. Formally, the column numbers are (y (1 i.~ (e. >)"_ 0)"_1 _ m)
  • (only if y is characters) a numeric list containing one column number for each ASCII character code. Each character of y is converted to the corresponding number of m.Formally, the column numbers are (a. i. y) { m).
  • (only if y is numeric) m may be empty or omitted, and y specifies the column numbers directly.
ijrd, giving initial values for iteration variables

ijrd gives the initial values of 4 variables used by the sequential machine

  • The input pointer i (default 0) is the index in y of the next item to be processed, and is incremented by 1 at the end of each iteration.
  • The word pointer j (default _1) is the index in y of the first item in the current word. When the action code calls for producing output, the output will start with item j. When j is _1, there is no current word.
  • The row number r (default 0) is used to index the row/action table.
  • The final column d (default _1) is an extra column number that will be processed after the last item of y has been processed; it is the column for 'end-of-input'. If d is negative, no end-of-input column is processed.
s, the row/action table (aka state table or transition table)

s gives the row/action table. This table has as many rows as needed to encode all the states of the sequential machine, and as many columns as there are possible columns numbers of mapped input. Each 1-cell of s is a 2-element list, containing in order:

  • the next state, the row number for the next iteration.
  • the action code, a number from 0 to 6 indicating the action to be performed, with meanings as follows:
Action code Addition to output array Change to word pointer j
(after any addition to the output)
0 none none
1 none j =. i
2 add single word j =. i
3 add single word j =. _1
4 add multiple words j =. i
5 add multiple words j =. _1
6 stop--no further iterations are performed none
f, indicating what the result contains

When the action code indicates that something should be added to the output, the value that is appended depends on the f parameter, which came from the x argument to x ;: y) and the values of the iteration variables. The values appended for different values of f are (r=row number, c=column number, j=word pointer, i=input pointer):


f Value appended Description
0 the items of y between j and i-1, boxed Boxed word of y
1 the items of y between j and i-1 Unboxed word of y
2 j , i-j Index and length of word
3 c + r * number of columns in s
This can be written in J as (}:$s) #: r,c
Coded row and column
4 j , (i-j) , c + r * number of columns in s
This can be written in J as j , (i-j), (}:$s) #: r,c
Index and length of word, and coded row and column
5 i , j, r , c , (<r,c){s Input pointer, word pointer, row, column, new row, action.
In this case one row is added to the output for each iteration, regardless of the action codes.

The indicated data is appended to the output array whenever an 'add single word' action is executed.

The 'add multiple words' action also adds the data, but coalesces multiple occurrences into a single added value. 'Add multiple words' actions are delayed, and consecutive ones from the same state are combined into a single word that is emitted only when the input string is exhausted, or a word is emitted from a different state.

Running the Machine

Execution of the state machine follows this program:

NB. On input the variables i, j, r, and d have been set, the state table s and the
NB. output format f have been extracted, the argument y has been converted into a
NB. list of column numbers n, and the result has been initialized to a zero-item array
NB. where the items have the proper shape for f.
while. i <: #n do.
  if. i = #n do.
    if. d >: 0 do. 'newrow action' =. (<r,c =. d) { s
    elseif. j = _1 do. break.
    elseif. f = 5 do. break.  NB. Don't output final flush 
    elseif. do. 'newrow action' =. 0 5
    end.
  else. 'newrow action' =. (<r,c =. i { n) { s
  end.
  assert. newrow < #s
  if. f = 5 do. result =. result , i, j, r, c, newrow, action
  select. action
  case. 0 do.
  case. 6 do. break.
  case. 1;2;3;4;5 do.
    if. action > 1 do.
      assert. j >: 0
      if. f ~: 5 do.
        select. f
        case. 0 do. newdata =. < j }. (i-1) {. y
        case. 1 do. newdata =. j }. (i-1) {. y
        case. 2 do. newdata =. j , i-j
        case. 3 do. newdata =. (}:$s) #: r,c
        case. 4 do. newdata =. j , (i-j) , (}:$s) #: r,c
        case. do.
          'Invalid output type' 13!:8 (1)
        end.
        if. action <: 3 do. result =. result , newdata
        else. result =. newdata (<:#result)} result
        end.
      else.
    end.
    j =. (action e. 1 2 4) { _1,i
    end.
  case. do.
    'Invalid action' 13!:8 (1)
  end.
  r =. newrow
  i =. i + 1
end.

Notes:

  • If no 'stop' action has been encountered, then after the last item of y has been processed, the end-of-input action is performed: if d is nonnegative, one last iteration is performed with d as the column number; otherwise, one final 'add multiple words' action is performed if j is not _1.
  • Executing an 'add' action when j is _1 produces a domain error.

Example Of Use

To illustrate the use of (x ;: y) we will build a machine to recognize C-style hex constants in character input, where a hex constant is '0x' followed by any positive number of hexadecimal digits.

We see that the input characters are in 4 classes

  • the character 0
  • the character x
  • the hexadecimal characters 0123456789abcdefABCDEF
  • all other characters.

We will assign these column numbers 3, 2, 1, and 0 respectively. The conversion control m can be generated by the statements

   m =. a. e. '0x123456789abcdefABCDEF'
   m =. m + a. e. '0x'
   m =. m + a. e. '0'

and can be verified by

   (a. i. '0x2aq') { m
3 2 1 1 0

We can build the state table s now. There will be 4 rows. First, we wait for 0; then we expect x; then we expect a hexadecimal digit, signaling start-of-word if we see it; then we wait for a non-hexadecimal-digit, and output the word when we get one. The state table will look like this:

Row number Description of state Newrow,Action
for input class 0
(other)
Newrow,Action
for input class 1
(hex digit)
Newrow,Action
for input class 2
(x)
Newrow,Action
for input class 3
(0)
0 Waiting for 0 0 0 0 0 0 0 1 1
1 Expecting x 0 0 0 0 2 0 0 0
2 Expecting first digit 0 0 3 0 0 0 3 0
3 Expecting nondigit or end 0 3 3 0 0 3 3 0

This state table is generated by:

   s =. 1 4 2 $ 0 0 0 0 0 0 1 1
   s =. s , 4 2 $ 0 0 0 0 2 0 0 0
   s =. s , 4 2 $ 0 0 3 0 0 0 3 0
   s =. s , 4 2 $ 0 3 3 0 0 3 3 0

and we use it with

   (0;s;m;0 _1 0 0) ;: 'qqq0x30x30x40x0xxxx'
+----+----+
|0x30|0x40|
+----+----+
   (0;s;m;0 _1 0 0) ;: 'qqq0x30x30x40x0x34a'
+----+----+-----+
|0x30|0x40|0x34a|
+----+----+-----+

Note in the second example that 0x34a was emitted by the end-of-input action.

More Information

For detailed development of elementary examples, see: Vocabulary/SequentialMachineNotes and Generate useful tokens from input.

Also explore the "Sequential Machine" and "Huffman Code" labs from the J session, in Help > Studio > Labs...

in J6, the labs are found at Studio > Labs...
in J8, the labs are found at Help > Studio > Labs...

The entry in the J Dictionary for (;:) also contains useful examples to build the noun x to define your own Sequential Machine, including a complete definition equivalent to the monadic use of (;:) to tokenize a string of J code.