From J Wiki
Jump to: navigation, search

Interpreting a Sentence

An Implementation of J

Word Formation
Name Resolution

Word Formation

Words are expressed in the standard ASCII alphabet. Primitive words are spelled with one or two letters; two letter words end with a period or a colon. The entire spelling scheme is shown in the system summary. The verb ;: facilitates exploration of the rhematic rules. Thus:

   ;: 'sum =:+/_6.95*i.3 4'
│sum│=:│+│/│_6.95│*│i.│3 4│

The source code for word formation is in the files w*.c. The process is controlled by the function wordil (word index and length) and the table state. Rows of state correspond to 10 states; columns to 9 character classes. Each table entry is a (new state, function) pair. Starting at state S, a sentence is scanned from left to right one character at a time; the table entry corresponding to the current state and character class is applied.

       New State/Function               States                Character Classes

XN  S   AN  NN  AN  9N  XN  XN  QN     S  Space              X  Other
XI  SI  AI  NI  AI  9I  X   X   QI     X  Other              S  Space
XI  SI  A   A   A   A   X   X   QI     A  Alphanumeric       A  Letters excl. NB
XI  SI  A   A   NB  A   X   X   QI     N  N                  N  The letter N
XI  SI  A   A   A   A   NZ  X   QI     NB NB                 B  The letter B
Z   Z   Z   Z   Z   Z   X   X   Z      NZ NB.                9  Digits and _
XI  SI  9   9   9   9   9   X   QI     9  Numeric            D  . Period
Q   Q   Q   Q   Q   Q   Q   Q   QQ     Q  Quote              C  : Colon
XI  SI  AI  NI  AI  9I  XI  XI  Q      QQ Even Quotes        Q  ' Quote
Z   Z   Z   Z   Z   Z   Z   Z   Z      Z  Trailing Comment

X   S   A   N   B   9   D   C   Q       Function
                                       I  j=:i [ Emit(j,i-1)
                                       N  j=:i

Emit(j,i-1) produces a pair of indices delimiting a word in the string. i is the current index, and j is an internal register; if the current word is a number immediately following a numeric list (one or more numbers), Emit combines their indices to form a single word. At the end of the string, Emit(j,i-1) is executed.

As an example, this process is applied to sum =:+/_6.95*i.3 4, the sentence used above. In the following table, the columns are: index, character in the string, the (current state, character class) pair, the (new state, function) pair, and the action. For example, the first step is step 0, the letter is s, the current (and initial) state is S, and the character class is A. From the table, the entry in row S and column A is AN, meaing the new state is A and the function code is N. The action assigns 0 to j.

 	 	State /	New State /	
i      Char  Char Class	 Function      Action
0	s	S A	   AN	   j=:0
1	u	A A	   A 	
2	m	A A	   A 	
3		A S	   SI	   j=:3 [ Emit(0,2)	  sum
4	=	S X	   XN	   j=:4
5	:	X C	   X 	
6	+	X X	   XI	   j=:6 [ Emit(4,5)	  =:
7	/	X X	   XI	   j=:7 [ Emit(6,6)	  +
8	_	X 9	   9I	   j=:8 [ Emit(7,7)	  /
9	6	9 9	   9 	
10	.	9 D	   9 	
11	9	9 9	   9 	
12	5	9 9	   9 	
13	*	9 X	   XI	   j=:13 [ Emit(8,12)	  _6.95
14	i	X A	   AI	   j=:14 [ Emit(13,13)	  *
15	.	A D	   X 	
16	3	X 9	   9I	   j=:16 [ Emit(14,15)	  i.
17		9 S	   SI	   j=:17 [ Emit(16,16)	  3
18	4	S 9	   9N	   j=:18
19				        Emit(18,18)	  4

Every primitive word has an ID (a unique byte value) defined in file jc.h. The ID for the first 128 ASCII characters are simply the byte values 0 to 127; other IDs are arbitrary assignments in "dictionary order".

   #define CLPAR      '('          /*  40 050 28       */
   #define CRPAR      ')'          /*  41 051 29       */
   #define CSTAR      '*'          /*  42 052 2a       */
   #define CPLUS      '+'          /*  43 053 2b       */
   #define CASGN      '\200'       /* 128 200 80 =.    */
   #define CGASGN     '\201'       /* 129 201 81 =:    */
   #define CFLOOR     '\202'       /* 130 202 82 <.    */
   #define CMIN       '\202'       /* 130 202 82 <.    */
   #define CLE        '\203'       /* 131 203 83 <:    */
   #define CCEIL      '\204'       /* 132 204 84 >.    */
   #define CMAX       '\204'       /* 132 204 84 >.    */
   #define CGE        '\205'       /* 133 205 85 >:    */

Using mnemonics such as CPLUS and CASGN instead of '+' and '\200' makes the source code more readable and more amenable to automatic manipulation.

The 3-row table spell in file ws.c associates letter sequences with IDs. The rows correspond to letters in the range ASCII 32 to 127, those letters inflected by a period, and those letters inflected by a colon; table entries are IDs. Thus:

   static C spell[3][68]={
    '=',     '<',     '>',     '_',     '+',     '*',      ...,

For example, the first column specifies that =. has the ID CASGN (assignment) and =: the ID CGASGN (global assignment).

spell is used by functions spellin and spellout: given a string (e.g. =:), spellin computes the ID (CASGN); given the ID, spellout computes the corresponding string.

Using the information computed by wordil, functions tokens and enqueue transform a string into a list of nouns, verbs, adverbs, conjunctions, etc. The next step is to parse this "tokenized" form of the sentence.


Parsing occurs after word formation and is controlled by function parse and table cases in file p.c. cases is a direct translation of the parse table in Section II E of the dictionary:

   #define AVN   (     ADV+VERB+NOUN)
   #define EDGE  (MARK+ASGN+LPAR)

   PT cases[] = {
    EDGE,      VERB,      NOUN, ANY,       monad,   ..., 1,2, ...,
    EDGE+AVN,  VERB,      VERB, NOUN,      monad,   ..., 2,3, ...,
    EDGE+AVN,  NOUN,      VERB, NOUN,      dyad,    ..., 1,3, ...,
    EDGE+AVN,  VERB+NOUN, ADV,  ANY,       adv,     ..., 1,2, ...,
    EDGE+AVN,  VERB+NOUN, CONJ, VERB+NOUN, conj,    ..., 1,3, ...,
    EDGE+AVN,  VERB,      VERB, VERB,      trident, ..., 1,3, ...,
    EDGE,      CAVN,      CAVN, CAVN,      trident, ..., 1,3, ...,
    EDGE,      CAVN,      CAVN, ANY,       bident,  ..., 1,2, ...,
    NAME+NOUN, ASGN,      CAVN, ANY,       is,      ..., 0,2, ...,
    LPAR,      CAVN,      RPAR, ANY,       punc,    ..., 0,2, ...,

The sentence to be parsed is prefaced with a marker and placed on a queue, and as parsing proceeds words are moved from the right end of the queue onto a stack. The classes of the first four words on the stack are compared to the patterns in columns 0 to 3 of cases. The first row matching in all four columns is selected; the action in column 4 is applied to the words on the stack indicated by the inclusive indices in columns 8 and 9, with the result replacing those words. If none of the rows match, the word at the end of the queue is moved onto the stack by the function move. Scanning for a matching pattern then begins anew. The process terminates when the queue is empty and none of the rules are applicable. At that time, the stack should have exactly two words: the marker and a noun, verb, adverb, or conjunction; anything else signals syntax error.

This parsing method was first described in Iverson [1983]. The parse table is a compact representation of a large amount of information; it has guided both the evolution of the language and its implementation. The following example illustrates parsing on the sentence ((i.#y)=i.~y)#y where y=:'abc'. (§ denotes the marker.)

      Queue               Stack     Action       Comment

§((i.#y)=i.~y)#y                               original sentence
§((i.#y)=i.~y)#            'aba'   13 move
§((i.#y)=i.~y)            #'aba'   13 move
§((i.#y)=i.~y            )#'aba'   13 move
§((i.#y)=i.~        'aba')#'aba'   13 move

§((i.#y)=i.        ~'aba')#'aba'   13 move
§((i.#y)=        i.~'aba')#'aba'   13 move
§((i.#y)        =i.~'aba')#'aba'   13 move
§((i.#y)        =v0 'aba')#'aba'    3 adv      v0=: i.~
§((i.#y        )=v0 'aba')#'aba'   13 move

§((i.#    'aba')=v0 'aba')#'aba'   13 move
§((i.    #'aba')=v0 'aba')#'aba'   13 move
§((    i.#'aba')=v0 'aba')#'aba'   13 move
§(    (i.#'aba')=v0 'aba')#'aba'   13 move
§(         (i.3)=v0 'aba')#'aba'    1 monad    3 -: #'aba'

§(       (0 1 2)=v0 'aba')#'aba'    0 monad    0 1 2 -: i.3
§(         0 1 2=v0 'aba')#'aba'   12 punc
§(            0 1 2=0 1 0)#'aba'    1 monad    0 1 0 -: v0 'aba'
§            (0 1 2=0 1 0)#'aba'   13 move
§                  (1 1 0)#'aba'    2 dyad     1 1 0 -: 0 1 2=0 1 0

§                    1 1 0#'aba'   12 punc
                    §1 1 0#'aba'   13 move
                           §'ab'    2 dyad


A train is an isolated phrase not interpreted by the parsing rules pertaining to verbs, adverbs, and conjunctions, and (as a matter of language design) may be assigned any meaning whatsoever. Iverson and McDonnell [1989] defined a train of three verbs as a fork and a train of two verbs as a hook. That is, if f, g, and h are verbs, then so are (f g h) and (g h), and:

         Fork                      Hook
     g          g              g          g
    / \        / \            / \        / \
   f   h      f   h          y   h      x   h
   |   |     /\   /\             |          |
   y   y    x  y x  y            y          y

Parsing rules 5, 6, and 7 deal with trains. (See Parsing.) A consequence of the rules is that a train of verbs is resolved by repeated forming a fork from the rightmost three verbs, with a final hook if the train is of even length. Likewise, a train of adverbs and conjunctions is assigned a meaning, and is resolved by repeatedly forming a group from the leftmost three adverbs or conjunctions, with a final group of two if the train is of even length. Trains are implemented by functions and variables of file cf.c. The main functions are folk and hook. (fork conflicts with UNIX usage.)

Name Resolution

During parsing, words are moved from the queue to the stack. Suppose a name xyz is being moved. If xyz is immediately to the left of a copula, it (as a name) is put on the stack. Otherwise, if xyz denotes a noun, that noun is put on the stack; if xyz denotes a verb, adverb, or conjunction, 'xyz'~ is put on the stack, to be evaluated when the verb, adverb, or conjunction is applied.

Names and their assigned values are stored in symbol tables. A symbol table is an array of type SYMB whose atoms are pairs (name,value). Functions and variables in the files s*.c work with symbols tables. In particular, symbis(a,w,symb) assigns the name a to w in the symbol table symb, and symbrd(w) "reads" the value of the name w.

Next • Previous • Index • Table of Contents