User:Raul Miller/ParsingJ

J's parser is simple. Its reference documentation occupies one page.

That said, note that its word formation rules occupy another reference page.

And there is another issue which requires additional research to comprehend: the grammatical properties of a word or phrase.

The grammatical properties of a 'literal' or of a number are trivial -- those are nouns. However, the grammar of a user-defined word depends on the definition of the word.

mean=: +/ % #

sum=: +/

SUM=: +/1 2 3

Here, sum is a verb and SUM is a noun. In general, you can't know the grammar of a word until you know its definition. For words defined in the dictionary, the definitions are fixed, and you can memorize many of them in a few weeks of use. But that does not help for words that other people have defined.
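Once the definitions are loaded, the interpreter can report the grammar of a name. A small sketch using the foreign 4!:0, which gives the name class of each boxed name (0 noun, 1 adverb, 2 conjunction, 3 verb):

```j
   mean=: +/ % #
   sum=: +/
   SUM=: +/1 2 3
   NB. name class of each word: 0 noun, 1 adverb, 2 conjunction, 3 verb
   4!:0 ;:'mean sum SUM'
3 3 0
```

This only helps at the keyboard, of course. It does nothing for someone reading the code on paper or on a web page, which is where a naming convention earns its keep.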

So, I am advocating a naming discipline to help work around this issue: use capitalization to show the grammar of a word.

Specifically:

NOUNS should be ALL_CAPS.

verbs should be all_lower_case.

Adverbs and Conjunctions should be Capitalized, with the beginning of any embedded words marked by a capital letter.

For various reasons, these conventions should probably apply to long names only up to the first embedded underscore.

EDIT: ALLCAPS can be a bit much, and an interesting distinction between name types is that verb name resolution is deferred longer than name resolution for other name types. So perhaps it is best to just use an initial lower case letter for verb names and an initial capital letter for other name types. And, of course, existing code bases mean that we need to tolerate exceptions - so this is mostly a hint for when you need inspiration for a name.
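The difference in name resolution can be seen in a short session. This is only an illustrative sketch; the names f, g, s and A are made up for the example (A uses the capitalized style for an adverb):

```j
   NB. f, g, s and A are hypothetical names used only for this illustration
   g=: +:
   f=: g          NB. f refers to the name g, not to its current value
   g=: -:
   f 3            NB. g is looked up when f runs, so the new definition is used
1.5
   A=: /
   s=: + A        NB. the adverb name A is resolved right away, so s is +/
   A=: \
   s 1 2 3        NB. still a sum: redefining A did not change s
6
```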

Why?

The rules for parsing nouns and verbs are fixed. However, adverbs and conjunctions act at parse time, so you need to understand not only the grammar of the word but also the grammar of its result. This is an unavoidable study requirement. So it does not make sense, to me, to require the name to distinguish between the various cases. Note that there are a minimum of eight distinct cases:

Adverb where the result of the adverb phrase is a noun, for example: a: 1 :'9'

Adverb where the result of the adverb phrase is a verb, for example: a: 1 :'9[m[y'

Adverb where the result of the adverb phrase is an adverb, for example: a: 1 :'\'

Adverb where the result of the adverb phrase is a conjunction, for example: a: 1 :':'

Conjunction where the result of the conjunction phrase is a noun, for example: a: 2 :'9' a:

Conjunction where the result of the conjunction phrase is a verb, for example: a: 2 :'9[m[y' a:

Conjunction where the result of the conjunction phrase is an adverb, for example: a: 2 :'\' a:

Conjunction where the result of the conjunction phrase is a conjunction, for example: a: 2 :':' a:
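Each of these eight cases can be checked by assigning the phrase to a name and asking for its class with 4!:0 (0 noun, 1 adverb, 2 conjunction, 3 verb). The names r1 through r8 are placeholders introduced for this sketch:

```j
   NB. r1..r8 are hypothetical names, one per case listed above
   r1=: a: 1 :'9'
   r2=: a: 1 :'9[m[y'
   r3=: a: 1 :'\'
   r4=: a: 1 :':'
   r5=: a: 2 :'9' a:
   r6=: a: 2 :'9[m[y' a:
   r7=: a: 2 :'\' a:
   r8=: a: 2 :':' a:
   4!:0 ;:'r1 r2 r3 r4 r5 r6 r7 r8'
0 3 1 2 0 3 1 2
```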

And these cases do not stop here. For example I could give you a conjunction which gives you an adverb result when it is used and that resulting adverb might in turn give you a conjunction result when it was used. And of course the same adverb or conjunction might give different kinds of results depending on how it is used (: is an example of this).
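Both points can be sketched briefly. Chain, Adv, Conj, V, A and C below are hypothetical names made up for the illustration:

```j
   NB. all names here are hypothetical, chosen only for this sketch
   Chain=: 2 :'1 :'':'''    NB. a conjunction whose result is an adverb
   Adv=: a: Chain a:        NB. the body of that adverb is : so its result is a conjunction
   Conj=: a: Adv
   4!:0 ;:'Chain Adv Conj'
2 1 2
   V=: 3 :'y'               NB. the conjunction : yields a verb,
   A=: 1 :'u'               NB. an adverb,
   C=: 2 :'u'               NB. or a conjunction, depending on its left argument
   4!:0 ;:'V A C'
3 1 2
```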

So, anyways, some words you just have to study before you can understand their grammar. And capitalization is as good as anything for marking this class of words.

And, once we have this convention, we can then examine existing bodies of work, to see if they follow the convention or not.

That said, note that local names might violate this convention (x and y are not capitalized, for example). But they might also follow this convention. Rules of thumb can be useful if they are used frequently enough to be a useful hint, especially when people have mostly memorized the exceptions to the rules.

Here's an example of an implementation which attempts to categorize names based on how well they follow this convention:

require'strings'

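NB. note that this implementation does not ignore characters after an underscore
NB. y is the boxed name of a locale; nl__y lists that locale's names by class
NB. (0 noun, 1 adverb, 2 conjunction, 3 verb)
check_names=: 3 :0
  NB. nouns that are not entirely upper case
  'NOUN' warn_names (#~ (~: toupper&.>)) nl__y 0
  NB. verbs that are not entirely lower case
  'verb' warn_names (#~ (~: tolower&.>)) nl__y 3
  NB. adverbs and conjunctions with a lower case initial, or entirely upper case
  'Other' warn_names (#~ (= tolower)&{.&> +. (= toupper&.>)) nl__y 1 2
)

NB. x is the category label, y is a boxed list of offending names
warn_names=: 4 :0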
  if.0=#y do.return.end.
  smoutput LF,'Warning, these ''',x,''' words should be recapitalized'
  smoutput names y
)

check_names takes the boxed name of a locale as an argument, and inspects all names in that locale.
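The relaxed rule from the EDIT note (a lower case initial for verbs, a capital initial for everything else) needs only the first letter of each name. A minimal sketch, separate from check_names; first_upper and follows_rule are hypothetical helpers, and twice is a deliberately miscapitalized adverb:

```j
   NB. first_upper, follows_rule, Each, twice and NMS are hypothetical names
   first_upper=: (e.&'ABCDEFGHIJKLMNOPQRSTUVWXYZ')@{.&>
   follows_rule=: 4 :'(3 ~: x) = first_upper y'   NB. x: name classes, y: boxed names
   mean=: +/ % #
   SUM=: +/1 2 3
   Each=: &.>
   twice=: 1 :'u@u'
   NMS=: ;:'mean SUM Each twice'
   (4!:0 NMS) follows_rule NMS
1 1 1 0
```

The result has a 1 for each name that follows the relaxed rule, so only twice is flagged here.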

And here's an example of what check_names shows for the z locale:

   check_names <'z'
Warning, these 'NOUN' words should be recapitalized
Debug       adverb      conjunction dyad        monad
noun        verb

Warning, these 'verb' words should be recapitalized
AND    Endian Note   OR     XOR    rxE    toCRLF toHOST toJ

Warning, these 'Other' words should be recapitalized
bind        cuts        def         define      each
every       fapplylines inv         inverse     items
leaf        on          rows        rxapply     rxmerge
table

So, for example, instead of

name=: verb define
 ...
)

we might instead have

name=: VERB Define
 ...
)

And the reader would be expected to study the definition of Define before understanding the grammar of that code. Or, alternatively:

name=: VERB :0
 ...
)

Of course, knowing the result of a :0 word still requires knowing the value of its left argument. In this case, the word VERB gives a very strong hint, so that's probably ok. But old timers might still habitually use 3 :0 because that reduces the ambiguity for some readers, and is fast to type.
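VERB and Define are not existing standard library words. A minimal sketch of how they might be defined, assuming they simply mirror the lower case verb, def and define of the z locale (Def and double are likewise made up for this example):

```j
   NB. hypothetical capitalized counterparts of verb, def and define
   VERB=: 3
   Def=: :
   Define=: : 0
   4!:0 ;:'VERB Def Define'
0 2 1
   double=: VERB Def 'y + y'
   double 4
8
```

Here the capitalization alone tells the reader that VERB is a noun and that Def and Define are words whose definitions need to be studied.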

Also, with the 'z' locale, there would be backwards compatibility issues if the old words were eliminated. But, hypothetically speaking, new words could be introduced and used while leaving the deprecated old words so that older code (which might exist on the web, or in books) continues to work.

That said, note that this sort of thinking is mostly useful only for new code. Existing code (including the pervasive 'z' locale) has requirements for backwards compatibility and may be widely propagated, and should [as a general rule] be preserved rather than made obsolete.