User:Tom McGuire/AITextProcessing


Introduction

AI as it is practiced in 2024 involves setting up large neural networks in various configurations and simulating them on a computer. At their base level, the simulation methods revolve around matrix multiplication. This would seem obvious to anyone who has played with a three-layer backpropagation network: inputs are fed into a layer of neuron functions, the outputs of each layer are fed into the next layer, and the output layer produces the result. Every input goes to every neuron in the input layer, and every output goes to every neuron in the next layer. A weight is applied to each input of a neuron, so a layer can be represented by a matrix of weights. The neurons are usually chosen to all implement the same output function, which simplifies coding.
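
To make the matrix picture concrete, here is a minimal sketch of one layer's forward pass in J (the sizes, the random weights, and the sigmoid activation are illustrative assumptions, not taken from any particular network):

   NB. one fully connected layer: a weight matrix times an input vector
   sigmoid =: [: % 1 + [: ^ -     NB. 1 % 1 + e^-y, applied elementwise
   W =: 0.1 * 4 3 ?@$ 0           NB. 4 neurons, each weighting 3 inputs
   in =: 3 ?@$ 0                  NB. a 3-element input vector
   layer =: sigmoid@(+/ . *)      NB. dyad: weights layer input
   W layer in                     NB. 4 outputs, fed to the next layer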

Tensor operations get mentioned when looking at libraries such as PyTorch, TensorFlow, llama.cpp, etc., but so far in my investigation ordinary matrix manipulation and multiplication is all that is needed. The major issue becomes the scale of these operations for use in AI. Words are turned into vectors, vectors are collected into matrices, and neural networks are matrices that must align on the dimension of the word vectors. The sheer size of the data matrices being pushed around in memory grows quite large quite fast. To speed up processing of these large matrices the AI industry has come to rely on graphics cards with large numbers of GPU cores. The library layers that sit on top of the graphics cards are essentially GPU kernel code that gets compiled into a set of standing operations performed in parallel, thereby speeding up the task of calculating for AI.
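
As a toy illustration of how those dimensions must line up (the sizes here are made up): the vocabulary becomes a matrix of word vectors, a sentence selects rows from it, and the first weight matrix must agree with the word-vector dimension:

   NB. toy scale: 5-word vocabulary, 4-dimensional word vectors
   emb =: 5 4 ?@$ 0     NB. embedding table, one row per vocabulary word
   sent =: 2 0 3        NB. token ids produced by a tokenizer
   X =: sent { emb      NB. 3 4 matrix: one word vector per row
   W1 =: 4 8 ?@$ 0      NB. first layer aligned on the 4-dim word vectors
   $ X +/ . * W1        NB. 3 8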

You would think that J or other array programming languages would be the obvious choice in this environment. Had AI been able to be built on top of the CPUs and memory that come in a standard but high-end computer, it probably would be. But desktop computers have always had a split personality in the types of operations they perform. Graphics are produced by sending operations to a driver program that interfaces to a graphics card, and that card's output goes to a screen for humans to interact with. This controller/peripheral paradigm splits functionality. J (and most array languages) operate on the CPU to perform general-purpose calculations. GPU programming became the domain of graphics operations and of interfacing with a C/C++ driver program that operates the GPU. Further separating out the graphics space is the fact that for gaming and ray tracing you need to program the GPU directly. There is only one way to do this: through a shading language that gets compiled into instructions the graphics card can understand. This fragments again because graphics cards from different manufacturers support different shading languages.

My belief is that had the graphics card industry not been so fragmented, an array language might have been able to provide direct language support for graphics cards. Instead, shading languages loosely resemble the C language, each major manufacturer has their own version, and drivers are written in C. Gaming engines fragmented this further: each engine is written in C or C++ and then provides its own scripting language or bindings to some other popular language.

Now AI comes along, needs the use of the graphics card, and has gone the same route as the gaming engine field. The frameworks write code that sits on top of the graphics card (mainly NVIDIA cards), then provide a scripting language interface in their favorite language. Due to its popularity and its ability to adapt via its foreign function interface, Python has become the high-level language of the AI space. Some of that was driven by NumPy, an array-language implementation which adapts itself into the Python object space.

Basic Tokenizer

The BasicTokenizer just gets words from a string and separates them out into their own strings. In J, boxing is probably the easiest representation. At the April 2024 NYCJUG meeting, Devon McCormick presented a Beginner's Regatta that was the beginning of this process. In summary, he wanted to reduce the whitespace between words to a single TAB character between each word. He had tried a couple of J idioms combined to make this happen, and then he came to the words verb ';:'.


      spaces2TAB=: 13 : '}:;TAB,~&.>;:y'
      spaces2TAB
   [: }: [: ; (a.{~9) ,~&.> ;:
   
      samp=. 'Here  is some    text',(' ',TAB,' '),'with',TAB,'different',('  ',TAB),'kinds of whitespace    embedded in it.   '
      ;:samp
   ┌────┬──┬────┬────┬────┬─────────┬─────┬──┬──────────┬────────┬──┬───┐
   │Here│is│some│text│with│different│kinds│of│whitespace│embedded│in│it.│
   └────┴──┴────┴────┴────┴─────────┴─────┴──┴──────────┴────────┴──┴───┘
      TAB ,~&.> ;:samp
   ┌─────┬───┬─────┬─────┬─────┬──────────┬──────┬───┬───────────┬─────────┬───┬────┐
   │Here │is │some │text │with │different │kinds │of │whitespace │embedded │in │it. │
   └─────┴───┴─────┴─────┴─────┴──────────┴──────┴───┴───────────┴─────────┴───┴────┘
   
   NB. Now Devon's function:
      spaces2TAB samp
   Here	is	some	text	with	different	kinds	of	whitespace	embedded	in	it.

A quick aside: Devon uses the each operator to place a TAB character at the end of every boxed string, then drops the last TAB of the sentence. I have used the insert operator ('/') in the past for this sort of thing, albeit in a kludgy way. In J, the insert operator inserts a verb between the items and executes the resulting form. Using a tacit or anonymous verb you can hard-code a boxed noun in between each boxed string coming out of the words verb (';:'). I bring this up because, while this works, it seems that there must be a more straightforward way to do this in J. However, I haven't been able to think of it (though see the note after the examples below).

      NB. using Devon's sample with an initial TAB
      samp2=. TAB,'Line with    initial ',TAB,'  tab'
      samp2
   	Line with    initial 	  tab
   
      ;:samp2
   ┌────┬────┬───────┬───┐
   │Line│with│initial│tab│
   └────┴────┴───────┴───┘
   
      NB. some ways to insert a noun
      {{x,(<TAB),y}}/;: samp2
   ┌────┬─┬────┬─┬───────┬─┬───┐
   │Line│ │with│ │initial│ │tab│
   └────┴─┴────┴─┴───────┴─┴───┘
      ([ , (<TAB) , ])/;: samp2
   ┌────┬─┬────┬─┬───────┬─┬───┐
   │Line│ │with│ │initial│ │tab│
   └────┴─┴────┴─┴───────┴─┴───┘
      ([ , (<' anystring ') , ])/;: samp2
   ┌────┬───────────┬────┬───────────┬───────┬───────────┬───┐
   │Line│ anystring │with│ anystring │initial│ anystring │tab│
   └────┴───────────┴────┴───────────┴───────┴───────────┴───┘
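
One candidate for a more straightforward spelling: if I remember the standard library correctly, the joinstring verb from the 'strings' script interleaves a delimiter between boxed strings and razes the result, which is exactly this pattern (a sketch, assuming joinstring is loaded in your profile):

      TAB joinstring ;: samp2
   Line	with	initial	tab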

Basic tokenization

Devon pointed out back in April the limitations of the monadic ';:' verb: "the state machine defaults to the J parser so any sequence of text is broken apart into distinct J tokens; in the case of most text, this mostly corresponds to words." Specifically, it uses the lexical analysis of J to separate out tokens. If you want to separate English punctuation into its own element, then ';:' may or may not do it, depending on whether the character is a recognized operator or not.

      NB. ;: doesn't quite 'cut' it as a basic tokenizer - pretty close though
      ;: 'However, if I put in some text with J-syntax characters. Then: you get the following:'
   ┌───────┬─┬──┬─┬───┬──┬────┬────┬────┬─┬─┬──────┬───────────┬─────┬───┬───┬───┬──────────┐
   │However│,│if│I│put│in│some│text│with│J│-│syntax│characters.│Then:│you│get│the│following:│
   └───────┴─┴──┴─┴───┴──┴────┴────┴────┴─┴─┴──────┴───────────┴─────┴───┴───┴───┴──────────┘

Devon's example inspired me to bite the bullet and look into the dyadic use of ';:', which is the full sequential machine. In basic tokenizing you need to settle on a domain for the alphabet of the words you will recognize. Since you may want to accept more than one language, you need more than the ability to recognize just the English alphabet. Whitespace is limited to a handful of characters, and punctuation to a couple of handfuls more. It turns out we can get pretty far with a Basic Tokenizer in J by just assuming that anything that isn't in our punctuation set and isn't in the whitespace set is in fact a letter of interest in our alphabet.

   NB. state machine to box on whitespace
   NB. punctuation string
   punc =: '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~'
   
   NB. input character classes
   mBT =: 256$0                       NB. X other
   mBT =: 1 (9,10,13,a.i.' ')    }mBT NB. S whitespace
   mBT =: 2 (a. i. punc)         }mBT NB. P punctuation
   mBT =: 3 (a. i. '1234567890') }mBT NB. N numbers
   NB. 4   F End of input
   
   NB. State transition table
      sBT=: _2]\"1 }.".;._2 (0 : 0)
   ' X    S    P    N   F  ']0
    1 1  2 0  3 1  4 1 5 0  NB. 0  initial
    1 0  2 2  3 2  4 2 5 2  NB. 1  other
    1 1  2 1  3 1  4 1 5 1  NB. 2  whitespace
    1 2  2 2  3 2  4 2 5 2  NB. 3  punctuation
    1 2  2 2  3 2  4 0 5 2  NB. 4  numbers
    5 6  5 6  5 6  5 6 5 6  NB. 5  final
   )
   
      BT =: (0;sBT;mBT;0 _1 0 4) & ;:
   
      BT 'However, if I put in some text with J-syntax characters. Then: you get the following:'
   ┌───────┬─┬──┬─┬───┬──┬────┬────┬────┬─┬─┬──────┬──────────┬─┬────┬─┬───┬───┬───┬─────────┬─┐
   │However│,│if│I│put│in│some│text│with│J│-│syntax│characters│.│Then│:│you│get│the│following│:│
   └───────┴─┴──┴─┴───┴──┴────┴────┴────┴─┴─┴──────┴──────────┴─┴────┴─┴───┴───┴───┴─────────┴─┘

Just to bring this back around full circle, we can use this sequential machine implementation to solve Devon's original problem:

      BT samp
   ┌────┬──┬────┬────┬────┬─────────┬─────┬──┬──────────┬────────┬──┬──┬─┐
   │Here│is│some│text│with│different│kinds│of│whitespace│embedded│in│it│.│
   └────┴──┴────┴────┴────┴─────────┴─────┴──┴──────────┴────────┴──┴──┴─┘
      BT samp2
   ┌────┬────┬───────┬───┐
   │Line│with│initial│tab│
   └────┴────┴───────┴───┘

Once the Basic Tokenizer is working, the next step is to apply one of the three popular subword tokenizers: byte-pair encoding, word-piece tokenization, or sentence-piece tokenization. Word-piece seems pretty popular, and looking at the llama.cpp code it is fairly straightforward to implement in J. The only issue is that it requires looping control statements in J; I haven't found a true J way to implement it.

Now for Something Completely Different - a brute force word-piece tokenizer

I originally came across some Julia code that implements a word-piece tokenizer in a nested loop. It's a short piece of code that you can find at https://github.com/tmcguirefl. However, once I understood the Julia code I was able to find the equivalent code in llama.cpp. The following is the J implementation based on the llama.cpp code:

   NB. the following map implementation is from the old jprogramming mailing list
   NB. subject line: [Jprogramming] Dictionary with Strings as Keys
   NB. author: Ian Clark
   NB. link: http://www.jsoftware.com/pipermail/programming/2017-April/047176.html
   NB. original lookup definition:
   NB. lookup=: 4 : '>1{ x {~ ({."1 x) i. <y'
   NB.
   NB. Bill Lam
   map =: _2 ]\ '';0;'snow';1;'##board';2;'##ing';3
   
   NB. lookup has been changed to return 0 by taking modulo of index
   NB. the 0 index of the map is a null entry that will return 0 for the value
   lookup =: {{>1{ x {~ (#x)|({."1 x) i. <y}}
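   NB. quick check of lookup against the map above:
   NB.   map lookup 'snow'   returns 1
   NB.   map lookup '##ing'  returns 3
   NB.   map lookup 'nope'   returns 0 (index wraps to the null entry)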
   
   NB. generate substring indexes
   sseq =: [ + [: i. -~
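   NB. e.g. 2 sseq 5  is  2 3 4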
   
   NB. Brute force wordpiece tokenization translated to J from the Julia implementation at:
   NB. https://github.com/SeanLee97/BertWordPieceTokenizer.jl/blob/main/src/BertWordPieceTokenizer.jl
   NB. and
   NB. from the megatron bert-tokenizer code in python
   NB. https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/tokenizer/bert_tokenization.py
   NB. 
   tokenize =: 3 : 0
   word =. y
   max_chars_per_word =. 200     NB. limit taken from bert_tokenization.py
   tokens =. <''                 NB. dummy first box, dropped before returning
   i =. 0
   if. max_chars_per_word < #word do.
     <'[UNK]'
     return.
   end.
   while. i < #word do.
     j =. #word
     cur_sub =. ''
     while. j > i do.
       sub =. (i sseq j){word
       smoutput sub
       if. i > 0 do.
         sub =. '##',sub
       end.
       tokenid =. map&lookup sub
       if. tokenid ~: 0 do.
         cur_sub =. sub
         break.
       end.
       j =. j - 1
     end.
     if. cur_sub -: '' do.
       NB. no known piece starts here: emit [UNK] for this position and
       NB. move on one character (the BERT reference instead marks the
       NB. whole word as [UNK])
       tokens =. tokens,<'[UNK]'
       i =. i + 1
     else.
       tokens =. tokens,<cur_sub
       i =. j
     end.
   end.
   }. tokens
   )

I placed an smoutput statement in this code so you can see how the processing proceeds. The lookup just does a linear search of the boxed map (matching whole boxed strings), purely as a proof of concept.

   NB. output of the brute force word-piece tokenizer
      tokenize 'snowboarding'
   snowboarding
   snowboardin
   snowboardi
   snowboard
   snowboar
   snowboa
   snowbo
   snowb
   snow
   boarding
   boardin
   boardi
   board
   ing
   ┌────┬───────┬─────┐
   │snow│##board│##ing│
   └────┴───────┴─────┘
   
      tokenize 'slowboarding'
   slowboarding
   slowboardin
   slowboardi
   slowboard
   slowboar
   slowboa
   slowbo
   slowb
   slow
   slo
   sl
   s
   lowboarding
   lowboardin
   lowboardi
   lowboard
   lowboar
   lowboa
   lowbo
   lowb
   low
   lo
   l
   owboarding
   owboardin
   owboardi
   owboard
   owboar
   owboa
   owbo
   owb
   ow
   o
   wboarding
   wboardin
   wboardi
   wboard
   wboar
   wboa
   wbo
   wb
   w
   boarding
   boardin
   boardi
   board
   ing
   ┌─────┬─────┬─────┬─────┬───────┬─────┐
   │[UNK]│[UNK]│[UNK]│[UNK]│##board│##ing│
   └─────┴─────┴─────┴─────┴───────┴─────┘

Trie in J

This trie implementation came about while I was investigating AI architectures, specifically the Transformer architecture. My major complaint is that the current state of AI is lacking in descriptive details; it is all very "blackbox"-ish. The real code hides beneath the Python subroutine APIs that everyone uses to access the underlying architecture, which is usually written in C/C++. On a cursory internet search there are plenty of articles and discussions about using the Python libraries, but very little comes up on what the underlying architecture looks like.

Digging deeper I came across a Google blog post that describes the Transformer AI architecture. This architecture seems to excel at language translation. The main point of the Google article is how to break up text sentences so they can be fed into the underlying AI neural network. This is called tokenization. Words must be converted into numbers so they can be used by the neural network. There are different types of tokenization in AI: byte-pair, word-piece, and sentence-piece. Word-piece tokenization seemed popular, and it was the subject of the Google blog article. They used a trie data structure to provide O(n) token lookup for breaking words into meaningful pieces.

So now I had a small piece of AI architecture carved out for study. I thought: well, a trie is just a tree, J can implement trees with arrays, so there must be an implementation somewhere. Searching the J Software site I found that NYCJUG had discussed this topic briefly in 2013. The discussion grew out of someone's complaint about what is missing in J, and tries were mentioned. Devon included a reply to that message in the NYCJUG notes, whose author said that J can usually implement what you need, and that many times, if you rethink the problem, you may find that J can do it in another way that is simpler.

Trie Implementation

This follows the simple matrix version of a trie described in the paper: J. Aoe, K. Morimoto and T. Sato, ‘An efficient implementation of trie structures’, Software—Practice and Experience, 22, 695–721 (1992). https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=f2ad2a67218e458a979afafb311d1d7867f3f624

   NB. trie implementation in J
   NB. both full matrix and sparse matrix implementations in one file
   NB. 
   cocurrent 'jtrie'
   NB. constants and a view function 
   NB. taken from Roger Stokes Learning J chapter 30
   NB. https://www.jsoftware.com/help/learning/30.htm
   INTEGERZERO =: 3 - 3
   INTEGERONE =: 3 - 2
   view =: 0 & $.
   
   NB. create initializes the state variables and the empty trie
   NB. y - a boxed pair: sparse flag (1 for sparse matrix; 0 for full matrix),
   NB.     followed by the alphabet to use with this trie
   create=: 3 : 0
   'sparse alphabet' =: y NB. '#ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
   if. sparse do.
     trierow =: $. (1,#alphabet)$INTEGERZERO
   else.
     trierow =: (1,#alphabet)$INTEGERZERO
   end.
   
   trie =: trierow,trierow
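   NB. row 0 is the null (fail) state and row 1 is the root;
   NB. column 0 (the first alphabet character's column) doubles as the
   NB. slot where strval_ins stores a word's value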
   current_state=: 1
   max_rows=: 1
   )
   
   destroy =: 3 : 0
   NB. release any resources acquired
   
   codestroy   NB. release the instance, when it finishes execution
   )
   
   char_ins=: 3 : 0
   next_state =. (<current_state,alphabet i. y){trie
   NB. smoutput next_state   NB. uncomment to trace insertions
   if. INTEGERZERO = next_state do.
     NB. adding in new transition
     trie =: trie,trierow
     max_rows=: max_rows+1
     trie=: max_rows (<current_state,alphabet i. y)}trie
     current_state =: max_rows
   else.
     NB. transition for the letter already exists
     NB. just transition to the next state
     current_state =: next_state
   end.
   )
   
   strval_ins =: 4 : 0
   NB. encode a string into the trie and place the value y
   NB. in its end recognition state
   _1 char_ins\x
   trie =: y (<current_state,0)}trie
   current_state =: 1
   )
   
   NB. inc_lookup a verb to use with \ to ratchet letters of a string
   NB. through the trie
   NB. usage to check a string: '_1 inc_lookup\<your word>'
   NB. after running current state will be in a recognition state or
   NB. state 0 indicating word not in trie
   inc_lookup =: 3 : 0
   current_state =: (<current_state,alphabet i. y){trie
   )
   
   NB. routine that returns the next state based on character presented
   NB. this is set up to be used by fold:
   NB.   1 ] F:. flookup 'bachelor'
   NB. the big difference in this usage is the fold is terminated as soon
   NB. as a failed transition occurs. This should facilitate word-piece 
   NB. tokenizing.
   flookup =: 4 : 0
   NB. y is the current state
   NB. x is the character to transition on 
   ns =. (<y,alphabet i. x){trie
   if. ns = 0 do. 
     NB. 2 reasons we get here
     NB. the string does not exist in the trie
     NB. we may have recognized a portion of the string as 
     NB. being in the trie
   NB. use Z: to interrupt further folding
     1 Z: 1
     0
     return. 
   end.
   ns
   )
   
   
   NB. some routines for testing purposes
   
   NB. a naive tokenization is just to test words against a trie
   NB. for that we can just use the '\' operator as follows
   naive_tok =: 3 : 0
   current_state =: 1
   _1 inc_lookup\y
   (<current_state,0){trie
   )
   
   NB. a naive tokenizer using fold
   NB. this hides the fact that fold will stop as soon as you hit 
   NB. a transition to the 0 state. If you run the fold as:
   NB. 1 ] F:. flookup '<string>'
   NB. the output is the complete list of state transitions. 
   NB. if the last state is 0 then unrecognized
   NB. otherwise use the last state to look up the integer token
   ftok =: 3 : 0
   current_state =: 1
   states =. current_state ] F:. flookup y
   (<(_1 {. states),0){trie
   )
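
A quick usage sketch of the class (the words and token values are illustrative, and this assumes the jtrie definitions above are loaded):

      t =: (0;'#abcdefghijklmnopqrstuvwxyz') conew 'jtrie'
      'snow'  strval_ins__t 1
      'board' strval_ins__t 2
      naive_tok__t 'snow'      NB. 1
      naive_tok__t 'board'     NB. 2
      ftok__t 'snowman'        NB. 0 - 'm' falls off the trie
      destroy__t ''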


Handling Numbers

Finding out how to handle numbers in text is a little difficult to search for, since the point of tokenization is to translate text into numbers: a Google search turns up a ton of links on the latter and very few on the former. The best review I have found so far is by Andrew Carr on Twitter: https://x.com/andrew_n_carr/status/1714326003030638848 where he reviews how numbers are being handled in LLMs. I tend to think a [NUM] token is the best way of handling this for now.
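
As a rough sketch of what that could look like on top of the Basic Tokenizer above (the isnum and numtok names and the [NUM] literal are my own choices, not from any library):

   NB. replace all-digit tokens from BT with a single [NUM] token
   isnum =: *./@(e.&'0123456789')   NB. 1 if every character is a digit
   numtok =: {{ (<'[NUM]') (I. isnum&> y)} y }}

      numtok BT 'call me in 15 minutes'
   ┌────┬──┬──┬─────┬───────┐
   │call│me│in│[NUM]│minutes│
   └────┴──┴──┴─────┴───────┘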