# Essays/ImplementingInnerProduct

*by Bob Bernecky, communicated by Ian Clark. April 2021.*

A few people at IPSA (I was not among them, alas. My days in supercomputing lay ahead.) implemented a STAR APL, an APL interpreter for the CDC STAR-100 supercomputer, then being designed and built just outside Toronto. This machine had a memory-to-memory architecture (no registers, vector or otherwise), as was fairly common at the time (IBM 1620, IBM 1401). It took a long time for a STAR instruction to get started, but it ran at a very good clip then, much like typical APL interpreters.

Hence, just like good APL code, good STAR code encouraged minimizing instruction counts to get more results per op by vector ops.

The crew implementing STAR APL realized that a row-column scalar inner product was not going to work well, so somebody (I don't know who, but would like to find out, so that I can give them credit in the future...) tweaked the computation loop order so that, for Z←⍺ F.G ⍵

- each element of ⍺ was fetched exactly once - that element would be applied against an entire row of ⍵, scalar-vector:

t←⍺[k;m] G ⍵[i;] - the resulting vector, t, would be applied vector-vector: t2←Z[j;] G t

If the STAR was like other CDC/CRAY architectures, it hardware-fused those two ops, so never actually generated t. [They had a phrase to describe that generalized fusion capability, but I can't remember what it was called.]

Alas, the STAR's Achilles' Heel was the slow startup time for instructions. This meant that it worked great on big arrays, and poorly on small ones. [Does this sound like an APL interpreter?] Hence, later CDC/CRAY architectures had much-improved scalar support.

Back to Booleans: I was being pestered by one of the deep-pocket IPSA customers to "fix" the dismal performance of inner product when one of the arguments was Boolean. I remembered the STAR APL algorithm, and realized that it could enable a few Good Things:

1. Application of G could work a row of ⍵ at a time, so for common G (∨ ∧ ...), we could compute 32 bits of that with one instruction (Booleans are stored one-bit per element, in ravel order.).

2. Application of F often would also allow operating on 32 bits at a time.

3. If an element of ⍺ is an identity for G, we can skip that computation step, and just use the row of ⍵.

4. Similarly, if an element of ⍺ is a "zero" for G, AND if that "zero" is a left identity for F, we can skip both computations.

And so on. The cost of the checks in 3 and 4 are amortized over an entire row of ⍵, giving us superior performance on densely stored sparse arrays.

Fast skips over elements of ⍺ are trivial to implement.

I think this was at the time when we were making substantial interpreter changes for Hitachi, so a bit of good work on inner product fit well into the picture. I convinced dba (David Allen) to do the low-level design and implementation, with wonderful results.

**Designed**is one of those computer words akin to T.S Eliot's*good writers borrow, great writers steal.*