Guides/AVX

Background

Development of the J interpreter started in 1989. Much of the code was written in the 1990s, with 1990s CPUs in mind. As processors have developed, some of this code has become dated.

Processor Developments

The last 25 years of processor development have dramatically altered the architecture of the CPU:

  • Clock speeds have risen from 40 MHz to 3 GHz
  • Time to fetch from a new main-memory address has hardly improved at all
  • Caches, at least 3 levels deep, reduce average memory latency
  • To address the mismatch between memory and CPU speeds, processors have abandoned the fetch/decode/execute model and instead moved to the out-of-order execution model (also called the dataflow model), in which instructions are issued long before they can be executed and hang around waiting for their operands to become available
  • Branch instructions are encountered long before the condition code they refer to is known, and the processor must make an educated guess (called a branch prediction) about whether the branch will be taken or not. There is a sizable penalty for an incorrect guess.
  • New SIMD instructions have been added to allow the programmer to perform several copies of the same computation on different data values
  • New specialty instructions perform complex operations, such as CRC calculation or DES encryption, very efficiently
  • Multiplication has become comparatively fast, but division, especially integer division, has remained slow.

Updating the J interpreter

The refurbishment of the J interpreter has involved visiting important modules and bringing them up to date by

  • reordering memory accesses to avoid cache misses
  • using narrower tables where possible to keep them in cache
  • rewriting unpredictable branch instructions to use non-branching forms (see the sketch after this list)
  • avoiding instructions that are slow or compete for processor resources
  • using SIMD and specialty instructions where possible
  • creating new algorithms as needed when the previous ones could not be repaired
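
As an illustration of the branch-free rewriting mentioned above, consider counting the items of an array that fall below a threshold. This is only a sketch in C, not code from the interpreter, and the function name is invented for the example. A data-dependent "if" on essentially random data is mispredicted about half the time, at a cost of many cycles per miss; folding the comparison into the arithmetic removes the branch entirely.

    #include <stddef.h>
    #include <stdint.h>

    /* Branch-free count of items below a pivot.  The comparison produces
       0 or 1, which is simply added to the total; there is no conditional
       jump for the processor to predict. */
    size_t count_below(const int64_t *in, size_t n, int64_t pivot)
    {
        size_t nbelow = 0;
        for (size_t i = 0; i < n; ++i)
            nbelow += (size_t)(in[i] < pivot);
        return nbelow;
    }

A branching version (if (in[i] < pivot) ++nbelow;) computes the same result, but its running time depends on how predictable the data is; the branch-free form runs at the same speed regardless of the data.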

This refurbishment has been applied to the following portions of the J interpreter:

i.-family

The i.-family includes all the searching operations. They are: (x i. y), (x i: y), (~. y), (x -. y), (~: y), (x u/. y), (x e. y), (x +/@:e. y), (x +./@:e. y), (x *./@:e. y), (x (e. i. 1:) y), (x (e. i. 0:) y), (x (e. i: 1:) y), (x (e. i: 0:) y), (x I.@:e. y).

The update involved:

  • Using the CRC32 instruction to create hash keys (see the sketch after this list)
  • Keeping wide and narrow hash tables for better cache usage
  • Managing the hash-table values to avoid the expense of clearing them
  • Recoding the hash loops to avoid mispredicted branches
  • Using a packed-bit datatype for Boolean tables when byte-wide Booleans would overfill the cache
  • Using new algorithms when hashing the x argument would overrun cache
  • Careful instruction scheduling to avoid bottlenecks
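
The sketch below shows how a CRC32 instruction can be used to form a hash-table index. It is illustrative only, not the interpreter's actual hashing code; the function name, the starting value, and the power-of-two table size are assumptions made for the example. CRC32 belongs to SSE4.2, which is present on every CPU that supports AVX, and it reduces eight bytes to a well-mixed 32-bit value in a single instruction.

    #include <stdint.h>
    #include <nmmintrin.h>   /* _mm_crc32_u64; compile with -msse4.2 */

    /* Hash one 64-bit atom into a table whose size is a power of two.
       (Not the interpreter's code; a minimal illustration of the idea.) */
    static inline uint64_t crc_hash(uint64_t x, uint64_t table_size)
    {
        uint64_t h = _mm_crc32_u64(0xFFFFFFFFu, x);  /* single-instruction mix */
        return h & (table_size - 1);                 /* mask down to an index  */
    }

A longer key can be hashed by chaining CRC32 steps, feeding each result back in as the starting value for the next eight bytes, so even multi-word items reduce to an index in a handful of instructions.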

Sort/Grade

Sort/Grade includes primitives for sorting ((/:~ y) and (\:~ y)) and grading ((/: y) and (\: y)). These primitives use a variety of algorithms depending on the type and range of the argument; if no special case applies, (/: y) is calculated by a mergesort on indexes into the data, and the sort (/:~ y) is executed as (y ({~ /:) y), that is, grade followed by reorder. The update involved:

  • Keeping wide and narrow tables for radix and small-range sorts (see the sketch after this list)
  • Recoding the comparisons to avoid mispredicted branches
  • Creating a version of sort that does not use grade followed by reorder, which is much faster for large sorts, where pointer references lose cache locality
  • Adding checks to speed up processing of almost-sorted inputs
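
A small-range sort of the kind mentioned in the first item works from a table of counts rather than by comparing items. The sketch below is illustrative only; the function name, the 32-bit count width, and the error handling are assumptions, not the interpreter's code. The point is that the only working storage is the count table, which is kept as narrow as the counts allow so that it stays resident in cache.

    #include <stdint.h>
    #include <stdlib.h>

    /* Counting sort for n integers known to lie in [lo, lo+range).
       Assumes n fits in a 32-bit count.  Returns 0 on success. */
    int smallrange_sort(const int64_t *in, int64_t *out, size_t n,
                        int64_t lo, size_t range)
    {
        uint32_t *count = calloc(range, sizeof *count);  /* narrow count table */
        if (!count) return -1;
        for (size_t i = 0; i < n; ++i)        /* tally each value */
            ++count[in[i] - lo];
        size_t k = 0;
        for (size_t v = 0; v < range; ++v)    /* emit values in ascending order */
            for (uint32_t c = count[v]; c; --c)
                out[k++] = lo + (int64_t)v;
        free(count);
        return 0;
    }

The tally loop never compares one item with another, so there is nothing for the branch predictor to guess; each input item costs one read and one increment of an in-cache counter.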

Inner Product (x +/ . * y)

Matrix multiplication requires getting the most computation out of each memory access. As processors have become faster than memory, calculating a matrix product has become limited by memory bandwidth rather than arithmetic.

The computation of (x +/ . * y) can be factored in many different ways. The original J interpreter code performed a sequence of accumulating outer products (+/@:*"1), which requires a read-modify-write of the entire result per outer product. At the other extreme is a sequence of inner products (row times column), which reads a full row and a full column per atom of the result, for about the same total memory traffic. A middle course, which divides the computation into a sequence of accumulating tiles, requires far less total I/O into the CPU, because each section of x and y brought into registers is reused across the whole tile.

The update involved:

  • Reorganizing the computation to use accumulating tiles (see the sketch after this list)
  • Using SIMD instructions for the computation
  • Using prefetch to make sure all reads hit cache
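
The sketch below illustrates the accumulating-tile idea with AVX intrinsics. It is a simplified illustration, not the interpreter's kernel: the 4x4 tile, the row-major layout, the prefetch distance, and the function name are all assumptions made for the example, and the real code chooses tile sizes to match the cache and handles arbitrary shapes.

    #include <stddef.h>
    #include <immintrin.h>   /* AVX intrinsics; compile with -mavx */

    /* z = x (+/ . *) y for row-major doubles: x is m*n, y is n*p, z is m*p.
       For brevity, m and p are assumed to be multiples of 4. */
    void tiled_matmul(const double *x, const double *y, double *z,
                      size_t m, size_t n, size_t p)
    {
        for (size_t i = 0; i < m; i += 4)         /* 4 rows of the result tile  */
            for (size_t j = 0; j < p; j += 4) {   /* 4 columns: one AVX register */
                __m256d acc0 = _mm256_setzero_pd();
                __m256d acc1 = _mm256_setzero_pd();
                __m256d acc2 = _mm256_setzero_pd();
                __m256d acc3 = _mm256_setzero_pd();
                for (size_t k = 0; k < n; ++k) {
                    __m256d yv = _mm256_loadu_pd(&y[k*p + j]);   /* 4 atoms of y */
                    if (k + 8 < n)                               /* keep reads in cache */
                        _mm_prefetch((const char *)&y[(k+8)*p + j], _MM_HINT_T0);
                    /* broadcast one atom from each of 4 rows of x and accumulate */
                    acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_broadcast_sd(&x[(i+0)*n + k]), yv));
                    acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_broadcast_sd(&x[(i+1)*n + k]), yv));
                    acc2 = _mm256_add_pd(acc2, _mm256_mul_pd(_mm256_broadcast_sd(&x[(i+2)*n + k]), yv));
                    acc3 = _mm256_add_pd(acc3, _mm256_mul_pd(_mm256_broadcast_sd(&x[(i+3)*n + k]), yv));
                }
                _mm256_storeu_pd(&z[(i+0)*p + j], acc0);   /* write the finished tile */
                _mm256_storeu_pd(&z[(i+1)*p + j], acc1);
                _mm256_storeu_pd(&z[(i+2)*p + j], acc2);
                _mm256_storeu_pd(&z[(i+3)*p + j], acc3);
            }
    }

In this arrangement each load of four atoms of y contributes to sixteen atoms of the result before it leaves the registers, rather than one, which is why the tiled form needs far less memory bandwidth than a straight sequence of outer or inner products.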

Not only is the reordered computation faster than the original, it gets faster as the CPU does, indicating that the computation is still not memory-limited. The AVX2 instruction set, and further processor improvements, will produce even faster results in future releases.

Summary

The refurbishment of the J interpreter, though called "AVX", gets about half its improvements from specialty and SIMD instructions, and the other half from careful cache management and instruction scheduling. As time and manpower permit, the project will continue to other parts of the interpreter, and take advantage of continuing advances in CPU architecture.

The performance improvements cannot be summarized in a single number, or even a handful of numbers, because they depend on the size, type, shape, and values of arguments. Much of the reworked code runs at least twice as fast as the original. The benchmark comparisons between 8.05 and 8.06 with AVX go into more detail.

It is safe to say that you should use the AVX code if your system supports it.