Essays/Sort Times

From J Wiki
Jump to: navigation, search

I have observed a puzzling timing result which may be CPU dependent. On my machine, a Dell XPS L521X Intel Core i7-3632QM @ 2.2 GHz, Windows 7 Pro 64-bit,

   x=: 1e6 $ ' '
   y=: a. {~ ? 1e6 $ 256
   %/ 100 (6!:2)&> '/:~x' ; '/:~y'

Basically, sorting a constant vector of 1-byte ints takes 2.4 times as long as sorting a random vector of 1-byte ints.

J Forum correspondents contributed timings on various systems:

ratio        system
 2.44   Dell XPS L521X Intel Core i7-3632QM @ 2.2 GHz, Windows 7 Pro 64-bit  
 2.35   Intel i7, Win 64  
 2.21-2.62   Intel Core i7-3820QM @ 2.70GHz, Linux64  
 2.18   2.18 on Intel 2670QM @ 2.20 GHz, Win 64  
 2.00   Macbook Air, 1.8 GHz Intel Core i5  
 1.93   iPhone 5S  
 1.77   3,07 GHz Intel Core i7 950  
 1.76   Intel i7, Win 32  
 1.67   MacBook Pro 2.4 GHz Intel Core 2 Duo running JHS on Safari  
 1.52   MacBook Pro 2.4 GHz Intel Core 2 Duo, Darwin 64  
 1.12   iMac 2.93 GHz Intel Core i7  
 1.02   HP Pavilion du9700 Notebook PC, AMD Turion(tm) 64 X2 Mobile Technology TL-60  
 0.999   Intel(R) Xeon(R) CPU X3430 @ 2.40GHz, Linux64  
 0.996   iPhone 4S  
 0.977   e-machines N450 Intel Atom  
 0.969   ARMv7  
 0.896   iPhone 3GS  

The ratios divide into three groups: over 2, around 1.7, and around 1.

The core of the C implementation is a loop to count the number of occurrences of each of the 256 possibilities. The same code is executed whether the argument is a vector of a single value or of multiple values.

I1 *x;        // pointer to run through the vector argument
I4 z[256];    // table of counts for each of the 256 possibilities

The C loop generates the following machine code (Visual Studio 2012, Windows 7 Pro 64-bit):

000000013FF23740  movsx       rax,byte ptr [rbx]
000000013FF23744  lea         rbx,[rbx+1]
000000013FF23748  inc         dword ptr [rsi+rax*4+200h]
000000013FF2374F  dec         rdi
000000013FF23752  jne         foo (013FF23740h)

I consulted my local friendly hardware guru AKA RBE AKA Bob Bernecky, and he said a bunch of things, the most relevant of which seems to be:

Now, there is another little-advertised aspect of cache memories that may be doing us in here. Repeated, back-to-back stores into the same cache word (or line - I'm not sure exactly...) can cause stalls.

The following timings are consistent with Bob's comment:

   timer=: 6!:2
   3 : '100 timer ''/:~x'' [ x=. 1e6$y'&> 'a'; 'az'; 'azb'; 'azby'; (a.{~?~256); a.{~?1e5$256
0.00252941 0.00130249 0.00106575 0.000985355 0.000970781 0.000991443

Makes sense. You make the one little word work hard (in C, z[*x++]++ where *x is the same value) and it gets tired. :-)

Contributed by Roger Hui.