Sunday, August 14, 2011

Useless: Multiplication

... so I was driving, and for some reason, I thought of processors without a builtin multiply instruction. So, the obvious general-purpose-multiplicator would, of course, be something like:

u32
mul( u16 a, u16 b )
{
 u32 res = 0;

 while ( a-- ) 
 {
  res += b;
 }
 return res;
}


Well, obviously, this is slow, so I figured I could do some pre-processing - if the count variable is larger than the value variable, then I can simply swap them, so that I have less loop iterations.

u32
mulTrivial( u16 a, u16 b )
{
 u32 res = 0;
 
 if ( !b )
  return 0;
 if ( !a )
  return 0;

 if ( a > b )
 {
  u16 t = a;
  a = b;
  b = t;
 }

 while ( a-- ) 
 {
  res += b;
 }
 return res;
}


Well, this is still deadly slow. So, while driving 160kmh/100mph on the Autobahn, I thought of an optimized multiply algorithm. This is probably not original, but I'm proud of myself that I "invented" it independently. It works like this:

u32
mulOptimized1( u16 _cnt, u16 _val )
{
 u32 val = (u32) _val;
 u32 res = 0;
 
 if ( !_val )
  return 0;
 
 while ( _cnt ) 
 {
  if ( _cnt & 1 )
  {
   res += val;
   --_cnt;
  }
  else
  {
   _cnt >>= 1;
   val <<= 1;
  }
 }
 return res;
}


The basic principle behind this algorithm is that for two values X and Y, X*Y is the same as (X/2)*(2*Y). When _cnt is divisible by two, it takes advantage of the fact that multiplying and dividing by two is trivial (a single bitshift), and thus does a shift so that it requires much less loop iterations.

So I spent the few minutes making a test case, with benchmarking and checking that the results are all equal. I know, waste of time, but the results for 16777216 values are:

Algorithm 'Reference': 0.52s
Algorithm 'Trivial':   361.42s
Algorithm 'Optimized': 2.36s


I thought of some more optimizations, the first one reduces the need to do the '_cnt != 0' check every loop iteration where the lowest bit of _cnt is set:

u32
mulOptimized2( u16 _cnt, u16 _val )
{
 u32 val = (u32) _val;
 u32 res = 0;

 if ( !_val )
  return 0;

 while ( _cnt ) 
 {
  while ( !( _cnt & 1 ) )
  {
   _cnt >>= 1;
   val <<= 1;
  }
  
  res += val;
  --_cnt;
 }
 return res;
}


And another optimization I thought of is for processors that include a bit-scan instruction, and have a barrel shifter for arbitrary shifts (things like the 8086 only supported shifting by 1), and it looks like this:

u32
mulOptimized3( u16 _cnt, u16 _val )
{
 unsigned long val = (u32) _val;
 u32 cnt = (u32) _cnt;
 u32 res = 0;
 unsigned long index;

 if ( !_val )
  return 0;

 while ( _BitScanForward( &index, cnt ) )
 {
  
  cnt >>= index;
  val <<= index;
  res += val;
  --cnt;
 }
 return res;
}


(_BitScanForward returns true if cnt is non-zero, and returns the bit index in index).

Doing another benchmark run as above, the result is this:

Algorithm 'Reference':  0.52s
Algorithm 'Trivial':    366.63s
Algorithm 'Optimized1': 2.36s
Algorithm 'Optimized2': 2.39s
Algorithm 'Optimized3': 1.84s


Now sure, these results are measured on a very modern CPU with branch prediction and pipelining and a bunch of other fancy stuff, but I'm still kinda proud that the fastest method I wrote is only 3.5 times as slow as the 'Reference', which is simply the processor's native mul instruction. The bit scan trick worked very well, as I expected, but it's interesting that Optimized2 (which skips the '_cnt != 0' check when it can) is actually slower than Optimized1. I think I need to try this test on some old CPU somehow. Maybe DOSBox has some sort of mode that has cycle-accurate timing, in which case this is a much more reliable test, especially as it's testing something meant for deadly old processors.

Now, here's something very interesting. I did the above benchmarks in a Debug build. I tried this again in an optimized Release build (MSVC, obviously), and this is what I got:

Algorithm 'Reference':  0.08s
Algorithm 'Trivial':    0.17s
Algorithm 'Optimized1': 2.18s
Algorithm 'Optimized2': 2.15s
Algorithm 'Optimized3': 0.47s


Now, the Reference algo and the three Optimized algos ran as expected (the third one performed even better in relation!). However, the Trivial algorithm ran faster than all the other ones. Something wasn't right there, so I took a look at the generated assembly code. The checks and the swap remained in there, but look at what it did to the actual multiplication loop:

002E103C  movzx       eax,cx 
002E103F  movzx       ecx,dx 
002E1042  imul        eax,ecx


MSVC was smart enough to optimize that into a single multiplication instruction. I'm fucking amazed.

---

I know, this is completely useless, but a few minutes of fun none-the-less, and I learned just how smart MSVC can be in some situations.