Popcount

Everybody on the net does bit population count. I needed to do it for XCB. My criteria were:

X has always used the venerable HAKMEM 169 algorithm, but this doesn't seem very good for the above criteria. Keithp, who inserted the HAKMEM code, tells me he selected it mainly because it was the first reasonable thing he came across when he was looking for an algorithm.

Here's some code I wrote to benchmark some tries at a replacement. The Wikipedia page is a fairly good starting point. This bonus material from Hacker's Delight was a useful reference. I'd lost my copy of the book, but I've now replaced it and I can say that there's nothing better in there, which is a pretty strong statement.

I think my code is also a reasonable tutorial for benchmarking frameworks for C, illustrating some reasonable techniques. There's still some things I would fix, though: Right now

Of the things I tried the one that seems to be second-fastest on my 2.13GHz Core II Duo (generously given me by Intel—thanks much!) is

/* Divide-and-conquer with a ternary stage to reduce masking */
/* 17 ops, 2 long immediates, 12 stages */
static inline uint32_t
popcount_3(uint32_t x)
{
    uint32_t m1 = 0x55555555;
    uint32_t m2 = 0xc30c30c3;
    x -= (x >> 1) & m1;
    x = (x & m2) + ((x >> 2) & m2) + ((x >> 4) & m2);
    x += x >> 6;
    return  (x + (x >> 12) + (x >> 24)) & 0x3f;
}

This is a classic divide-and-conquer popcount. The only strange twists are the initial ternary stage, which gets rid of a masking stage by making room a little earlier, and the final ternary stage, which gives a few more available ops to a superscalar machine at the end. There's also the standard clever trick to save a mask and op in the first stage.

The slightly strange thing is that the classic binary divide-and-conquer straight off Wikipedia is quite a bit faster.

/* Classic binary divide-and-conquer popcount.
   This is popcount_2() from
   http://en.wikipedia.org/wiki/Hamming_weight */
/* 15 ops, 3 long immediates, 14 stages */
static inline uint32_t
popcount_2(uint32_t x)
{
    uint32_t m1 = 0x55555555;
    uint32_t m2 = 0x33333333;
    uint32_t m4 = 0x0f0f0f0f;
    x -= (x >> 1) & m1;
    x = (x & m2) + ((x >> 2) & m2);
    x = (x + (x >> 4)) & m4;
    x += x >>  8;
    return (x + (x >> 16)) & 0x3f;
}

Why is this strange? My CPU should certainly be able to dispatch 4 arithmetic ops per cycle and keep everything in registers. Thus, the limiting factor should be either the number of pipeline stages used or the number of immediate loads performed. However, the ternary algorithm has fewer of both of these than the binary one. I fear that the timing loop may be interfering with my results, or that the compiler or the CPU may be losing on dispatch. (Update: I had said "Now that I think about it, though, what's more likely is that the shifts by small powers of two are being done in the address unit as per standard Intel practice, where shifts by multiples of three need to be done on the ALU, reducing the number of pipeline stages for the binary version substantially." It turns out I was wrong; Mike Haertel showed me that the number of instructions is all that will matter here on an Intel processor because of false dependencies.) On a superscalar RISC processor, I'd actually expect the first algorithm to do better.

The actual performance numbers are:

$ ./popcount 1000000
popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter
popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter
popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter
popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter
popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter
popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter
popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter
popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter

These numbers overall compare quite well with what you would predict based on the number of pipeline stages and arithmetic ops in the algorithms; they say astounding things about the hundreds of picoseconds (picoseconds!) an individual operation takes on a modern processor. While it isn't suitable for my immediate needs, check out the multiply-based implementation, which uses the multiply for the accumulation step at the end. Amazing that it's faster than a couple of shifts and adds on Intel's current flagship parts.

Update: At the suggestion of Bill Trost, I added 8 and 16-bit table-driven popcount implementations. These perform the fastest in the benchmark test-stand, at about 3ns/iteration. They're about the same speed, so one would obviously use the 8-bit version.

The problem with this is that I don't believe it. It's amazing that Intel's L1 cache is so cheap. However, one can expect that the table lookup would occasionally miss this small cache during normal usage, which would be devastating to the performance of this approach. The tight loop here is guaranteeing us that the table becomes locked in cache. Maybe when I get some more energy I'll create a more realistic testbench, and we'll see how the table-driven algorithms compare to the pure-computational ones in 2008.

Update: At the suggestion of David Taht, Mike Haertel and I investigated the anomalously good performance of the divide-and-conquer popcounts on some 64-bit machines running gcc 3.4. Turns out that this version of gcc is smart enough to notice that the original version of the popcount code didn't have a loop dependency, and that all 64-bit machines have SSE: it was running four iterations of popcount in parallel on the MMX unit. :-) Fixed by introducing the obvious dependency.