Popcount
Everybody on the net does bit population count. I needed to do it for XCB. My criteria were:
- Should be fast even on cruddy machines (so limit operators to integer ones and try to avoid multiply and especially divide).
- Should be deterministically fast in the worst case.
- Should be comprehensible / maintainable.
X has always used the venerable HAKMEM 169 algorithm, but this doesn't seem very good for the above criteria. Keithp, who inserted the HAKMEM code, tells me he selected it mainly because it was the first reasonable thing he came across when he was looking for an algorithm.
Here's some code I wrote to benchmark some tries at a replacement. The Wikipedia page is a fairly good starting point. This bonus material from Hacker's Delight was a useful reference. I'd lost my copy of the book, but I've now replaced it and I can say that there's nothing better in there, which is a pretty strong statement.
I think my code is also a reasonable tutorial for benchmarking frameworks for C, illustrating some reasonable techniques. There's still some things I would fix, though: Right now
- The op is called in a tight loop, so self-pipelining is in play. This isn't very realistic. However, it's hard to build a reasonable set of calling environments, and hard to calibrate out their effects.
- I only report elapsed time. This is probably OK for this application, but it would be easy to add user and system time reporting.
- The user gives the number of iterations of a fixed-size block. I should do the x11perf thing and have the user supply a desired number of seconds instead, and use the preheat loop to calibrate. (Update: I now do a sort of cheesy approximation to this. See the source.)
Of the things I tried the one that seems to be second-fastest on my 2.13GHz Core II Duo (generously given me by Intel—thanks much!) is
This is a classic divide-and-conquer popcount. The only strange twists are the initial ternary stage, which gets rid of a masking stage by making room a little earlier, and the final ternary stage, which gives a few more available ops to a superscalar machine at the end. There's also the standard clever trick to save a mask and op in the first stage./* Divide-and-conquer with a ternary stage to reduce masking */ /* 17 ops, 2 long immediates, 12 stages */ static inline uint32_t popcount_3(uint32_t x) { uint32_t m1 = 0x55555555; uint32_t m2 = 0xc30c30c3; x -= (x >> 1) & m1; x = (x & m2) + ((x >> 2) & m2) + ((x >> 4) & m2); x += x >> 6; return (x + (x >> 12) + (x >> 24)) & 0x3f; }
The slightly strange thing is that the classic binary divide-and-conquer straight off Wikipedia is quite a bit faster.
Why is this strange? My CPU should certainly be able to dispatch 4 arithmetic ops per cycle and keep everything in registers. Thus, the limiting factor should be either the number of pipeline stages used or the number of immediate loads performed. However, the ternary algorithm has fewer of both of these than the binary one. I fear that the timing loop may be interfering with my results, or that the compiler or the CPU may be losing on dispatch. (Update: I had said "Now that I think about it, though, what's more likely is that the shifts by small powers of two are being done in the address unit as per standard Intel practice, where shifts by multiples of three need to be done on the ALU, reducing the number of pipeline stages for the binary version substantially." It turns out I was wrong; Mike Haertel showed me that the number of instructions is all that will matter here on an Intel processor because of false dependencies.) On a superscalar RISC processor, I'd actually expect the first algorithm to do better./* Classic binary divide-and-conquer popcount. This is popcount_2() from http://en.wikipedia.org/wiki/Hamming_weight */ /* 15 ops, 3 long immediates, 14 stages */ static inline uint32_t popcount_2(uint32_t x) { uint32_t m1 = 0x55555555; uint32_t m2 = 0x33333333; uint32_t m4 = 0x0f0f0f0f; x -= (x >> 1) & m1; x = (x & m2) + ((x >> 2) & m2); x = (x + (x >> 4)) & m4; x += x >> 8; return (x + (x >> 16)) & 0x3f; }
The actual performance numbers are:
These numbers overall compare quite well with what you would predict based on the number of pipeline stages and arithmetic ops in the algorithms; they say astounding things about the hundreds of picoseconds (picoseconds!) an individual operation takes on a modern processor. While it isn't suitable for my immediate needs, check out the multiply-based implementation, which uses the multiply for the accumulation step at the end. Amazing that it's faster than a couple of shifts and adds on Intel's current flagship parts.$ ./popcount 1000000 popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter
Update: At the suggestion of Bill Trost, I added 8 and 16-bit table-driven popcount implementations. These perform the fastest in the benchmark test-stand, at about 3ns/iteration. They're about the same speed, so one would obviously use the 8-bit version.
The problem with this is that I don't believe it. It's amazing that Intel's L1 cache is so cheap. However, one can expect that the table lookup would occasionally miss this small cache during normal usage, which would be devastating to the performance of this approach. The tight loop here is guaranteeing us that the table becomes locked in cache. Maybe when I get some more energy I'll create a more realistic testbench, and we'll see how the table-driven algorithms compare to the pure-computational ones in 2008.
Update: At the suggestion of David Taht, Mike Haertel and I investigated the anomalously good performance of the divide-and-conquer popcounts on some 64-bit machines running gcc 3.4. Turns out that this version of gcc is smart enough to notice that the original version of the popcount code didn't have a loop dependency, and that all 64-bit machines have SSE: it was running four iterations of popcount in parallel on the MMX unit. :-) Fixed by introducing the obvious dependency.