Exploring Haskell concatMap

I've been interested recently in the performance of concatMap, a ubiquitous Haskell primitive for list manipulation. This started when I noticed that replacing concatMap with an open-coded version that inlines the mapped function sped up some of my code. Further investigation made me wonder why the current GHC Prelude defines concatMap as

concatMap f = foldr ((++) . f) []

instead of the rather more obvious

concatMap f = concat . map f
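
To see where the append nodes come from, here is a hand-expansion of the foldr form on a three-element list (my own derivation, not GHC output):

concatMap f [x1, x2, x3]
  = foldr ((++) . f) [] [x1, x2, x3]
  = f x1 ++ (f x2 ++ (f x3 ++ []))

(The concat . map f form builds the same chain of appends, just via concat.)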

I assumed it was performance, but in this case wouldn't getting rid of the explicit append nodes via

concatMap f = cmap where
    cmap [] = []
    cmap (x : xs) = accum (f x) where
        accum [] = cmap xs
        accum (y : ys) = y : accum ys

speed it up more? Or are the accum nodes just as bad as the append nodes? Oh, and what about the Stream Fusion version of the Data.List module in Hackage? Wasn't it supposed to do this?
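
For context, the stream-fusion version of Data.List works on a stream representation of lists rather than on lists directly, and defines concatMap over that representation. Here is a rough, self-contained sketch in the spirit of that approach (my own simplification for illustration, not the library's actual code):

{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Done | Skip s | Yield a s

data Stream a = forall s. Stream (s -> Step s a) s

stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where next []       = Done
        next (x : xs) = Yield x xs

unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where go s = case next s of
          Done       -> []
          Skip s'    -> go s'
          Yield x s' -> x : go s'

concatMapS :: (a -> Stream b) -> Stream a -> Stream b
concatMapS f (Stream next0 t0) = Stream next (t0, Nothing)
  where
    -- No inner stream in progress: pull the next element from the outer stream.
    next (t, Nothing) = case next0 t of
      Done       -> Done
      Skip t'    -> Skip (t', Nothing)
      Yield x t' -> Skip (t', Just (f x))
    -- Inner stream in progress: drain it, then return to the outer stream.
    next (t, Just (Stream nexti s)) = case nexti s of
      Done       -> Skip (t, Nothing)
      Skip s'    -> Skip (t, Just (Stream nexti s'))
      Yield y s' -> Yield y (t, Just (Stream nexti s'))

concatMapViaStreams :: (a -> [b]) -> [a] -> [b]
concatMapViaStreams f = unstream . concatMapS (stream . f) . stream

Whether the surrounding fusion rules can actually eliminate the intermediate structure here is a separate question; a fresh inner stream is started for every outer element, which is part of what makes concatMap an awkward case for fusion.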

In the end, with the help of Jules Kongslie and others, I created a little benchmark demo that shows that all the obvious options are about the same, with Stream Fusion giving a slight performance improvement. You can browse the benchmark code or check it out from Git.

The benchmarks show a funny thing. There's a weird performance anomaly, both in concatMap and in the data-generating function I built for the benchmark, that shows up on my home box (Intel Core 2 Duo 6400 2.13GHz, 2GB RAM, 2MB cache, GHC 6.8.2 and 6.8.3) but doesn't seem to be present on my work box (Intel Pentium 4 2.8GHz, 1GB RAM, 512KB cache, GHC 6.8.2). It doesn't seem to be related to memory usage: the GHC heap profiler says that the memory usage of the benchmark is under 20kB. This graph

[graph: runtimes for concatMap variants]

shows the issue. The benchmark first generates a list of lists containing the numbers 1..n. In this graph, n is 100,000,000. Each sublist is a list of k successive integers (except the last, of course, which may be short). The function concatMap id is applied to the resulting list of lists, and the resulting list of integers is folded together with xor. In the graph, the X axis is k, and the Y axis is runtime in seconds on my home box.
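
A minimal sketch of this kernel, with hypothetical names (the real code is in the repository linked above):

import Data.Bits (xor)
import Data.List (foldl')

-- Split a list into sublists of k successive elements; the last may be short.
-- Assumes k > 0.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = ys : chunksOf k zs
  where (ys, zs) = splitAt k xs

-- Build the list of lists, flatten it with the concatMap variant under test
-- (here the Prelude's, applied to id), and combine the integers with xor.
benchmark :: Int -> Int -> Int
benchmark n k = foldl' xor 0 (concatMap id (chunksOf k [1 .. n]))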

  • The plot for 'none' shows the times for just running the data-generating and combining functions without any concatMap, which forms a baseline.
  • The plot for 'obvious' shows the times for running the benchmark with the "obvious" concatMap function (concat . map f).
  • The plot for 'obvstream' shows the times for 'obvious' with stream fusion.
  • The plot for 'open' shows the times for running the benchmark with the open-coded concatMap shown above.
  • The plots for 'openinline' and 'openinlineall' show the times for running the open benchmark with varying degrees of hand-inlining (see the sketch after this list).
  • The plot for 'std' shows the standard built-in Haskell concatMap.
  • The plot for 'stdstream' shows 'std' with stream fusion.
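
As an illustration of the hand-inlining behind the 'openinline' plots: the benchmark maps id, so inlining the mapped function into the open-coded version collapses it to roughly the following (a guess at the shape of those variants, not necessarily the code in the repository):

concatId :: [[a]] -> [a]
concatId [] = []
concatId (x : xs) = accum x
  where accum []       = concatId xs
        accum (y : ys) = y : accum ys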

Note the weird jump at around k = 10,000 in the plot. At worst, the concatMap functions are more than 4x slower than one would expect from the form of the rest of the graph. Something obviously happens here. I'm really curious what it is.