Exploring Haskell concatMap
I've been interested recently in the performance of concatMap, a ubiquitous Haskell primitive for list manipulation. This started when I noticed that replacing concatMap with an open-coded version with the mapped function inlined was speeding up some of my code. Further investigation made me wonder why the current GHC Prelude defines concatMap as
concatMap f = foldr ((++) . f) []
instead of the rather more obvious
concatMap f = concat . map f
I assumed it was performance, but in this case wouldn't getting rid of the explicit append nodes via
concatMap f = cmap where
cmap [] = []
cmap (x : xs) = accum (f x) where
accum [] = cmap xs
accum (y : ys) = y : accum ys
speed it up more? Or are the accum nodes just as bad as the append nodes? Oh, and what about the Stream Fusion version of the Data.List module in Hackage? Wasn't it supposed to do this?
In the end, with the help of Jules Kongslie and others, I created a little benchmark demo that shows that all the obvious options are about the same, with Stream Fusion giving a slight performance improvement. You can browse the benchmark code or check it out from Git.
The benchmarks show a funny thing. There's a weird
performance anomaly in concatMap and in the
data-generating function that I built for the benchmark on
my home box (Intel Core 2 Duo 6400 2.13GHz, 2GB RAM, 2MB
cache, GHC 6.8.2, 6.8.3) that doesn't seem to be present on
my work box (Intel Pentium 4 2.8GHz, 1GB RAM, 512KB cache,
GHC 6.8.2). It doesn't seem to be related to memory usage:
the GHC heap profiler says that the memory usage of the
benchmark is under 20kB. This graph

shows the issue. The benchmark first generates a list of
lists containing the numbers 1..n. In this graph, n is
100,000,000. Each sublist is a list of k successive
integers (except the last, of course, which may be short).
The function concatMap id is applied to the resulting
list, and a combining xor is mapped over the resulting
list of integers. In the graph, the X axis is k, and the
Y axis is runtime in seconds on my home box.
- The plot for
'
none' shows the times for just running the data-generating and combining functions without anyconcatMap, which forms a baseline. - The plot for '
obvious' shows the times for running the benchmark with the "obvious"concatMapfunction(concat . map f). - The plot for '
obvstream' shows the times forobviouswith stream fusion. - The plot for '
open' shows the times for running the benchmark with the open-codedconcatMapshown above. - The plots for '
openinline' and 'openinlineall' show the times for running theopenbenchmark with varying degrees of hand-inlining. - The plot for '
std' shows the standard builtin haskellconcatMap. - The plot for '
stdstream' showsstdwith stream fusion.
Note the weird jump at around k = 10,000 in the plot. At
worst, the concatMap functions are more than 4x slower
than one would expect from the form of the rest of the
graph. Something obviously happens here. I'm really
curious what it is.