Tuning the OCaml memory allocator for large data processing jobs
OCAMLRUNPARAM=s=4M,i=32M,o=150 can make your OCaml
programs run faster. Read on for details, and for how to tell whether
the garbage collector is thrashing and thereby slowing down your program.
In my research work with GroupLens, I do most of my coding for data processing, algorithm implementation, etc. in OCaml. Sometimes I have to suffer a bit for this when some nice library doesn’t have OCaml bindings, but in general it works out fairly well. And every time I go to do some refactoring, I am reminded why I’m not coding in Python.
One thing I have found, however, is that the default OCaml garbage
collector parameters are not very well-suited for much of my work —
frequently long-running data processing tasks building and
manipulating large, often persistent¹ data structures.
The program will run somewhat slowly (although there usually isn’t
anything to compare it against), but more importantly, profiling with
gprof will reveal that the program is spending a substantial amount
of its time (~30% or more) in the OCaml garbage collector, frequently
concentrated (if memory serves) in a single collection function.
The OCaml memory system
OCaml manages memory through a pair of heaps called the minor and major heaps. For more details, I refer you to the excellent OCaml internals articles by Richard WM Jones. The basic idea is that most allocations happen on the minor heap, which is a small heap supporting very fast allocation. Once the minor heap runs out of space, a minor collection is triggered, which copies everything on the minor heap that’s still in use off to the major heap and resets the minor heap to empty. In conjunction with this, it will do a major slice, which is one step of the garbage collection process for the major heap.
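This division of labor is easy to observe from within a program via the standard library’s Gc module. A minimal sketch (the allocation-heavy workload is made up for illustration):

```ocaml
(* Gc.quick_stat reads the collection counters cheaply, without
   walking the heap. *)
let report label =
  let s = Gc.quick_stat () in
  Printf.printf "%s: %d minor collections, %d major collections, %.0f words promoted\n"
    label s.Gc.minor_collections s.Gc.major_collections s.Gc.promoted_words

let () =
  report "before";
  (* Allocate many short-lived lists; most die young in the minor heap
     and are never promoted to the major heap. *)
  for _ = 1 to 10_000 do
    ignore (List.init 100 (fun i -> i * i))
  done;
  report "after"
```

The gap between words allocated and words promoted is the minor heap doing its job: short-lived values are reclaimed by cheap minor collections before they ever reach the major heap.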
This is an excellent design for functional programming languages. As values are built up, torn down, and rebuilt, a program allocates many small values only to throw them away shortly afterward. The minor-major heap design makes those allocations fast, reserving the slower major-heap machinery for values that will stick around a bit longer.
Optimizing the minor and major collection rates
Problems can arise when you’re building up ephemeral data structures which are larger than the minor heap. The data structure won’t stay around overly long, but it is a bit too large. Relatedly, if the program is building a large in-memory data set, triggering major GC slices more often than necessary causes static data to be walked and re-walked. Therefore, I frequently see notable performance gains in my programs by changing the minor heap size from its default of 32K words (256KB on 64-bit systems) to 1M or 4M words (8MB or 32MB). This lets the program build significantly larger data structures that never have to see the major heap, and decreases the rate at which minor collections and major slices are triggered.
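The effect is straightforward to demonstrate with the Gc module; a rough sketch with a made-up workload (the churn uses small lists because values over 256 words are allocated directly on the major heap and would bypass the minor heap entirely):

```ocaml
(* Run the same allocation-heavy workload with the default minor heap
   and again with a 4M-word minor heap, counting minor collections. *)
let churn () =
  for _ = 1 to 100_000 do
    ignore (List.init 100 (fun i -> i))
  done

let collections_during f =
  let before = (Gc.quick_stat ()).Gc.minor_collections in
  f ();
  (Gc.quick_stat ()).Gc.minor_collections - before

let () =
  let small = collections_during churn in
  (* Enlarge the minor heap to 4M words (32MB on a 64-bit machine). *)
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = 4 * 1024 * 1024 };
  let large = collections_during churn in
  Printf.printf "minor collections: %d (default) vs %d (4M-word minor heap)\n"
    small large
```

With the larger minor heap, the same workload triggers far fewer minor collections (and hence far fewer accompanying major slices).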
The major collector can be further tuned by changing the space overhead parameter. This option effectively allows you to adjust the tradeoff between wasted memory and time spent garbage-collecting. If you’ve got a bunch of extra memory, setting this to something larger can decrease the amount of work the GC does on an ongoing basis. I frequently set it to 120 or 200 (the default is 80); we have a compute machine with lots of RAM.
The final parameter I tweak is the major heap increment. This controls how much memory is added to the major heap each time it needs to grow. The default is 124K words (just under 1MB). If your program is ultimately going to use 2+ GB of memory, getting there 1MB at a time is a slow process. I frequently increase this a fair amount (at least 32M words, sometimes 64M or 96M words).
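If you would rather bake the settings into the program than remember an environment variable, all three parameters can be set at startup through the Gc module. A minimal sketch mirroring the OCAMLRUNPARAM string from the top of the post:

```ocaml
(* Apply s=4M, i=32M, o=150 from inside the program, before the main
   workload starts. The sizes are in words, not bytes. *)
let () =
  Gc.set { (Gc.get ()) with
           Gc.minor_heap_size = 4 * 1024 * 1024;     (* s: 32MB on 64-bit *)
           major_heap_increment = 32 * 1024 * 1024;  (* i: grow the heap in big steps *)
           space_overhead = 150 }                    (* o: trade memory for less GC work *)
```

The environment-variable route has the advantage of not requiring a recompile, which makes it handy for experimenting with different values on the same binary.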
Measuring memory performance
Any performance tuning is a process that should be performed with
measurement and not just cargo-cult coding or parameter tweaking, and
memory allocation performance is not an exception. Fortunately, the
OCaml native-code compilers and runtime play well with gprof, the
standard profiler included with GNU binutils. If you think your
program is running slower than it should, or just wish it would run
faster, build a profiling version of it (pass the -p flag to
ocamlopt) and run
gprof over it. The big thing to look
for is the percentage of time your program spends in various garbage
collection functions (major GC slices, minor GC cycles, and heap
expansion in particular). This time is time that is not spent doing
your actual computation. Also, the time command is invaluable for seeing how
long the overall operation takes. Use it to compare total run times
before and after adjusting the GC parameters.
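Alongside gprof, the runtime itself can report what the collector did. A small sketch that dumps cumulative GC statistics when the program exits:

```ocaml
(* Register a handler that prints GC counters on exit. Comparing the
   minor/major collection counts across runs with different parameters
   shows whether the tuning is actually reducing GC work. *)
let () =
  at_exit (fun () ->
    let s = Gc.quick_stat () in
    Printf.eprintf "minor collections: %d, major collections: %d\n"
      s.Gc.minor_collections s.Gc.major_collections;
    (* Full statistics, including heap size and words promoted. *)
    Gc.print_stat stderr)
```

This costs essentially nothing and needs no recompilation with profiling flags, so it is an easy first check before reaching for gprof.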
I’ve found that tweaking the GC parameters can make my programs, particularly ones working on large data sets for extended periods of time, run faster and spend less of their time on garbage collection. I’m sure there’s a point at which further increases in these parameters slow things back down, or at least result in inordinate memory wastage. I have not done extensive testing to determine optimal values for various workloads; I’ve just found that these tweaks speed up my programs. It would be interesting to see performance curves for various GC parameters on various workloads.
Memory is cheap; researcher time waiting for programs to take an extra hour (or day) to finish is not so cheap.
¹ Persistent in the functional programming sense, not the stored-to-disk sense.