Blog Articles 176–180

Tuning the OCaml memory allocator for large data processing jobs

TL;DR: setting OCAMLRUNPARAM=s=4M,i=32M,o=150 can make your OCaml programs run faster. Read on for details and how to see if the garbage collector is thrashing and thereby slowing down your program.
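As a rough sketch, the same settings can also be applied from inside a program using the standard library's Gc module instead of the environment variable. The names `mega` and `tuned` are made up for this illustration. The mapping assumes that `s`, `i`, and `o` correspond to the `minor_heap_size`, `major_heap_increment`, and `space_overhead` fields, with sizes counted in words.

```ocaml
(* Programmatic counterpart of OCAMLRUNPARAM=s=4M,i=32M,o=150.
   Sizes are in words; the M suffix in OCAMLRUNPARAM means 2^20. *)
let mega = 1024 * 1024

(* [tuned] is a hypothetical helper: it returns a copy of the given
   GC settings with the three parameters from the TL;DR overridden. *)
let tuned (c : Gc.control) : Gc.control =
  { c with
    Gc.minor_heap_size = 4 * mega;        (* s=4M *)
    Gc.major_heap_increment = 32 * mega;  (* i=32M *)
    Gc.space_overhead = 150 }             (* o=150 *)

(* Apply the settings on top of whatever the runtime currently uses. *)
let () = Gc.set (tuned (Gc.get ()))
```

Setting the environment variable has the advantage of needing no recompilation, so it is probably the better way to experiment; the in-program form is mainly useful once a set of values has proven itself. The OCaml 5 documentation describes `major_heap_increment` as unused, so on those runtimes only the other two fields should be expected to have an effect.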

In my research work with GroupLens, I do most of my coding for data processing, algorithm implementation, etc. in OCaml. Sometimes I have to suffer a bit for this when some nice library doesn’t have OCaml bindings, but in general it works out fairly well. And every time I go to do some refactoring, I am reminded why I’m not coding in Python.

One thing I have found, however, is that the default OCaml garbage collector parameters are not very well-suited for much of my work, which frequently consists of long-running data processing tasks that build and manipulate large, often persistent data structures. The program will run somewhat slowly (although there usually isn’t anything to compare it against), but more importantly, profiling with gprof will reveal that my program is spending a substantial amount of its time (~30% or more) in the OCaml garbage collector (if memory serves, frequently in the function caml_gc_major_slice).
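Alongside a profiler, the standard Gc module exposes collection counters that give a quick, if coarse, view of how hard the collector is working. The following is a minimal sketch only: `with_gc_report`, `IntMap`, and `build` are hypothetical names, and the workload is an invented stand-in for a real data processing job.

```ocaml
(* Run [f], then print how many minor and major collections happened
   while it ran, along with the major heap size afterwards. *)
let with_gc_report label f =
  let before = Gc.quick_stat () in
  let result = f () in
  let after = Gc.quick_stat () in
  Printf.printf "%s: %d minor, %d major collections, %d heap words\n"
    label
    (after.Gc.minor_collections - before.Gc.minor_collections)
    (after.Gc.major_collections - before.Gc.major_collections)
    after.Gc.heap_words;
  result

(* Hypothetical workload: build a large persistent map. *)
module IntMap = Map.Make (struct
  type t = int
  let compare (a : int) (b : int) = compare a b
end)

let build n =
  let rec go i m =
    if i = n then m else go (i + 1) (IntMap.add i (i * i) m)
  in
  go 0 IntMap.empty

let () =
  let m = with_gc_report "build" (fun () -> build 200_000) in
  assert (IntMap.cardinal m = 200_000)
```

Running the same job under different OCAMLRUNPARAM settings and comparing the reported counts, together with wall-clock time, gives a cheap first indication of whether the collector is the bottleneck. The counts alone do not say how much time each collection took, so a profiler is still the more direct measurement.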

My first OCaml syntax extension

Preface: In this post, I describe my adventures figuring out how to write a syntax extension for the OCaml programming language and attempt to provide something of a tutorial on writing a basic extension. I assume that you’re somewhat familiar with basic parsing technology and context-free grammars — if not, a good tutorial on parser construction with a tool like Yacc would be worth a read first.

One of the oft-touted benefits of OCaml is Camlp4, a pre-processor that facilitates extending the OCaml syntax to provide natural support for various constructions. This has been used for a variety of purposes, such as database type-checking, monad sugaring, and logging. In the hands of a capable author, a variety of wonders can be introduced to the OCaml language.

I’ve used syntax extensions for some time now, particularly PGOCaml and pa_lwt, to make life with OCaml much easier. I’d never written one, however, and found the documentation and other relevant material rather intimidating. Camlp4 documentation is somewhat hard to find, particularly for the current version (with OCaml 3.10, they made significant backwards-incompatible changes to Camlp4; much of the available tutorial and reference material was thus somewhat obsolete). The documentation that was around I found difficult to start with, particularly since I wanted to understand what the code I write does and not just cargo-cult it.

But I finally bit the bullet and learned. And when all was said and done, I had 13 lines of code which provide a small sugar, a sort of minimal syntax extension. This extension provides pattern matching over lazy lists, much like llists but far simpler (and based on the Batteries lazy list module). Here it is, in its entirety, and then I’ll explain how it works and what’s needed to get started with the bare basics of extending OCaml syntax:

Object-Oriented Spaghetti

Note: since writing this essay in 2007, my understanding of object-oriented programming and of separation of concerns has evolved substantially. I think that some of the concerns I raised in this essay are still valid, and that it is quite easy to create unreadable messes of objects, but I no longer hold to as strong a version of the final conclusion.

A long time ago, Simula was created. From it came Smalltalk, and C++, followed by Java and a host of other languages sporting this new programming paradigm: object-oriented programming. Objects are everywhere (most new/modern languages, at least in the mainstream, are based on them) and are used for everything. In Java, all the core data structures are implemented in an object-oriented fashion.

I’m not convinced that all this is a good thing. In fact, I submit that excessive use of object-oriented principles leads to a new kind of spaghetti code, rendering programs perhaps as unreadable as when implemented with unscrupulous GOTOs. OK, maybe not quite, but it can still be pretty bad.

An important facet of programming and abstraction design is separation of concerns. Separation of concerns is the idea that different concerns or aspects of a program should be kept separate. One example would be separating the type-checking logic of an evaluator from actual evaluation logic. Or separating the business logic from the report generation in a business application.
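To make the evaluator example concrete, here is a minimal sketch in OCaml of a toy expression language in which type checking and evaluation live in separate functions. Everything in it (`expr`, `ty`, `value`, `typecheck`, `eval`, `run`) is invented for illustration and is not taken from any real evaluator.

```ocaml
type expr =
  | Int of int
  | Bool of bool
  | Add of expr * expr
  | If of expr * expr * expr

type ty = TInt | TBool

(* Concern 1: type checking. Knows nothing about computing values. *)
let rec typecheck = function
  | Int _ -> Ok TInt
  | Bool _ -> Ok TBool
  | Add (a, b) ->
    (match typecheck a, typecheck b with
     | Ok TInt, Ok TInt -> Ok TInt
     | _ -> Error "Add expects two ints")
  | If (c, t, e) ->
    (match typecheck c, typecheck t, typecheck e with
     | Ok TBool, Ok t1, Ok t2 when t1 = t2 -> Ok t1
     | _ -> Error "If expects a bool condition and matching branches")

type value = VInt of int | VBool of bool

(* Concern 2: evaluation. Assumes the expression already type-checked. *)
let rec eval = function
  | Int n -> VInt n
  | Bool b -> VBool b
  | Add (a, b) ->
    (match eval a, eval b with
     | VInt x, VInt y -> VInt (x + y)
     | _ -> invalid_arg "eval: ill-typed Add")
  | If (c, t, e) ->
    (match eval c with
     | VBool true -> eval t
     | VBool false -> eval e
     | VInt _ -> invalid_arg "eval: ill-typed If")

(* The only place where the two concerns meet. *)
let run e =
  match typecheck e with
  | Ok _ -> Ok (eval e)
  | Error msg -> Error msg
```

With this split, `run (Add (Int 1, Int 2))` type-checks and then evaluates, while `run (Add (Int 1, Bool true))` is rejected by `typecheck` before `eval` is ever called. Either function can be read, tested, or replaced without touching the other.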

Java and Pieces

Note: This article was originally a blog post entitled “Java stinks. Really.” I have since come to find Java a very good platform, and the Java language a reasonable and comfortable, if verbose, language to work in, so I no longer have the sentiment that it stinks. That said, the core criticism I make in this article still stands, and makes Java less useful in certain situations, and possibly less desirable for some programmers. The text, however, is unchanged.

I’ve never been a fan of those “Why XYZ is better than ABC language” posts that crop up all over the Internet. Usually, as soon as one is posted, someone else comes along and says that the first poster doesn’t have a clue, and frequently they’re right.

Also, I’ve been apprehensive of people’s attempts to compare Java and C++. I’ve said for some time that anyone who says that Java is just like C++ doesn’t know C++ and probably doesn’t know Java, and I’m still sticking by that. They’ve got syntactic similarity (a lot, in fact), but their semantic similarity (which is what I believe is actually relevant in language comparison) is slim. Java is much better compared for similarities with Python or Objective-C, although it is stricter than either of those languages (take my Objective-C statements with a grain of salt; I’ve only dabbled in and read about the language without actually using it for anything).

Lastly, I consider Java to be a decent language from a design perspective. It is extremely clean (to a fault) and has simple semantics (again, to a fault). It takes care of many messy things for the programmer, and has a large and largely-useful library base.

E-mail Signatures (GnuPG/PGP)

In God we trust—all others must submit an X.509 certificate.

 — Charles Forsythe

If you’ve gotten an e-mail from me recently, you’ve likely noticed a strange attachment accompanying it. Perhaps you’ve even reached this page from my signature, wondering what that file is and what you’re supposed to do with it. This page will serve to explain what these files are, and why they’re a good idea.

I cryptographically sign my e-mails (well, most of them anyway). It is a way of providing proof that I am the author of a message, and a way to verify that you have received an authentic message from me. Further, I encourage everyone to sign their e-mails; I will also willingly accept encrypted e-mail. Information on obtaining my keys is provided at the end of this document.