My first OCaml syntax extension

Preface: In this post, I describe my adventures figuring out how to write a syntax extension for the OCaml programming language and attempt to provide something of a tutorial on writing a basic extension. I assume that you’re somewhat familiar with basic parsing technology and context-free grammars — if not, a good tutorial on parser construction with a tool like Yacc would be worth a read first.

One of the oft-touted benefits of OCaml is Camlp4, a pre-processor that facilitates extending the OCaml syntax to provide natural support for various constructions. This has been used for a variety of purposes, such as database type-checking, monad sugaring, and logging. In the hands of a capable author, a variety of wonders can be introduced to the OCaml language.

I’ve used syntax extensions for some time now, particularly PGOCaml and pa_lwt, to make much life with OCaml easier. I’d never written one, however, and found the documentation and other relevant material rather intimidating. Camlp4 documentation is somewhat hard to find, particularly for the current version (with OCaml 3.10, they made significant backwards-incompatible changes to Camlp4; much of the available tutorial and reference material was thus somewhat obsolete). The documentation that was around I find difficult to start with, particularly since I want to understand what the code I write does and not just cargo-cult it.

But I finally bit the bullet and learned. And when all was said and done, I have 13 lines of code which provide a small sugar — sort of a minimal syntax extension. This extension provides pattern matching over lazy lists, much like llists but far simpler (and based on the Batteries lazy list module). Here it is, in its entirety, and then I’ll explain how it works and what’s needed to get stared with the bare basics of extending OCaml syntax:

open Camlp4.PreCast;;
open Syntax;;

EXTEND Gram
  GLOBAL: patt ;

  patt: LEVEL "simple"
      [ [ "[%"; "]" -> <:patt< lazy BatLazyList.Nil >> ] ];
  patt: LEVEL "::"
      [ [ p1 = SELF; "%::"; p2 = SELF ->
            <:patt< lazy BatLazyList.Cons($p1$,$p2$) >> ] ] ;
END

First, a bit of review on what Camlp4 is and does. According to its wiki page, its name stands for the Caml Pre-Processor and Pretty Printer. It has two main components (that we care about): an extensible parser and an output or pretty-printing facility. It also supports filters and transformations in the middle. You can think about the extensible parser portion of it as being a form of a parser generator like Yacc or Menhir, but it allows grammars to be extended by loading more modules. So when we use an OCaml syntax extension, we tell camlp4 to load the base OCaml grammar, load modules providing the extensions we want, and load a printer producing output that the compiler can compile. Writing a syntax extension, therefore, is a matter of writing a module which extends the base grammar with additional productions and nonterminals to realize the desired syntax.

Back to the code. The first two lines are just preamble stuff to make the Camlp4 APIs available. Camlp4.PreCast provides the core facilities needed to extend the OCaml grammar in Camlp4 (I’m not sure what all it does, or why it’s called PreCast; the Gallium tutorial provides some info). Syntax makes OCaml syntax elements available (such as what we’re looking for — the patt production in the context-free grammar defining the OCaml syntax, responsible for producing pattern-matching constructs).

After that comes the meat of the program. The basic operation to add new syntax to OCaml is to “extend” the grammar (surprise, surprise). So what we’re going to do is extend the basic OCaml grammar (defined by the module Gram) with two new productions for the patt nonterminal (new pattern matching constructs!). The GLOBAL line seems to tell the parser generator logic that patt is a global nonterminal (as opposed to a local one, which I don’t fully understand but presume is a nonterminal used in a particular extension which won’t clobber another local nonterminal by the same name in another extension). To do this, we say

EXTEND Gram
    GLOBAL: patt ;

Next, we add our two productions. The first oddity is the LEVEL directive. The Camlp4 parser supports multiple “levels” of productions for a nonterminal. These levels can be used to support things like operator precedence (by having it try to match addition before multiplication — thinking top-down, that’s the way you want it). It tries all the productions in one level before trying any in the next level. By looking at the base OCaml grammar in the Camlp4OcamlParser file in the OCaml source tree, you can find the nonterminals and levels where your new syntax should be inserted. In this case, we want to add two productions in pattern matches: recognize [%] as the null lazy list, and recognize %:: as the lazy list cons operator. So we put a production patt -> [%] on the simple level, used for basic objects, and the patt -> patt %:: patt production at the same level as the normal cons operator ::.

patt: LEVEL "simple"
    [ [ "[%"; "]" -> (* action *) ] ] ;
patt: LEVEL "::"
    [ [ p1 = SELF; "%::"; p2 = SELF -> (* action *) ] ] ;

The productions are written in a fairly straightforward fashion. We write things that look like lists of lists of productions (the code is itself parsed with a syntax extension providing extensible parser constructs, so this works). It is effectively a list of levels, each of which is a list of productions. We ignore that for now, and just use a list containing a single level definition (which could contain multiple productions). So the first one adds [%] to the simple level; it’s added as two terminals because the lexer splits [%] into two tokens [% and ]; this isn’t a problem, we’ll just recognize them in sequence, but it does allow [% (* foo *) ] to stand for lazy nil. For other production, we allow two patterns (identified as p1 and p2, both produced via the patt nonterminal referenced as SELF) separated by the terminal %::.

Of course, with our productions we need semantic actions. The semantic actions are on the right-hand side of the arrow (->) in the production definitions. What these actions have to accomplish in our context is return the appropriate plain OCaml code for the new constructs. We do this via another Camlp4 feature called quotations. In this context, they let us generate new pieces of parsed OCaml syntax (defined by the abstract syntax tree) by writing out the OCaml code rather than constructing AST nodes. So, we use the patt quotation, which I think tells Camlp4 that we’re constructing AST nodes to go inside a pattern match construct, and write out the real OCaml code referencing the relevant constructors in the BatLazyList module. [%] becomes a match against BatLazyList.Nil, using OCaml 3.11’s lazy pattern match keyword to force lazy values transparently:

patt: LEVEL "simple"
    [ [ "[%"; "]" -> <:patt< lazy BatLazyList.Nil >> ] ]

%:: translates to a match with BatLazyList.Cons. The $p1$ and $p2$ expressions substitute the matched head and tail patterns into the abstract syntax we’re generating:

patt: LEVEL "::"
    [ [ p1 = SELF; "%::"; p2 = SELF ->
          <:patt< lazy BatLazyList.Cons($p1$,$p2$) >> ] ] ;

And that’s it (well, with the closing END). Build it with the following command:

ocamlc -c -I +camlp4 -pp camlp4of pa_llist.ml

and then use it to pre-process and print a source file and see the new constructs get expanded:

camlp4 -I +camlp4 -parser o -parser op -printer o pa_llist.cmo test_file.ml

A few final notes:

Hopefully this helps someone break the ice with syntax extensions. Once you get through what needs to be done, they seem to be really quite simple to create. For further reading, see the documentation and tutorials on the Camlp4 wiki page, as well as Martin Jambon’s tutorial based on the old Camlp4 (but still helpful for learning the concepts), the llists package which I used to get some idea of how to do this lazy list thing, and the files in the camlp4/examples directory in the OCaml source distribution.

Finally finally: I’m still learning this stuff. If I’ve gotten something wrong, please let me know.