My first OCaml syntax extension
Preface: In this post, I describe my adventures figuring out how to write a syntax extension for the OCaml programming language and attempt to provide something of a tutorial on writing a basic extension. I assume that you’re somewhat familiar with basic parsing technology and context-free grammars — if not, a good tutorial on parser construction with a tool like Yacc would be worth a read first.
One of the oft-touted benefits of OCaml is Camlp4, a pre-processor that facilitates extending the OCaml syntax to provide natural support for various constructions. This has been used for a variety of purposes, such as database type-checking, monad sugaring, and logging. In the hands of a capable author, a variety of wonders can be introduced to the OCaml language.
I’ve used syntax extensions for some time now, particularly PGOCaml and pa_lwt, to make much life with OCaml easier. I’d never written one, however, and found the documentation and other relevant material rather intimidating. Camlp4 documentation is somewhat hard to find, particularly for the current version (with OCaml 3.10, they made significant backwards-incompatible changes to Camlp4; much of the available tutorial and reference material was thus somewhat obsolete). The documentation that was around I find difficult to start with, particularly since I want to understand what the code I write does and not just cargo-cult it.
But I finally bit the bullet and learned. And when all was said and done, I have 13 lines of code which provide a small sugar — sort of a minimal syntax extension. This extension provides pattern matching over lazy lists, much like llists but far simpler (and based on the Batteries lazy list module). Here it is, in its entirety, and then I’ll explain how it works and what’s needed to get stared with the bare basics of extending OCaml syntax:
open Camlp4.PreCast;;
open Syntax;;
EXTEND Gram
GLOBAL: patt ;
patt: LEVEL "simple"
[ [ "[%"; "]" -> <:patt< lazy BatLazyList.Nil >> ] ];
patt: LEVEL "::"
[ [ p1 = SELF; "%::"; p2 = SELF ->
<:patt< lazy BatLazyList.Cons($p1$,$p2$) >> ] ] ;
END
First, a bit of review on what Camlp4 is and does. According to its wiki page, its name stands for the Caml Pre-Processor and Pretty Printer. It has two main components (that we care about): an extensible parser and an output or pretty-printing facility. It also supports filters and transformations in the middle. You can think about the extensible parser portion of it as being a form of a parser generator like Yacc or Menhir, but it allows grammars to be extended by loading more modules. So when we use an OCaml syntax extension, we tell camlp4 to load the base OCaml grammar, load modules providing the extensions we want, and load a printer producing output that the compiler can compile. Writing a syntax extension, therefore, is a matter of writing a module which extends the base grammar with additional productions and nonterminals to realize the desired syntax.
Back to the code. The first two lines are just preamble stuff to make
the Camlp4 APIs available. Camlp4.PreCast
provides the core
facilities needed to extend the OCaml grammar in Camlp4 (I’m not sure
what all it does, or why it’s called PreCast; the Gallium
tutorial provides some info). Syntax
makes OCaml syntax
elements available (such as what we’re looking for — the
patt
production in the context-free grammar defining the
OCaml syntax, responsible for producing pattern-matching
constructs).
After that comes the meat of the program. The basic operation to add
new syntax to OCaml is to “extend” the grammar (surprise, surprise). So
what we’re going to do is extend the basic OCaml grammar (defined by the
module Gram
) with two new productions for the
patt
nonterminal (new pattern matching constructs!). The
GLOBAL
line seems to tell the parser generator logic that
patt
is a global nonterminal (as opposed to a local one,
which I don’t fully understand but presume is a nonterminal used in a
particular extension which won’t clobber another local nonterminal by
the same name in another extension). To do this, we say
EXTEND Gram
GLOBAL: patt ;
Next, we add our two productions. The first oddity is the
LEVEL
directive. The Camlp4 parser supports multiple
“levels” of productions for a nonterminal. These levels can be used to
support things like operator precedence (by having it try to match
addition before multiplication — thinking top-down, that’s the way you
want it). It tries all the productions in one level before trying any in
the next level. By looking at the base OCaml grammar in the
Camlp4OcamlParser
file in the OCaml source tree, you can
find the nonterminals and levels where your new syntax should be
inserted. In this case, we want to add two productions in pattern
matches: recognize [%]
as the null lazy list, and recognize
%::
as the lazy list cons operator. So we put a production
patt -> [%]
on the simple
level, used for
basic objects, and the patt -> patt %:: patt
production
at the same level as the normal cons operator ::
.
patt: LEVEL "simple"
[ [ "[%"; "]" -> (* action *) ] ] ;
patt: LEVEL "::"
[ [ p1 = SELF; "%::"; p2 = SELF -> (* action *) ] ] ;
The productions are written in a fairly straightforward fashion. We
write things that look like lists of lists of productions (the code is
itself parsed with a syntax extension providing extensible parser
constructs, so this works). It is effectively a list of levels, each of
which is a list of productions. We ignore that for now, and just use a
list containing a single level definition (which could contain multiple
productions). So the first one adds [%]
to the
simple
level; it’s added as two terminals because the lexer
splits [%]
into two tokens [%
and
]
; this isn’t a problem, we’ll just recognize them in
sequence, but it does allow [% (* foo *) ]
to stand for
lazy nil. For other production, we allow two patterns (identified as
p1
and p2
, both produced via the
patt
nonterminal referenced as SELF
) separated
by the terminal %::
.
Of course, with our productions we need semantic actions. The
semantic actions are on the right-hand side of the arrow
(->
) in the production definitions. What these actions
have to accomplish in our context is return the appropriate plain OCaml
code for the new constructs. We do this via another Camlp4 feature
called quotations.
In this context, they let us generate new pieces of parsed OCaml syntax
(defined by the abstract syntax tree) by writing out the OCaml code
rather than constructing AST nodes. So, we use the patt
quotation, which I think tells Camlp4 that we’re constructing AST nodes
to go inside a pattern match construct, and write out the real OCaml
code referencing the relevant constructors in the
BatLazyList
module. [%]
becomes a match
against BatLazyList.Nil
, using OCaml 3.11’s
lazy
pattern match keyword to force lazy values
transparently:
patt: LEVEL "simple"
[ [ "[%"; "]" -> <:patt< lazy BatLazyList.Nil >> ] ]
%::
translates to a match with
BatLazyList.Cons
. The $p1$
and
$p2$
expressions substitute the matched head and tail
patterns into the abstract syntax we’re generating:
patt: LEVEL "::"
[ [ p1 = SELF; "%::"; p2 = SELF ->
<:patt< lazy BatLazyList.Cons($p1$,$p2$) >> ] ] ;
And that’s it (well, with the closing END
). Build it
with the following command:
ocamlc -c -I +camlp4 -pp camlp4of pa_llist.ml
and then use it to pre-process and print a source file and see the new constructs get expanded:
camlp4 -I +camlp4 -parser o -parser op -printer o pa_llist.cmo test_file.ml
A few final notes:
- You can build the syntax extension with
ocamlfind
. thecamlp4.extend
andcamlp4.quotations.o
packages are needed, along with the-syntax camlp4o
option. - It looked to me like it should be possible to put the level inside the list when defining my new productions, but it threw a warning and didn’t work. I’m not sure what’s up with that.
Hopefully this helps someone break the ice with syntax extensions.
Once you get through what needs to be done, they seem to be really quite
simple to create. For further reading, see the documentation and
tutorials on the Camlp4
wiki page, as well as Martin
Jambon’s tutorial based on the old Camlp4 (but still helpful for
learning the concepts), the llists package
which I used to get some idea of how to do this lazy list thing, and the
files in the camlp4/examples
directory in the OCaml source
distribution.
Finally finally: I’m still learning this stuff. If I’ve gotten something wrong, please let me know.