Preface: In this post, I describe my adventures figuring out how to write a syntax extension for the OCaml programming language and attempt to provide something of a tutorial on writing a basic extension. I assume that you're somewhat familiar with basic parsing technology and context-free grammars --- if not, a good tutorial on parser construction with a tool like Yacc would be worth a read first.
One of the oft-touted benefits of OCaml is Camlp4, a pre-processor that facilitates extending the OCaml syntax to provide natural support for various constructions. This has been used for a variety of purposes, such as database type-checking, monad sugaring, and logging. In the hands of a capable author, a variety of wonders can be introduced to the OCaml language.
I've used syntax extensions for some time now, particularly PGOCaml and pa_lwt, to make much life with OCaml easier. I'd never written one, however, and found the documentation and other relevant material rather intimidating. Camlp4 documentation is somewhat hard to find, particularly for the current version (with OCaml 3.10, they made significant backwards-incompatible changes to Camlp4; much of the available tutorial and reference material was thus somewhat obsolete). The documentation that was around I find difficult to start with, particularly since I want to understand what the code I write does and not just cargo-cult it.
But I finally bit the bullet and learned. And when all was said and done, I have 13 lines of code which provide a small sugar --- sort of a minimal syntax extension. This extension provides pattern matching over lazy lists, much like llists but far simpler (and based on the Batteries lazy list module). Here it is, in its entirety, and then I'll explain how it works and what's needed to get stared with the bare basics of extending OCaml syntax:
open Camlp4.PreCast;; open Syntax;; EXTEND Gram GLOBAL: patt ; patt: LEVEL "simple" [ [ "[%"; "]" -> <:patt< lazy BatLazyList.Nil >> ] ]; patt: LEVEL "::" [ [ p1 = SELF; "%::"; p2 = SELF -> <:patt< lazy BatLazyList.Cons($p1$,$p2$) >> ] ] ; END
First, a bit of review on what Camlp4 is and does. According to its wiki page, its name stands for the Caml Pre-Processor and Pretty Printer. It has two main components (that we care about): an extensible parser and an output or pretty-printing facility. It also supports filters and transformations in the middle. You can think about the extensible parser portion of it as being a form of a parser generator like Yacc or Menhir, but it allows grammars to be extended by loading more modules. So when we use an OCaml syntax extension, we tell camlp4 to load the base OCaml grammar, load modules providing the extensions we want, and load a printer producing output that the compiler can compile. Writing a syntax extension, therefore, is a matter of writing a module which extends the base grammar with additional productions and nonterminals to realize the desired syntax.
Back to the code. The first two lines are just preamble stuff to make
the Camlp4 APIs available.
Camlp4.PreCast provides the core
facilities needed to extend the OCaml grammar in Camlp4 (I'm not sure
what all it does, or why it's called PreCast; the Gallium
tutorial provides some info).
Syntax makes OCaml syntax
elements available (such as what we're looking for --- the
production in the context-free grammar defining the OCaml syntax,
responsible for producing pattern-matching constructs).
After that comes the meat of the program. The basic operation to add
new syntax to OCaml is to “extend” the grammar (surprise, surprise).
So what we're going to do is extend the basic OCaml grammar (defined
by the module
Gram) with two new productions for the
nonterminal (new pattern matching constructs!). The
seems to tell the parser generator logic that
patt is a global
nonterminal (as opposed to a local one, which I don't fully understand
but presume is a nonterminal used in a particular extension which
won't clobber another local nonterminal by the same name in another
extension). To do this, we say
EXTEND Gram GLOBAL: patt ;
Next, we add our two productions. The first oddity is the
directive. The Camlp4 parser supports multiple “levels” of
productions for a nonterminal. These levels can be used to support
things like operator precedence (by having it try to match addition
before multiplication --- thinking top-down, that's the way you want
it). It tries all the productions in one level before trying any in
the next level. By looking at the base OCaml grammar in the
Camlp4OcamlParser file in the OCaml source tree, you can find the
nonterminals and levels where your new syntax should be inserted. In
this case, we want to add two productions in pattern matches:
[%] as the null lazy list, and recognize
%:: as the lazy
list cons operator. So we put a production
patt -> [%] on the
simple level, used for basic objects, and the
patt -> patt %:: patt production at the same level as the normal cons operator
patt: LEVEL "simple" [ [ "[%"; "]" -> (* action *) ] ] ; patt: LEVEL "::" [ [ p1 = SELF; "%::"; p2 = SELF -> (* action *) ] ] ;
The productions are written in a fairly straightforward fashion. We
write things that look like lists of lists of productions (the code is
itself parsed with a syntax extension providing extensible parser
constructs, so this works). It is effectively a list of levels, each
of which is a list of productions. We ignore that for now, and just
use a list containing a single level definition (which could contain
multiple productions). So the first one adds
[%] to the
level; it's added as two terminals because the lexer splits
]; this isn't a problem, we'll just recognize
them in sequence, but it does allow
[% (* foo *) ] to stand for lazy
nil. For other production, we allow two patterns (identified as
p2, both produced via the
patt nonterminal referenced as
SELF) separated by the terminal
Of course, with our productions we need semantic actions. The
semantic actions are on the right-hand side of the arrow (
->) in the
production definitions. What these actions have to accomplish in our
context is return the appropriate plain OCaml code for the new
constructs. We do this via another Camlp4 feature called
quotations. In this context, they let us generate new pieces of
parsed OCaml syntax (defined by the abstract syntax tree) by writing
out the OCaml code rather than constructing AST nodes. So, we use the
patt quotation, which I think tells Camlp4 that we're constructing
AST nodes to go inside a pattern match construct, and write out the
real OCaml code referencing the relevant constructors in the
[%] becomes a match against
lazy pattern match keyword to force lazy values
patt: LEVEL "simple" [ [ "[%"; "]" -> <:patt< lazy BatLazyList.Nil >> ] ]
%:: translates to a match with
$p2$ expressions substitute the matched
head and tail patterns into the abstract syntax we're generating:
patt: LEVEL "::" [ [ p1 = SELF; "%::"; p2 = SELF -> <:patt< lazy BatLazyList.Cons($p1$,$p2$) >> ] ] ;
And that's it (well, with the closing
END). Build it with the following command:
ocamlc -c -I +camlp4 -pp camlp4of pa_llist.ml
and then use it to pre-process and print a source file and see the new constructs get expanded:
camlp4 -I +camlp4 -parser o -parser op -printer o pa_llist.cmo test_file.ml
A few final notes:
- You can build the syntax extension with
camlp4.quotations.opackages are needed, along with the
- It looked to me like it should be possible to put the level inside the list when defining my new productions, but it threw a warning and didn't work. I'm not sure what's up with that.
Hopefully this helps someone break the ice with syntax extensions.
Once you get through what needs to be done, they seem to be really
quite simple to create. For further reading, see the documentation
and tutorials on the Camlp4 wiki page, as well as Martin
Jambon's tutorial based on the old Camlp4 (but still helpful
for learning the concepts), the llists package which I used to get
some idea of how to do this lazy list thing, and the files in the
camlp4/examples directory in the OCaml source distribution.
Finally finally: I'm still learning this stuff. If I've gotten something wrong, please let me know.