How to write an Executable Grammar
There has been quite some noise about executable grammars recently. Executable grammars are also known as parser combinators. In today’s installment I will show you how to write your own executable grammar. You will see that it is as simple as using a parser generator like Yacc and Antlr. Actually, I’d say it is even simpler. Me myself, I never figured out how to use these parser generator beasts. They are just too arcane. Yet, I wrote my own executable grammar in less than 3 hours.
In a first step, we gonna write an executable grammar for the expression
raining & ( dogs | cats ) *
There we go.
The basic idea of an executable grammar is to build a parser’s grammar as a tree of objects that know how to parse their grammar production. How? Just open the famous Design Pattern book on page 243, and the recipe is at your fingertips: the Interpreter pattern.

We will step through that pattern today, implementing a scanner-free parsing expression grammar. But there is more magic around the corner. In a follow up we will expose our parser combinators as a DSL, and eventually, use the black art of memoization to turn them into a full-fledged packrat parser running in linear time.
Today’s language of choice is Ruby, so we start with:
class Expression
def initialize(regexp)
@regexp = regexp
end
def match(input)
input.scan @regexp
end
end
The class Expression is the basic building block of our executable grammar. Each instance of Expression knows how to parse itself, and is thus a little parser of its own. So far, expressions are nothing more than glorified regular expressions. Later, we will add subclasses to match sequence, alternation and repetition patterns.
We must require ’strscan’ as we are using StringScanner#scan to match regular expressions. StringScanner keeps a scan pointer. When scanning for a regular expression, the match must occur at the character after the scan pointer. If a regular expression matches, the scan pointer moves just beyond the last character of the match, ready to scan again from the next character onwards. In other words, StringScanner consumes the input.
require 'strscan'
Example 1: the following code will successfully match the first token of our sample input.
input = StringScanner.new "raining cats cats dogs cats" grammar = Expression.new /raining/ p grammar.match(input)
So far our expressions are quite useless for real parsing. A real parser must not only consume its input on success, but also backtrack on failure. Not to mention such fancy magic as whitespace skipping.
In order to backtrack on failure, we are going to rewrite Expression#match. This is done by saving the input’s position before attempting to perform a match, and resetting it upon failure. Hence, we rewrite the match method as follows
save = input.pos match = input.scan @regexp input.pos = save if match.nil? return match
As we are later going to redefine match in each subclass, it would be quite painstaking to repeat this boilerplate code all the time. Patterns to the rescue! There is a nice pattern to be applied here, the Template Method pattern (GOF p 325). We delegate the actual matching to a template method called private_match, which subclasses may override to redefine the actual matching.
class Expression
def match(input)
save = input.pos
input.skip_whitespace
match = self.private_match(input)
input.pos = save if match.nil?
return match
end
def private_match(input)
input.scan @regexp
end
end
Above, we are even getting the whitespace magic done! Before any attempt to perform private_match, leading whitespace if skipped with a call to StringScanner#skip_whitespace. As StringScanner does not offer such a method by default, we extend the class dynamically as follows:
class StringScanner
def skip_whitespace
self.skip /\s+/
end
end
Next, we are going to subclasses Expression to match sequence, ordered-choice and repetition patterns. This might sound scarier than it actually is. We just follow the recipe of the Interpreter pattern (GOF p 243), adding a new subclass for each kind of pattern.
SequenceExpression matches each of its child-expressions in sequence. If any of the child-expressions fails, the whole expression fails and backtracks.
class SequenceExpression < Expression
def initialize
@children = []
end
def & expr
@children << expr; self
end
def private_match(input)
@children.collect { |expr|
match = expr.match input
return if match.nil?
match
}
end
end
OrderedChoiceExpression will return the first successful match among its child-expressions. If none of the child-expressions succeeds, the whole expression fails.
NB: for those familiar with context free grammars, there is a gotcha here. The ordered-choice pattern of PEG grammars differs from a CFG’s alternation pattern. Given a series of alternatives, PEG will take the first match, where as CFG will walk both paths and choose the fullest match. For a CFG, E = (ab | a) and F = (a | ab) are the same. For a PEG they differ: both branches of E possibly match, whereas the second branch of F will never match since the first branch is always taken.
class OrderedChoiceExpression < Expression
def initialize
@children = []
end
def | expr
@children << expr; self
end
def private_match(input)
@children.each { |expr|
match = expr.match input
return match if not match.nil?
}
return nil
end
end
RepetitionExpression tries to match its child-expression as many times as possible. Even if the child-expression does not match at all, the whole expression succeeds.
class RepetitionExpression < Expression
def initialize(expression)
@child = expression
end
def private_match(input)
matches = []
while match = @child.match(input)
matches << match
end
return matches
end
end
Eventually, we apply the Factory Method pattern (GOF p 107) to add support for these patterns to the base class.
class Expression
def & expr
SequenceExpression.new & self & expr
end
def | expr
OrderedChoiceExpression.new | self | expr
end
def star
RepetitionExpression.new self
end
end
Example 2: the following code will successfully match the entire grammar that has initially been given as an example. First, we create expressions for the terminals “raining”, “cats”, and “dogs”. Then, we combine them using our little factory methods, thus creating a full parser.
input = StringScanner.new("raining cats cats dogs cats")
raining = Expression.new /raining/
cats = Expression.new /cats/
dogs = Expression.new /dogs/
grammar = raining & (dogs | cats).star
p grammar.match(input)
And done!
Well, not completely done. In order to implement a full parsing expression grammar we must add four additional subclasses: OneOrMoreExpression, ZeroOrOneExpression, AndPredicate, and NotPredicate. Their implementation follows the same schema as RepetitionExpression. For details, please refer to the full source code of today’s installment.
April 24th, 2008 at 16:17
Instead of using some lib to start in the beginning you could also just add a ‘^’ to input-regexp and use that… Something like:
which would then do something like:
def initialize(reg) @reg = Regexp.new("^#{reg}") endApril 24th, 2008 at 19:47
Had this in an earlier version. Alas, regular expressions apply on whole strings only. There is not way to say “starting at”. Which led to quite some ugly boilerplate code. Consume input, advance scan position, ecetera.
NB strscan belongs to Ruby’s corelib.
May 1st, 2008 at 08:36
[...] we continue the executable grammar series. We turn last time’s executable PEG parser into a DSL, that is, a domain specific language. Being executable, our parser is already half way [...]
May 13th, 2008 at 19:49
[...] we finish our executable grammar series. So far, we implemented an executable PEG parser, and we turned our parser into a DSL language. What remains to be done, is taming the big O [...]
June 3rd, 2008 at 15:19
I found the API rather confusing, since matching may return (i) a string, (ii) a list of strings, (iii) nil, (iv) a boolean.
I rewrote the examples in Squeak so that matching returns either (i) a list of strings, or (ii) nil.
Otherwise the translation was pretty straightforward. See: http://www.squeaksource.com/PEG.html
June 3rd, 2008 at 15:25
Another thing that I found confusing is the use of regexes *within* the expressions. Surely the point is that the executable grammars subsume regexes and all such. In fact, the only place you rely on regexes is to eat whitespace, but you could do this with another parser object …