Twitter icon.
Random notes on software, programming and languages.
By Adrian Kuhn

How to write an Executable Grammar


There has been quite some noise about executable grammars recently. Executable grammars are also known as parser combinators. In today’s installment I will show you how to write your own executable grammar. You will see that it is as simple as using a parser generator like Yacc and Antlr. Actually, I’d say it is even simpler. Me myself, I never figured out how to use these parser generator beasts. They are just too arcane. Yet, I wrote my own executable grammar in less than 3 hours.

In a first step, we gonna write an executable grammar for the expression

raining & ( dogs | cats ) *

There we go.

The basic idea of an executable grammar is to build a parser’s grammar as a tree of objects that know how to parse their grammar production. How? Just open the famous Design Pattern book on page 243, and the recipe is at your fingertips: the Interpreter pattern.

Interpreter pattern.

We will step through that pattern today, implementing a scanner-free parsing expression grammar. But there is more magic around the corner. In a follow up we will expose our parser combinators as a DSL, and eventually, use the black art of memoization to turn them into a full-fledged packrat parser running in linear time.

Today’s language of choice is Ruby, so we start with:

class Expression
  def initialize(regexp)
    @regexp = regexp
  end
  def match(input)
    input.scan @regexp
  end
end

The class Expression is the basic building block of our executable grammar. Each instance of Expression knows how to parse itself, and is thus a little parser of its own. So far, expressions are nothing more than glorified regular expressions. Later, we will add subclasses to match sequence, alternation and repetition patterns.

We must require ’strscan’ as we are using StringScanner#scan to match regular expressions. StringScanner keeps a scan pointer. When scanning for a regular expression, the match must occur at the character after the scan pointer. If a regular expression matches, the scan pointer moves just beyond the last character of the match, ready to scan again from the next character onwards. In other words, StringScanner consumes the input.

require 'strscan'

Example 1: the following code will successfully match the first token of our sample input.

input = StringScanner.new "raining cats cats dogs cats"
grammar = Expression.new /raining/
p grammar.match(input)

So far our expressions are quite useless for real parsing. A real parser must not only consume its input on success, but also backtrack on failure. Not to mention such fancy magic as whitespace skipping.

In order to backtrack on failure, we are going to rewrite Expression#match. This is done by saving the input’s position before attempting to perform a match, and resetting it upon failure. Hence, we rewrite the match method as follows

  save = input.pos
  match = input.scan @regexp
  input.pos = save if match.nil?
  return match

As we are later going to redefine match in each subclass, it would be quite painstaking to repeat this boilerplate code all the time. Patterns to the rescue! There is a nice pattern to be applied here, the Template Method pattern (GOF p 325). We delegate the actual matching to a template method called private_match, which subclasses may override to redefine the actual matching.

class Expression
  def match(input)
    save = input.pos
    input.skip_whitespace
    match = self.private_match(input)
    input.pos = save if match.nil?
    return match
  end
  def private_match(input)
    input.scan @regexp
  end
end

Above, we are even getting the whitespace magic done! Before any attempt to perform private_match, leading whitespace if skipped with a call to StringScanner#skip_whitespace. As StringScanner does not offer such a method by default, we extend the class dynamically as follows:

class StringScanner
  def skip_whitespace
    self.skip /\s+/
  end
end

Next, we are going to subclasses Expression to match sequence, ordered-choice and repetition patterns. This might sound scarier than it actually is. We just follow the recipe of the Interpreter pattern (GOF p 243), adding a new subclass for each kind of pattern.

SequenceExpression matches each of its child-expressions in sequence. If any of the child-expressions fails, the whole expression fails and backtracks.

class SequenceExpression < Expression
  def initialize
    @children = []
  end
  def & expr
    @children << expr; self
  end
  def private_match(input)
    @children.collect { |expr|
      match = expr.match input
      return if match.nil?
      match
    }
  end
end

OrderedChoiceExpression will return the first successful match among its child-expressions. If none of the child-expressions succeeds, the whole expression fails.

NB: for those familiar with context free grammars, there is a gotcha here. The ordered-choice pattern of PEG grammars differs from a CFG’s alternation pattern. Given a series of alternatives, PEG will take the first match, where as CFG will walk both paths and choose the fullest match. For a CFG, E = (ab | a) and F = (a | ab) are the same. For a PEG they differ: both branches of E possibly match, whereas the second branch of F will never match since the first branch is always taken.

class OrderedChoiceExpression < Expression
  def initialize
    @children = []
  end
  def | expr
    @children << expr; self
  end
  def private_match(input)
    @children.each { |expr|
      match = expr.match input
      return match if not match.nil?
    }
    return nil
  end
end

RepetitionExpression tries to match its child-expression as many times as possible. Even if the child-expression does not match at all, the whole expression succeeds.

class RepetitionExpression < Expression
  def initialize(expression)
    @child = expression
  end
  def private_match(input)
    matches = []
    while match = @child.match(input)
      matches << match
    end
    return matches
  end
end

Eventually, we apply the Factory Method pattern (GOF p 107) to add support for these patterns to the base class.

class Expression
  def & expr
    SequenceExpression.new & self & expr
  end
  def | expr
    OrderedChoiceExpression.new | self | expr
  end
  def star
    RepetitionExpression.new self
  end
end

Example 2: the following code will successfully match the entire grammar that has initially been given as an example. First, we create expressions for the terminals “raining”, “cats”, and “dogs”. Then, we combine them using our little factory methods, thus creating a full parser.

input = StringScanner.new("raining cats cats dogs cats")
raining = Expression.new /raining/
cats = Expression.new /cats/
dogs = Expression.new /dogs/
grammar = raining & (dogs | cats).star
p grammar.match(input)

And done!

Well, not completely done. In order to implement a full parsing expression grammar we must add four additional subclasses: OneOrMoreExpression, ZeroOrOneExpression, AndPredicate, and NotPredicate. Their implementation follows the same schema as RepetitionExpression. For details, please refer to the full source code of today’s installment.

6 Responses to “How to write an Executable Grammar”

  1. Toon Says:

    Instead of using some lib to start in the beginning you could also just add a ‘^’ to input-regexp and use that… Something like:

    Expression new "Cats"
    

    which would then do something like:

    def initialize(reg)
      @reg = Regexp.new("^#{reg}")
    end
    
  2. akuhn Says:

    Had this in an earlier version. Alas, regular expressions apply on whole strings only. There is not way to say “starting at”. Which led to quite some ugly boilerplate code. Consume input, advance scan position, ecetera.

    NB strscan belongs to Ruby’s corelib.

  3. For.example() » Blog Archive » How to write an Executable Grammar – II Says:

    [...] we continue the executable grammar series. We turn last time’s executable PEG parser into a DSL, that is, a domain specific language. Being executable, our parser is already half way [...]

  4. For.example() » Blog Archive » How to write an Executable Grammar – III Says:

    [...] we finish our executable grammar series. So far, we implemented an executable PEG parser, and we turned our parser into a DSL language. What remains to be done, is taming the big O [...]

  5. Oscar Says:

    I found the API rather confusing, since matching may return (i) a string, (ii) a list of strings, (iii) nil, (iv) a boolean.
    I rewrote the examples in Squeak so that matching returns either (i) a list of strings, or (ii) nil.
    Otherwise the translation was pretty straightforward. See: http://www.squeaksource.com/PEG.html

  6. Oscar Says:

    Another thing that I found confusing is the use of regexes *within* the expressions. Surely the point is that the executable grammars subsume regexes and all such. In fact, the only place you rely on regexes is to eat whitespace, but you could do this with another parser object …

For.example is Digg proof thanks to caching by WP Super Cache!