Twitter icon.
Random notes on software, programming and languages.
By Adrian Kuhn

How to write an Executable Grammar – II


Today, we continue the executable grammar series. We turn last time’s executable PEG parser into a DSL, that is, a domain specific language. Being executable, our parser is already half way there, so what is missing?

A DSL is a little language designed for a very narrow purpose. In contrast to general-purpose languages designed to handle arbitrary computational tasks, DSLs are specific to particular domains. There are two ways to create a DSL

  • As an external DSL: inventing syntax and semantics from scratch, and building an interpreter or compiler yourself.
  • As an internal DSL: embedded in an existing general-purpose language by adding or changing methods, operators, and default actions.

An advantage of the second approach is that you save time because you don’t have to write and debug a new language. Leaving you more time to focus on the problem confronting the end-user. Among the disadvantages is that the DSL will be constrained by the syntax and the capabilities of the underlying general-purpose language. Furthermore, building on top of another language often means that the full power of the base language is available to the end-user, which may be a plus or minus depending on the circumstances.

(above partly quoted from Artima: Creating DSLs with Ruby)

The grammar of a parser is one classic example of a DSL. Most parser generator frameworks require the grammar to be written in an external DSL using BNF notation. This article explores the other approach, building an internal DSL on top of Ruby.

Example 1: We will use a calculator with the following PEG grammar and semantics as motivating example for this article. The goal is that its definition in our DSL reads like a grammar rather than like general-purpose source code.

Sum  := Add | Sub | Prod
Add  := Prod & "+" & Sum
Sub  := Prod & "-" & Sum
Prod := Mult | Div | Atom
Mult := Atom & "*" & Prod
Div  := Atom & "/" & Prod
Atom := NUMBER | "(" & Sum & ")"

Add  = { |a, b| a + b }
Sub  = { |a, b| a - b }
Mult = { |a, b| a * b }
Div  = { |a, b| a / b }

Implementing a DSL in Ruby language typically involves

  • Operator overloading
  • Monkey patching
  • Overriding language hooks
  • Scoping and evaluation
  • Keyword parameters

Operator overloading has been covered last time already. In order to compose expressions into sequence respectively ordered-choice patterns, we have defined an ampersand and a pipe operation on the Expressions class.

Example 2: The motivation for operator overloading, and later monkey patching etcetera, is to improve the readability of a DSL. Just compare

number.or(Expression.from_string("(").and(sum).and(Expression.from_string("(")))

with the much more concise and elegant

NUMBER | "(" & Sum & ")"

One thing we can not yet do, is to use a simple string whenever an expression is expected. This is where monkey patching joins the game. A monkey patch extends or modifies existing code without touching the original sources. In any dynamic language, this is a quite natural thing to do. Even at runtime. Whereas with a static beast like Java this can not be done unless resorting to black magic. And even then, Java’s license prohibits specifically any change to the core classes. Which is also why AspectJ does not allow you to patch core classes. Other static languages are more flexible, consider for example C#’s extension methods.

However, monkey patching is not without controversy! Defining new methods on core classes like String does not scale up to multiple users. Imagine that our parser library is used in a large application. Other libraries, or even the application itself, might define the same method on String—but with different behavior! Welcome to yet another path that leads to the gates of madness. There are some solutions underway to resolve this conflict, but a silver bullet has yet to been found.

Even the term “monkey patching” in itself is not without controversy. Some consider it bad choice because it does not sound scientific enough. Others, dislike it for having been coined somewhere else.

Anyway, we take the path to madness and send our monkeys to work. In order to disguise strings as expressions, we add two helper methods to Expression and eventually extend the String class as follows

class Expression
  def to_expr
    self
  end
  def self.from_string(term)
    self.new Regexp.new("^#{Regexp.escape(term)}")
  end
end

class String
  def to_expr
    Expression.from_string(self)
  end
  def & expr
    self.to_expr & expr
  end
  def | expr
    self.to_expr | expr
  end
end

This set of changes allows strings to be uses as the left-hand side argument of any sequence or ordered-choice pattern. In order to achieve the same for right-hand side arguments, we must redefine SequenceExpression#& and OrderedChoiceExpression#| as well

class SequenceExpression
  def & expr
    @children << expr.to_expr; self
  end
end

class OrderedChoiceExpression
  def | expr
    @children << expr.to_expr; self
  end
end

Well done, monkeys!

A parser is not complete without semantics. No big deal here, we just define the class Parser to be an expression with associated lambda function

class Parser < Expression
  attr_accessor :expression, :semantics
  def initialize(expression=nil)
    @expression = expression
    @semantics = lambda { |args| args }
  end
  def private_match(input)
    match = @expression.match(input)
    return if match.nil?
    @semantics.call(*match)
  end
end

Example 3: Time for a review. Let’s put it all together, and see what we have achieved so far and what might still be missing. To keep things simple, we implement an addition only

number = Parser.new
number.expression = Expression.new /(-)?\d+/
number.semantics = lambda { |str| str.to_i }
add = Parser.new
add.expression = number & "+" & number
add.semantics = lambda { |a, x, b| a + b }
input = StringScanner.new "3 + 4"
puts add.match(input)

Yikes, how source codish! We would rather like to write something as concise as

add := number & "+" & number
add = { |a, b| a + b }

Magic to the rescue! Next on our list of DSL ingredients are overriding a language hook and scoping. The first and most important thing to be done now, is to hide all the parser creation clutter. We just want to spell out an expression’s name, and—tada, magic!—a new parser is to be created the first time and later the same parser to be returned.

We achieve this by overriding a language hook. A language hook is a special method that gets notified of language-related events. The most important language hooks in Ruby are

  • Kernel#method_missing
  • Module#const_missing
  • Module#method_added
  • Module#method_removed
  • Module#included
  • Module#extended

By default, the first two hooks raise an error and the remaining hooks do nothing. As parsing expressions are most like constants, we override const_missing to assign a newly created parsing expression whenever a constant is missing.

module Grammar
  def const_missing(symb)
    const_set symb, Parser.new
  end
end

Since Ruby constants must start with an uppercase letter, the parsing expressions of our DSL have the some limitation. We can live with that. Now, whenever an uninitialized constant name is used, the custom hook is called and assignes a newly created parsing expression to the so far uninitialized constant. If the same constant name is used later, it is on longer uninitialized but refers to the proper parsing expression. However, we do not want this magic to happen for all Ruby code, but only for our little grammar. Hence we use a module to limit the scope of our hook.

NB: it would be nice to limit the scope of monkey patching as well. Alas, this is not as simple. Method definitions, and thus monkey patches, are always applied global rather than being properly scoped.

Furthermore, we use module methods to define common expressions, such as number. Another module can later extend Grammar in order to import both constant resolution and predefined expressions.

module Grammar
  def number
    number = Parser.new
    number.expression = Expression.new /^\d+/
    number.semantics = lambda { |string| string.to_i }
    number
  end
end

And, we define concise aliases for the setters of a parser’s expression and semantics. Again, our purpose is to improve ease of writing and reading. For the same purpose, we define Grammar#=~ for expression matching and decide that Root shall be to the entry point of any grammar module.

class Parser
  def <= expr
    self.expression = expr
  end
  def do &lambda
    self.semantics = lambda
  end
end

module Grammar
  def =~ string
    self::Root.match StringScanner.new(input)
  end
end

Example 4: Sweetening things up with all that sugar, we get

module Add
  extend Grammar
  Root <= number & "+" & number
  Root.do { |a, plus, b| a + b }
end

puts Add =~ "3 + 4"

Done … or rather, almost done. There is one more issue left. The lambda expression for our addition still looks quite ugly with this superfluous plus parameter. A sequence pattern returns all its children as match. Therefore, the addition results in [3, "+", 4] instead of just [3, 4]. It would be nice to omit the plus.

But even worse, consider the calculator’s "(" & Sum & ")" expression, which we must associate with the rather pointless { |a,b,c| b } lambda procedure. Here, it would be nice to omit the parentheses.

A quick look at the complete grammar of the initial calculator reveals that for any sequence pattern, the string children are to be omitted when doing semantics. Hence we introduce an @omit instance variable in Expression and redefine SequenceExpressions#match to omit omitable children. Please refer to the full source code for details.

Now, our DSL is finally done.

Example 5: Finally, we can write the initial example almost as given

module Calculator
  extend Grammar

  Root    <= Sum
  Sum     <= Add | Sub | Prod
  Add     <= Prod & "+" & Sum
  Sub     <= Prod & "-" & Sum
  Prod    <= Mult | Div | Atom
  Mult    <= Atom & "*" & Prod
  Div     <= Atom & "/" & Prod
  Atom    <= number | Parens
  Parens  <= "(" & Sum & ")"

  Add.do    { |a, b| a + b }
  Sub.do    { |a, b| a - b }
  Mult.do   { |a, b| a * b }
  Div.do    { |a, b| a / b }

end

puts Calculator =~ "14 + 15 * 3 + 2 * (5 - 7)" # => 55

I hope to have shown that designing your own parser framework is not only fun but also rather simple. And, that there is more in DSLs than excessive abuse of keyword parameters. In the final installement of this series, I will show how to use memoization to tame the big O beast.

Please checkout the the full source code.

One Response to “How to write an Executable Grammar – II”

  1. For.example() » Blog Archive » How to write an Executable Grammar – III Says:

    [...] we finish our executable grammar series. So far, we implemented an executable PEG parser, and we turned our parser into a DSL language. What remains to be done, is taming the big O [...]

For.example is Digg proof thanks to caching by WP Super Cache!