Twitter icon.
Random notes on software, programming and languages.
By Adrian Kuhn

How to write an Executable Grammar – III


Today, we finish our executable grammar series. So far, we implemented an executable PEG parser, and turned it into a small DSL language. What remains to be done is taming the big O beast. As it is implemented now, the parser has exponential time complexity. We will use memoization to get the same parser run in linear time.

What does “running in exponential/linear time” mean? In computer science, we use the big O notation to describe how a program’s usage of resources is affected by the size of input. Given a program with an input of size n we use O( f(n) ) to denote the upper bound f(n) of its time or space performance.

  • O(kn) is exponential complexity — for example, solving the traveling salesman problem with dynamic programming has exponential time performance.
  • O(n2) is quadratic complexity — for example, sorting a list of n elements with bubble sort has quadratic time performance.
  • O(n log n) is linearithmic complexity — eg, sorting a list of n elements with heapsort has linearithmic (also called, en-log-en) time performance.
  • O(n) is linear complexity — eg, finding an item in an unsorted list of n elements has linear time performance.
  • O(log n) is logarithmic complexity — eg, finding an item in an sorted list with binary search has logarithmic time performance.
  • O(1) is constant complexity — eg, doing a look-up in a hash table with n entries, or, accessing an element by index in an array list.

Now, lets have a look at our current PEG implementation.

Example 1: given last time’s calculator grammar, the following expression with an input of 25 characters results in 1257 matches against the input string.

Calculator =~ "14 + 15 * 3 + 2 * (5 - 7)"

Example 2: Now, let’s consider an input of twice the size and see what happens. Given the following expression with 50 characters we observe a total of 12082 matches against the input string. That is, twice the length of input results in 10 times the number of matches.

Calculator =~ "14 + 15 * 3 + 2 * (5 - 7 + 14 + 15 * 23 / (5 - 7))"

Given an input string of length n the parser runs with O(kn) time performance, that is exponential complexity. Why? PEG parsers have backtracking through the ordered choice pattern, and unlimited look ahead through the predicate patterns. Thus it may happen that the same expressions are applied multiple times at the same position of input. The further in the input, the more times. Considering our second example, alone at position 47, the number expression is matched 730 times!

Given such terrible time complexity, executable grammers are barely usable in practice. Luckily for us, the matching functions of a PEG parser have no side-effects, and hence, we can use memoization to tame the big O beast.

PEG implementations that use memoization are known as packrat parser.

Memoization is an optimization technique to speed up computer programs. It “remembers” the result of function calls for a given input. Subsequent calls with the same input return the remembered result rather than recalculating. This is a classic trade-off, we gain speed by using more space.

As a limitation, memoization works only if the function can be replaced with its value without changing the program. That is, given the same input, the function must always return the same result. In the jargon of computer science, this is called referential transparency.

Matching a PEG expression against the input is referentially transparent. The input of the matching function is given by the tuple (expression, position) and the result is either a tuple (match, position + n) or failure. As both input string and grammar are immutable, calling the match function with the same input always results in the value. Hence, we go forward an implement

class Expression
  alias :calculate_match :match
  def match(input)
    input.memoize_match(self)
  end
end

class StringScanner
  def bucket
    @memo = [] if @memo.nil?
    bucket = @memo[self.pos]
    @memo[self.pos] = bucket = {} if bucket.nil?
    return bucket
  end
  def memoize_match(expr)
    bucket = self.bucket
    unless bucket.include? expr
      bucket[expr] = nil, self.pos // break recursion!
      match = expr.calculate_match(self)
      bucket[expr] = match, self.pos
    else
      match, self.pos = bucket[expr]
    end
    return match
  end
end

Running the two examples again, we observe 119 matches against the input of 25 characters and 218 matches against the input of 50 characters. Thus, double the inputs results in double the number of matches. Thus, our PEG parser is indeed running with linear complexity.

Well done!

However, the price we pay is linear space complexity with a rather large constant. Why? Memoization remembers all result values for all matches. Using memoization we gain speed by using more space. No black magic without sacrifice.

There are other implementations of PEG grammars that run in linear time without that trade-off. However, they are not based on executable grammers but rather employ a dedicated virtual machine! I might cover that in a future post.

Comments are closed.

For.example is Digg proof thanks to caching by WP Super Cache!